

Spark | Machine Learning with Spark, Chapter 3: Obtaining, Processing, and Preparing Data

2019-11-08 03:11:04
Contributed by: a site user

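These are notes from working through the book Machine Learning with Spark to learn how to do machine learning on Spark.

Note: the data set is MovieLens movie-rating data. Download link: http://files.grouplens.org/datasets/movielens/ml-100k.zip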

The data set contains three files: user attributes, movie metadata, and the users' ratings of movies.

1. Unzip the data into a directory and change into it

unzip ml-100k.zip
cd ml-100k

2. Inspect the three files

Users: [screenshot not included]

Movies: [screenshot not included]

Ratings: [screenshot not included]

3. Start Python and analyze the data

Start the PySpark shell:

/home/hadoop/spark/bin/pyspark

4. Read the data

# The PySpark shell already provides a SparkContext named sc
from pyspark import SparkContext
user_data = sc.textFile("u.user")
user_data.first()

u'1|24|M|technician|85711'

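5. Basic analysis

# Split each line on "|" into a list of fields
user_fields = user_data.map(lambda line: line.split("|"))
# Number of users
num_users = user_fields.map(lambda fields: fields[0]).count()
# Number of distinct genders
num_genders = user_fields.map(lambda fields: fields[2]).distinct().count()
# Number of distinct occupations
num_occupations = user_fields.map(lambda fields: fields[3]).distinct().count()
# Number of distinct ZIP codes
num_zipcodes = user_fields.map(lambda fields: fields[4]).distinct().count()
# Print the results
print("Users:%d ,genders:%d ,occupations:%d ,ZIP codes:%d" % (num_users, num_genders, num_occupations, num_zipcodes))

Users:943 ,genders:2 ,occupations:21 ,ZIP codes:795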
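To make the field indexing concrete, here is a minimal plain-Python sketch of the same split, distinct, and count steps, without Spark. The first record is the one returned by user_data.first(); the other three are illustrative records in the same format, so the counts below describe only this tiny sample.

```python
# Plain-Python sketch of the split / distinct / count steps (no Spark needed).
# Only the first record is taken from the output above; the rest are illustrative.
sample_lines = [
    "1|24|M|technician|85711",
    "2|53|F|other|94043",       # illustrative
    "3|23|M|writer|32067",      # illustrative
    "4|24|M|technician|43537",  # illustrative
]

# Equivalent of user_data.map(lambda line: line.split("|"))
user_fields = [line.split("|") for line in sample_lines]

# count() counts records; distinct().count() counts unique values
num_users = len(user_fields)
num_genders = len(set(f[2] for f in user_fields))
num_occupations = len(set(f[3] for f in user_fields))
num_zipcodes = len(set(f[4] for f in user_fields))

print("Users:%d ,genders:%d ,occupations:%d ,ZIP codes:%d"
      % (num_users, num_genders, num_occupations, num_zipcodes))
```

Field 0 is the user id, 1 the age, 2 the gender, 3 the occupation, and 4 the ZIP code, which is why the Spark code indexes fields[2], fields[3], and fields[4].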

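6. Plotting

Plot a histogram of the age attribute. Plots cannot be displayed in a plain terminal, so only the code is given here.

import matplotlib.pyplot as plt
ages = user_fields.map(lambda x: int(x[1])).collect()
# density=True replaces normed=True, which was removed in matplotlib 3.1
plt.hist(ages, bins=20, color='lightblue', density=True)
fig = plt.gcf()
fig.set_size_inches(16, 10)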

7. Count the occupations and the number of users in each

import numpy as np
# Emit (occupation, 1) pairs and sum them per key to get a count per occupation
count_by_occupation = user_fields.map(lambda fields: (fields[3], 1)).reduceByKey(lambda x, y: x + y).collect()
x_axis1 = np.array([c[0] for c in count_by_occupation])
y_axis1 = np.array([c[1] for c in count_by_occupation])

Print the results:

print(x_axis1)

[u'administrator' u'retired' u'lawyer' u'none' u'student' u'technician' u'programmer' u'salesman' u'homemaker' u'executive' u'doctor' u'entertainment' u'marketing' u'writer' u'scientist' u'educator' u'healthcare' u'librarian' u'artist' u'other' u'engineer']

print(y_axis1)

[ 79 14 12 9 196 27 66 12 7 32 7 18 26 45 31 95 16 51 28 105 67]

Sort both arrays by count, in ascending order:

x_axis = x_axis1[np.argsort(y_axis1)]
y_axis = y_axis1[np.argsort(y_axis1)]
y_axis

array([ 7, 7, 9, 12, 12, 14, 16, 18, 26, 27, 28, 31, 32, 45, 51, 66, 67, 79, 95, 105, 196])

np.argsort() returns the indices that would sort the array in ascending order.
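As a rough illustration of what np.argsort() does, the following plain-Python sketch computes the ascending-order indices for the first five counts printed above and uses them to reorder the labels and counts together. The helper name argsort is hypothetical; it only mirrors the NumPy function for this simple one-dimensional case.

```python
# Labels and counts: the first five entries printed above
labels = ["administrator", "retired", "lawyer", "none", "student"]
counts = [79, 14, 12, 9, 196]


def argsort(values):
    """Return the indices that sort `values` in ascending order."""
    return sorted(range(len(values)), key=lambda i: values[i])


order = argsort(counts)
# Applying the same index order to both lists keeps labels and counts aligned
sorted_labels = [labels[i] for i in order]
sorted_counts = [counts[i] for i in order]

print(order)
print(sorted_labels)
print(sorted_counts)
```

Indexing both arrays with the same index order is what keeps each occupation label attached to its count after sorting.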

Plot:

import matplotlib.pyplot as plt
pos = np.arange(len(x_axis))
width = 1.0
ax = plt.axes()
ax.set_xticks(pos + (width / 2))
ax.set_xticklabels(x_axis)
plt.bar(pos, y_axis, width, color='lightblue')
plt.xticks(rotation=30)
fig = plt.gcf()
fig.set_size_inches(16, 10)

Two ways to count how many times each value occurs. Note that count_by_occupation2 comes from countByValue, while count_by_occupation comes from the map and reduceByKey step above, so the labels are printed accordingly:

count_by_occupation2 = user_fields.map(lambda fields: fields[3]).countByValue()
print("countByValue approach:")
print(dict(count_by_occupation2))
print("")
print("Map-reduce approach:")
print(dict(count_by_occupation))

countByValue approach:
{u'administrator': 79, u'retired': 14, u'lawyer': 12, u'healthcare': 16, u'marketing': 26, u'executive': 32, u'scientist': 31, u'student': 196, u'technician': 27, u'librarian': 51, u'programmer': 66, u'salesman': 12, u'homemaker': 7, u'engineer': 67, u'none': 9, u'doctor': 7, u'writer': 45, u'entertainment': 18, u'other': 105, u'educator': 95, u'artist': 28}

Map-reduce approach:
{u'administrator': 79, u'writer': 45, u'retired': 14, u'lawyer': 12, u'doctor': 7, u'marketing': 26, u'executive': 32, u'none': 9, u'entertainment': 18, u'healthcare': 16, u'scientist': 31, u'student': 196, u'educator': 95, u'technician': 27, u'librarian': 51, u'programmer': 66, u'artist': 28, u'salesman': 12, u'other': 105, u'homemaker': 7, u'engineer': 67}
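Both approaches produce the same counts; only the key order of the printed dictionaries differs. The sketch below shows the idea in plain Python on a small illustrative list of occupations (not the real data): one version sums (key, 1) pairs in the spirit of reduceByKey, the other counts values directly in the spirit of countByValue.

```python
from collections import Counter

# Illustrative occupations, not taken from the data set
occupations = ["student", "writer", "student", "engineer", "writer", "student"]

# Map-reduce style: emit (occupation, 1) pairs, then sum the 1s per key
pairs = [(occ, 1) for occ in occupations]
by_reduce = {}
for key, one in pairs:
    by_reduce[key] = by_reduce.get(key, 0) + one

# countByValue style: count each distinct value directly
by_value = Counter(occupations)

print(by_reduce)
print(dict(by_value))
print(by_reduce == dict(by_value))
```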

Exploring the movie data: parsing the features of the movie catalog

Reading and inspecting the data

Read the data:

movie_data = sc.textFile("u.item")

Inspect the data:

# First line
print(movie_data.first())

1|Toy Story (1995)|01-Jan-1995||http://us.imdb.com/M/title-exact?Toy%20Story%20(1995)|0|0|0|1|1|1|0|0|0|0|0|0|0|0|0|0|0|0|0

Total number of movies:

num_movies = movie_data.count()
print("Movies:%d" % num_movies)

Movies:1682

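Process the movie release dates. Some records have a missing or malformed date, so define a function that maps those to the placeholder year 1900.

def convert_year(x):
    try:
        return int(x[-4:])
    except ValueError:
        # Missing or malformed date: use 1900 as a placeholder year
        return 1900

The third field is the release date, in the format 01-Jan-1995; the slice [-4:] extracts the year.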
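A quick standalone check of this behaviour, using the date format shown above and an empty field. The function is repeated here so the snippet runs on its own; the short list of dates is illustrative.

```python
def convert_year(x):
    """Return the 4-digit year at the end of x, or 1900 if it cannot be parsed."""
    try:
        return int(x[-4:])
    except ValueError:
        return 1900


print(convert_year("01-Jan-1995"))
print(convert_year(""))

# Illustrative list: two valid dates and one missing value
dates = ["01-Jan-1995", "", "01-Jan-1995"]
years = [convert_year(d) for d in dates]

# Drop the 1900 placeholders, then compute ages relative to 1998
ages = [1998 - yr for yr in years if yr != 1900]
print(ages)
```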

Process the data:

movie_fields = movie_data.map(lambda lines: lines.split("|"))
years = movie_fields.map(lambda fields: fields[2]).map(lambda x: convert_year(x))

Filter out the entries that were set to 1900:

years_filtered = years.filter(lambda x: x != 1900)

Compute the age of each movie. The data set dates from 1998, so a movie's age is 1998 minus its release year.

movie_ages = years_filtered.map(lambda yr: 1998 - yr).countByValue()
values = list(movie_ages.values())
bins = list(movie_ages.keys())
print(values)
print(bins)

[65, 286, 355, 219, 214, 126, 37, 22, 24, 15, 11, 13, 15, 7, 8, 5, 13, 12, 8, 9, 4, 4, 5, 6, 8, 4, 3, 7, 3, 4, 6, 5, 2, 5, 2, 6, 5, 3, 5, 4, 9, 8, 4, 5, 7, 2, 3, 5, 7, 4, 3, 5, 5, 4, 5, 4, 2, 5, 8, 7, 3, 4, 2, 4, 4, 2, 1, 1, 1, 1, 1]

[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 72, 76]

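Plot:

import matplotlib.pyplot as plt
# Bin edges must be increasing; density=True replaces the removed normed=True
plt.hist(values, bins=sorted(bins), color='lightblue', density=True)
fig = plt.gcf()
fig.set_size_inches(16, 10)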

Exploring the rating data

Read the data and count the records:

# Rating data
rating_data = sc.textFile("u.data")
print(rating_data.first())
num_ratings = rating_data.count()
print("Ratings: %d" % num_ratings)

196 242 3 881250949
Ratings: 100000

There are 100,000 records in total.

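Some basic statistics:

# Split each line on the tab character
rating_data1 = rating_data.map(lambda line: line.split("\t"))
# Ratings
ratings = rating_data1.map(lambda fields: int(fields[2]))
# Highest rating
max_rating = ratings.reduce(lambda x, y: max(x, y))
# Lowest rating
min_rating = ratings.reduce(lambda x, y: min(x, y))
# Mean rating; float() avoids integer division under Python 2
mean_rating = ratings.reduce(lambda x, y: x + y) / float(num_ratings)
# Median rating
median_rating = np.median(ratings.collect())
# Average number of ratings per user
ratings_per_user = num_ratings / float(num_users)
# Average number of ratings per movie
ratings_per_movie = num_ratings / float(num_movies)
print("Min ratings: %d" % min_rating)
print("Max rating: %d" % max_rating)
print("Average rating: %2.2f" % mean_rating)
print("Median rating: %d" % median_rating)
print("Average # of rating per user:%2.2f" % ratings_per_user)
print("Average # of ratings per movie: %2.2f" % ratings_per_movie)

Min ratings: 1
Max rating: 5
Average rating: 3.53
Median rating: 4
Average # of rating per user:106.04
Average # of ratings per movie: 59.45

The original run divided integers under Python 2 and therefore printed 3.00, 106.00, and 59.00 for the three averages; the values above follow from 100000 ratings, 943 users, 1682 movies, and the mean of 3.52986 reported by stats().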

The stats() function provides similar summary statistics in a single call:

ratings.stats()

(count: 100000, mean: 3.52986, stdev: 1.12566797076, max: 5.0, min: 1.0)
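For intuition, the same kind of summary can be sketched in plain Python. The ratings below are illustrative values on the 1 to 5 scale, not taken from u.data, and the use of the population standard deviation (statistics.pstdev) is an assumption about what stats() reports. The sketch also shows how integer (floor) division silently truncates a mean, which is the pitfall to watch for when dividing two integers under Python 2.

```python
import statistics

# Illustrative ratings on the 1-5 scale, not taken from u.data
ratings = [3, 3, 1, 2, 1, 4, 2, 5, 3, 3]

count = len(ratings)
total = sum(ratings)

# Floor division mimics what Python 2 does for int / int
truncated_mean = total // count
# Converting one operand to float gives the true mean
mean = total / float(count)

# Population standard deviation (assumed to match what stats() reports)
stdev = statistics.pstdev(ratings)

print("count: %d, mean: %.2f, stdev: %.4f, max: %d, min: %d"
      % (count, mean, stdev, max(ratings), min(ratings)))
print("truncated mean: %d" % truncated_mean)
```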

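Count the number of ratings per user:

# Use the split records (rating_data1), not the raw lines in rating_data
user_ratings_grouped = rating_data1.map(lambda fields: (int(fields[0]), int(fields[2]))).groupByKey()
# lambda (k, v): ... is Python 2-only syntax, so index the pair instead
user_ratings_byuser = user_ratings_grouped.map(lambda kv: (kv[0], len(kv[1])))
user_ratings_byuser.take(5)

The original version of this snippet raised an error under Spark 2.1, which the author left for later investigation. Likely causes are that it mapped over the unsplit lines and used tuple-unpacking lambdas, which Python 3 does not accept.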
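The grouping step can be sketched in plain Python as follows. The first line is the record shown earlier, written with the tab separators that the split step implies; the other lines are illustrative. The sketch splits each line on the tab character before indexing, because indexing the raw string would return single characters rather than fields.

```python
from collections import defaultdict

# First line: the record shown earlier; the rest are illustrative
lines = [
    "196\t242\t3\t881250949",
    "186\t302\t3\t891717742",
    "196\t393\t4\t881251863",
    "22\t377\t1\t878887116",
]

# Split on the tab character first, so fields[0] is the user id
# and fields[2] is the rating
records = [line.split("\t") for line in lines]

# groupByKey equivalent: collect the ratings given by each user
grouped = defaultdict(list)
for fields in records:
    grouped[int(fields[0])].append(int(fields[2]))

# Number of ratings per user
ratings_by_user = {user: len(vals) for user, vals in grouped.items()}
print(sorted(ratings_by_user.items()))
```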

Plot:

import matplotlib.pyplot as plt
user_ratings_byuser_local = user_ratings_byuser.map(lambda kv: kv[1]).collect()
# density=True replaces normed=True, which was removed in matplotlib 3.1
plt.hist(user_ratings_byuser_local, bins=200, color='lightblue', density=True)
fig = plt.gcf()
fig.set_size_inches(16, 10)