I'm just starting to learn Spark, so I'm following along with the book.
1. First, on Ubuntu, log in as the user that has Spark permissions.
# In my case, the username is hadoop
su hadoop
# You will be prompted for the password here
# Switch to the spark directory
cd /home/hadoop/spark

2. Start the Python environment
./bin/pyspark

3. Let's walk through an example; the data is available via the Baidu Cloud link at the bottom.
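Note that the interactive shell already creates a SparkContext bound to the name sc at startup, and Spark allows only one active context at a time; since the script below constructs its own context, you may need to stop the shell's context first. A quick sanity check, under that assumption:

sc.parallelize([1, 2, 3]).count()  # should return 3 if the shell is working
sc.stop()  # release the shell's default context before the script creates its own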
# -*- coding:utf-8 -*-
from pyspark import SparkContext

# Create a SparkContext with 2 local threads, named "First Spark App"
sc = SparkContext("local[2]", "First Spark App")

# Read the data (placed under spark/data) and split each CSV line
# into a (user, product, price) tuple
data = sc.textFile("data/UserPurchaseHistory.csv") \
    .map(lambda line: line.split(",")) \
    .map(lambda record: (record[0], record[1], record[2]))

# Total number of purchases
numPurchases = data.count()

# Number of distinct users who made a purchase
uniqueUsers = data.map(lambda record: record[0]).distinct().count()

# Total revenue
totalRevenue = data.map(lambda record: float(record[2])).sum()

# Best-selling product
products = data.map(lambda record: (record[1], 1.0)) \
    .reduceByKey(lambda a, b: a + b).collect()
mostPopular = sorted(products, key=lambda x: x[1], reverse=True)[0]

# Print the results
print("Total purchases: %d" % numPurchases)
print("Unique users: %d" % uniqueUsers)
print("Total revenue: %2.2f" % totalRevenue)
print("Most popular product: %s with %d purchases" % (mostPopular[0], mostPopular[1]))

Result:

Total purchases: 5
Unique users: 4
Total revenue: 39.91
Most popular product: iphone Cover with 2 purchases
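For reference, each line of UserPurchaseHistory.csv is assumed to be user,product,price. The real file comes from the download link below; a hypothetical five-row file that would reproduce the output above looks like:

John,iphone Cover,9.99
John,Headphones,5.49
Jack,iphone Cover,9.99
Jill,Samsung Galaxy Cover,8.95
Bob,iPad Cover,5.49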
4. Stop the computation
sc.stop()

For batch execution there is no need to enter the Python shell; work directly from the spark directory: put the Python script above into the spark directory and run:
/home/hadoop/spark/bin/spark-submit pythonapp.py
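A minimal sketch of what pythonapp.py contains, essentially the script from step 3 saved as a file with the stop call from step 4 appended so the submitted job cleans up its own context (the exact contents here are an assumption based on the steps above):

# pythonapp.py -- sketch: same analysis as step 3, run non-interactively
from pyspark import SparkContext

sc = SparkContext("local[2]", "First Spark App")
data = sc.textFile("data/UserPurchaseHistory.csv").map(lambda line: line.split(","))
print("Total purchases: %d" % data.count())
print("Total revenue: %2.2f" % data.map(lambda r: float(r[2])).sum())
sc.stop()  # a submitted script must stop its own context when done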
In practice, batch execution is clearly the more convenient way to run things, especially for larger programs.
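Incidentally, spark-submit also accepts a --master flag, so the execution target can be chosen at submit time; note this only takes effect if the script does not hard-code the master, e.g. by creating its context as SparkContext(appName="First Spark App"). An illustrative invocation (local[4] is just an example value):

/home/hadoop/spark/bin/spark-submit --master local[4] pythonapp.py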
Code and data: