Analyzing CDN logs with the pandas library in Python: a detailed walkthrough

2019-11-25 16:19:43

Source: reposted

Contributed by: a site user

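Preface

I recently had a requirement at work to extract some data from CDN logs: traffic and status code statistics, plus the top IPs, URLs, UAs, Referers and so on. I used to do this with bash shell scripts, but once the logs get large (files of several GB, with tens or hundreds of millions of lines), shell processing struggles and takes too long. So I looked into pandas, Python's data processing library. With it, processing ten million log lines takes about 40 s.

Code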

#!/usr/bin/python
# -*- coding: utf-8 -*-
# sudo pip install pandas
__author__ = 'Loya Chen'

import sys
import pandas as pd
from collections import OrderedDict

"""
Description: This script is used to analyse qiniu cdn log.
================================================================================
Log format
IP - ResponseTime [time +0800] "Method URL HTTP/1.1" code size "referer" "UA"
================================================================================
Log example
[0]            [1][2] [3]                  [4]    [5]
101.226.66.179 - 68 [16/Nov/2016:04:36:40 +0800] "GET http://www.qn.com/1.jpg -"
[6] [7] [8] [9]
200 502 "-" "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0)"
================================================================================
"""

if len(sys.argv) != 2:
    print('Usage:', sys.argv[0], 'file_of_log')
    sys.exit(1)
else:
    log_file = sys.argv[1]

# Positions of the fields we want statistics for
ip = 0
url = 5
status_code = 6
size = 7
referer = 8
ua = 9

# Read the log into a DataFrame, chunk by chunk
reader = pd.read_table(log_file, sep=' ', names=list(range(10)), iterator=True)

loop = True
chunkSize = 10000000
chunks = []
while loop:
    try:
        chunk = reader.get_chunk(chunkSize)
        chunks.append(chunk)
    except StopIteration:
        # Iteration is stopped.
        loop = False

df = pd.concat(chunks, ignore_index=True)

byte_sum = df[size].sum()                              # total traffic
status_counts = df[status_code].value_counts()         # status code statistics
top_status_code = pd.DataFrame({'count': status_counts,
                                'percent': status_counts / status_counts.sum() * 100})
top_ip = df[ip].value_counts().head(10)                # TOP IP
top_referer = df[referer].value_counts().head(10)      # TOP Referer
top_ua = df[ua].value_counts().head(10)                # TOP User-Agent
top_url = df[url].value_counts().head(10)              # TOP URL

# URLs with the most traffic
top_url_byte = df[[url, size]].groupby(url).sum().apply(lambda x: x.astype(float)/1024/1024) \
    .round(decimals=3).sort_values(by=[size], ascending=False)[size].head(10)
# IPs with the most traffic
top_ip_byte = df[[ip, size]].groupby(ip).sum().apply(lambda x: x.astype(float)/1024/1024) \
    .round(decimals=3).sort_values(by=[size], ascending=False)[size].head(10)

# Store the results in an ordered dict
result = OrderedDict([("Total traffic [unit: GB]:", byte_sum/1024/1024/1024),
                      ("Status code statistics [count|percent]:", top_status_code),
                      ("IP TOP 10:", top_ip),
                      ("Referer TOP 10:", top_referer),
                      ("UA TOP 10:", top_ua),
                      ("URL TOP 10:", top_url),
                      ("URLs with the most traffic TOP 10 [unit: MB]:", top_url_byte),
                      ("IPs with the most traffic TOP 10 [unit: MB]:", top_ip_byte)])

# Print the results
for k, v in result.items():
    print(k)
    print(v)
    print('='*80)
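To see where the field positions used by the script come from, here is a minimal sketch that parses a few log lines from an in-memory string instead of a file. Only the first sample line comes from the script's own description; the other two are invented for illustration. The names `SAMPLE_LOG` and `sample_df` are likewise made up here, and `pd.read_csv` is used in place of `pd.read_table` (both accept the same `sep` and `names` arguments).

```python
import io

import pandas as pd

# Three sample lines in the format described above. The first is the example
# from the script's description; the other two are invented for this sketch.
SAMPLE_LOG = (
    '101.226.66.179 - 68 [16/Nov/2016:04:36:40 +0800] '
    '"GET http://www.qn.com/1.jpg -" 200 502 "-" '
    '"Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0)"\n'
    '101.226.66.179 - 12 [16/Nov/2016:04:36:41 +0800] '
    '"GET http://www.qn.com/2.jpg -" 200 1024 "-" "curl/7.47.0"\n'
    '10.0.0.1 - 5 [16/Nov/2016:04:36:42 +0800] '
    '"GET http://www.qn.com/1.jpg -" 404 0 "-" "curl/7.47.0"\n'
)

# A single space separates the fields. Double-quoted fields (request, referer,
# UA) keep their inner spaces, so every line yields exactly 10 columns.
sample_df = pd.read_csv(io.StringIO(SAMPLE_LOG), sep=' ', names=list(range(10)))

print(sample_df[5])                 # request field
print(sample_df[7].sum())           # total bytes over the three lines
print(sample_df[0].value_counts())  # number of requests per IP
```

Note that the timestamp is split into two columns (3 and 4) because it contains a space and is not quoted, which is why the request ends up at position 5.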

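pandas study notes

pandas has two basic data structures, Series and DataFrame. A Series is an object similar to a one-dimensional array, made up of a set of data and an index. A DataFrame is a tabular data structure that has both a row index and a column index.

from pandas import Series, DataFrame
import pandas as pd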

Series

In [1]: obj = Series([4, 7, -5, 3])

In [2]: obj
Out[2]:
0    4
1    7
2   -5
3    3

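The string representation of a Series shows the index on the left and the values on the right. When no index is specified, an integer index from 0 to N-1 (N being the length of the data) is created automatically. The array representation and the index object are available through the values and index attributes of the Series:

In [3]: obj.values
Out[3]: array([ 4,  7, -5,  3])

In [4]: obj.index
Out[4]: RangeIndex(start=0, stop=4, step=1)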

Usually an index is specified when creating a Series:

In [5]: obj2 = Series([4, 7, -5, 3], index=['d', 'b', 'a', 'c'])

In [6]: obj2
Out[6]:
d    4
b    7
a   -5
c    3

Use the index to get a single value or a set of values from a Series:

In [7]: obj2['a']
Out[7]: -5

In [8]: obj2[['c','d']]
Out[8]:
c    3
d    4

Sorting

In [9]: obj2.sort_index()
Out[9]:
a   -5
b    7
c    3
d    4

In [10]: obj2.sort_values()
Out[10]:
a   -5
c    3
d    4
b    7

Filtering and arithmetic

In [11]: obj2[obj2 > 0]
Out[11]:
d    4
b    7
c    3

In [12]: obj2 * 2
Out[12]:
d     8
b    14
a   -10
c     6

Membership

In [13]: 'b' in obj2
Out[13]: True

In [14]: 'e' in obj2
Out[14]: False

Creating a Series from a dict

In [15]: sdata = {'Shanghai':35000, 'Beijing':40000, 'Nanjing':26000, 'Hangzhou':30000}

In [16]: obj3 = Series(sdata)

In [17]: obj3
Out[17]:
Beijing     40000
Hangzhou    30000
Nanjing     26000
Shanghai    35000

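If only a dict is passed, the index of the resulting Series is the dict's keys. They appear sorted in the output above; pandas 0.23 and later on Python 3.6+ keep the dict's insertion order instead.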
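The ordering of the index depends on the pandas version. The following is a minimal sketch, assuming pandas 0.23 or later on Python 3.6+, where the dict's insertion order is preserved; the names `by_insertion` and `by_label` are chosen here for illustration. An explicit `sort_index()` gives an alphabetical index regardless of version.

```python
import pandas as pd

sdata = {'Shanghai': 35000, 'Beijing': 40000, 'Nanjing': 26000, 'Hangzhou': 30000}

# On recent pandas versions the index follows the dict's insertion order.
by_insertion = pd.Series(sdata)
print(list(by_insertion.index))

# Sorting explicitly gives an alphabetical index on any version.
by_label = by_insertion.sort_index()
print(list(by_label.index))
```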

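In [18]: states = ['Beijing', 'Hangzhou', 'Shanghai', 'Suzhou']

In [19]: obj4 = Series(sdata, index=states)

In [20]: obj4
Out[20]:
Beijing     40000.0
Hangzhou    30000.0
Shanghai    35000.0
Suzhou          NaN

When index is specified, the 3 values in sdata whose keys match the states index are found and placed in the corresponding positions. Since sdata has no value for 'Suzhou', its result is NaN (not a number), which pandas uses to represent missing or NA values.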

The pandas functions isnull and notnull can be used to detect missing data:

In [21]: pd.isnull(obj4)
Out[21]:
Beijing     False
Hangzhou    False
Shanghai    False
Suzhou       True

In [22]: pd.notnull(obj4)
Out[22]:
Beijing      True
Hangzhou     True
Shanghai     True
Suzhou      False

Series has equivalent instance methods:

In [23]: obj4.isnull()
Out[23]:
Beijing     False
Hangzhou    False
Shanghai    False
Suzhou       True
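Because the result of `isnull()` is a boolean Series, it can be summed to count missing entries or used as a filter. A small sketch, with `missing_count` and `present` as illustrative names:

```python
import pandas as pd

sdata = {'Shanghai': 35000, 'Beijing': 40000, 'Nanjing': 26000, 'Hangzhou': 30000}
states = ['Beijing', 'Hangzhou', 'Shanghai', 'Suzhou']
obj4 = pd.Series(sdata, index=states)

# True counts as 1 when summed, so this is the number of missing values.
missing_count = int(obj4.isnull().sum())

# Boolean indexing with notnull() keeps only the entries that have a value.
present = obj4[obj4.notnull()]

print(missing_count)
print(present)
```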

An important feature of Series is that data with different indexes is aligned automatically in arithmetic operations:

In [24]: obj3
Out[24]:
Beijing     40000
Hangzhou    30000
Nanjing     26000
Shanghai    35000

In [25]: obj4
Out[25]:
Beijing     40000.0
Hangzhou    30000.0
Shanghai    35000.0
Suzhou          NaN

In [26]: obj3 + obj4
Out[26]:
Beijing     80000.0
Hangzhou    60000.0
Nanjing         NaN
Shanghai    70000.0
Suzhou          NaN
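When a label exists on only one side, the `+` operator yields NaN. If a missing side should count as zero instead, `Series.add` with `fill_value` is one option. The sketch below rebuilds `obj3` and `obj4` from the same data; the name `total` is chosen here for illustration.

```python
import pandas as pd

sdata = {'Shanghai': 35000, 'Beijing': 40000, 'Nanjing': 26000, 'Hangzhou': 30000}
states = ['Beijing', 'Hangzhou', 'Shanghai', 'Suzhou']
obj3 = pd.Series(sdata)
obj4 = pd.Series(sdata, index=states)

# fill_value=0 substitutes 0 where exactly one side is missing.
# 'Nanjing' is absent from obj4, so its sum is just obj3's value.
# 'Suzhou' is missing on both sides, so it stays NaN.
total = obj3.add(obj4, fill_value=0)
print(total)
```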

The index of a Series can be modified in place by assignment:

In [27]: obj.index = ['Bob', 'Steve', 'Jeff', 'Ryan']

In [28]: obj
Out[28]:
Bob      4
Steve    7
Jeff    -5
Ryan     3
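The new index must have the same length as the data. A short sketch of what happens otherwise: pandas raises a ValueError and the existing index is left unchanged. The variable `mismatch_error` is introduced here only to capture the exception.

```python
import pandas as pd

obj = pd.Series([4, 7, -5, 3])
obj.index = ['Bob', 'Steve', 'Jeff', 'Ryan']

# Assigning three labels to a four-element Series is rejected.
try:
    obj.index = ['only', 'three', 'labels']
    mismatch_error = None
except ValueError as exc:
    mismatch_error = exc

print(type(mismatch_error).__name__)
print(list(obj.index))  # still the four names assigned above
```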

DataFrame

Reading a file with pandas

In [29]: df = pd.read_table('pandas_test.txt',sep=' ', names=['name', 'age'])

In [30]: df
Out[30]:
    name  age
0    Bob   26
1   Loya   22
2  Denny   20
3   Mars   25
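The contents of pandas_test.txt are not shown in the article. The sketch below assumes it holds one space-separated name and age per line, as the output suggests, and reads the same data from an in-memory string so the example can be run without the file. `PANDAS_TEST_TXT` is a made-up name, and `pd.read_csv` stands in for `pd.read_table`.

```python
import io

import pandas as pd

# Assumed file contents, inferred from the DataFrame printed above.
PANDAS_TEST_TXT = "Bob 26\nLoya 22\nDenny 20\nMars 25\n"

# names= supplies the column labels, since the data has no header row.
df = pd.read_csv(io.StringIO(PANDAS_TEST_TXT), sep=' ', names=['name', 'age'])

print(df)
print(df.dtypes)  # age is parsed as an integer column
```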

Selecting DataFrame columns

df['name']

In [31]: df['name']
Out[31]:
0      Bob
1     Loya
2    Denny
3     Mars
Name: name, dtype: object

Selecting DataFrame rows

df.iloc[0,:]   # the first argument is the row, the second is the column; here: row 0, all columns
df.iloc[:,0]   # all rows, column 0

In [32]: df.iloc[0,:]
Out[32]:
name    Bob
age      26
Name: 0, dtype: object

In [33]: df.iloc[:,0]
Out[33]:
0      Bob
1     Loya
2    Denny
3     Mars
Name: name, dtype: object

A single element can be fetched with iloc; iat is a faster way:

In [34]: df.iloc[1,1]
Out[34]: 22

In [35]: df.iat[1,1]
Out[35]: 22
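`iloc` and `iat` address elements by integer position, while `loc` and `at` address them by label. With the default integer index the two look the same, so this sketch uses a hypothetical string index ('w' to 'z') to make the difference visible:

```python
import pandas as pd

# Same data as above, but with string row labels chosen for illustration.
df = pd.DataFrame({'name': ['Bob', 'Loya', 'Denny', 'Mars'],
                   'age': [26, 22, 20, 25]},
                  index=['w', 'x', 'y', 'z'])

by_position = df.iat[1, 1]     # second row, second column, by position
by_label = df.at['x', 'age']   # the same cell, addressed by labels

print(by_position, by_label)
```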

Selecting a block of a DataFrame

In [36]: df.loc[1:2,['name','age']]
Out[36]:
    name  age
1   Loya   22
2  Denny   20
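One detail worth noting in the example above is that a `loc` slice includes its end label, unlike a positional `iloc` slice. A brief sketch on the same data (`label_block` and `position_block` are illustrative names):

```python
import pandas as pd

df = pd.DataFrame({'name': ['Bob', 'Loya', 'Denny', 'Mars'],
                   'age': [26, 22, 20, 25]})

# Label-based slice: both end points are included, so rows 1 and 2.
label_block = df.loc[1:2, ['name', 'age']]

# Position-based slice: the stop position is excluded, so row 1 only.
position_block = df.iloc[1:2, 0:2]

print(label_block)
print(position_block)
```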

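Filtering rows by condition

Put a condition inside the square brackets to filter rows; the condition must evaluate to True or False for each row. (The outputs below already include the city column, which is added in the next section.)

In [37]: df[(df.index >= 1) & (df.index <= 3)]
Out[37]:
    name  age      city
1   Loya   22  Shanghai
2  Denny   20  Hangzhou
3   Mars   25   Nanjing

In [38]: df[df['age'] > 22]
Out[38]:
   name  age     city
0   Bob   26  Beijing
3  Mars   25  Nanjing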
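Conditions on several columns can be combined with `&` (and) and `|` (or). Each condition needs its own parentheses because these operators bind more tightly than comparisons. A minimal sketch on the same data, with `selected` and `either` as illustrative names:

```python
import pandas as pd

df = pd.DataFrame({'name': ['Bob', 'Loya', 'Denny', 'Mars'],
                   'age': [26, 22, 20, 25],
                   'city': ['Beijing', 'Shanghai', 'Hangzhou', 'Nanjing']})

# Rows where both conditions hold.
selected = df[(df['age'] > 21) & (df['city'] != 'Beijing')]

# Rows where at least one condition holds.
either = df[(df['age'] < 21) | (df['city'] == 'Beijing')]

print(selected)
print(either)
```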

Adding a column

In [39]: df['city'] = ['Beijing', 'Shanghai', 'Hangzhou', 'Nanjing']

In [40]: df
Out[40]:
    name  age      city
0    Bob   26   Beijing
1   Loya   22  Shanghai
2  Denny   20  Hangzhou
3   Mars   25   Nanjing

Sorting

Sort by a given column:

In [41]: df.sort_values(by='age')
Out[41]:
    name  age      city
2  Denny   20  Hangzhou
1   Loya   22  Shanghai
3   Mars   25   Nanjing
0    Bob   26   Beijing
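`sort_values` returns a new, sorted DataFrame and leaves the original untouched; passing `ascending=False` reverses the order. A short sketch (`oldest_first` is an illustrative name):

```python
import pandas as pd

df = pd.DataFrame({'name': ['Bob', 'Loya', 'Denny', 'Mars'],
                   'age': [26, 22, 20, 25],
                   'city': ['Beijing', 'Shanghai', 'Hangzhou', 'Nanjing']})

# Descending sort by age; the result is a new object.
oldest_first = df.sort_values(by='age', ascending=False)

print(oldest_first)
print(df)  # the original row order is unchanged
```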

# Import numpy to build a DataFrame
import numpy as np

In [42]: df = pd.DataFrame(np.arange(8).reshape((2, 4)), index=['three', 'one'], columns=['d', 'a', 'b', 'c'])

In [43]: df
Out[43]:
       d  a  b  c
three  0  1  2  3
one    4  5  6  7

# Sort by index
In [44]: df.sort_index()
Out[44]:
       d  a  b  c
one    4  5  6  7
three  0  1  2  3

In [45]: df.sort_index(axis=1)
Out[45]:
       a  b  c  d
three  1  2  3  0
one    5  6  7  4

# Descending order
In [46]: df.sort_index(axis=1, ascending=False)
Out[46]:
       d  c  b  a
three  0  3  2  1
one    4  7  6  5

Viewing

# View the first 5 rows
df.head(5)

# View the last 5 rows
df.tail(5)

# View the column names
In [47]: df.columns
Out[47]: Index(['name', 'age', 'city'], dtype='object')

# View the current values of the table
In [48]: df.values
Out[48]:
array([['Bob', 26, 'Beijing'],
       ['Loya', 22, 'Shanghai'],
       ['Denny', 20, 'Hangzhou'],
       ['Mars', 25, 'Nanjing']], dtype=object)

Transpose

In [49]: df.T
Out[49]:
            0         1         2        3
name      Bob      Loya     Denny     Mars
age        26        22        20       25
city  Beijing  Shanghai  Hangzhou  Nanjing

Using isin

In [50]: df2 = df.copy()

In [51]: df2[df2['city'].isin(['Shanghai','Nanjing'])]
Out[51]:
   name  age      city
1  Loya   22  Shanghai
3  Mars   25   Nanjing
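The boolean mask produced by `isin` can be inverted with `~` to select the rows that are not in the list. A minimal sketch on the same data, with `in_cities` and `others` as illustrative names:

```python
import pandas as pd

df2 = pd.DataFrame({'name': ['Bob', 'Loya', 'Denny', 'Mars'],
                    'age': [26, 22, 20, 25],
                    'city': ['Beijing', 'Shanghai', 'Hangzhou', 'Nanjing']})

# True for rows whose city is one of the listed values.
in_cities = df2['city'].isin(['Shanghai', 'Nanjing'])

# ~ flips the mask, keeping every other row.
others = df2[~in_cities]
print(others)
```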

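Arithmetic operations:

In [53]: df = pd.DataFrame([[1.4, np.nan], [7.1, -4.5], [np.nan, np.nan], [0.75, -1.3]],
   ...:                   index=['a', 'b', 'c', 'd'], columns=['one', 'two'])

In [54]: df
Out[54]:
    one  two
a  1.40  NaN
b  7.10 -4.5
c   NaN  NaN
d  0.75 -1.3

# Sum by column
In [55]: df.sum()
Out[55]:
one    9.25
two   -5.80

# Sum by row
In [56]: df.sum(axis=1)
Out[56]:
a    1.40
b    2.60
c     NaN
d   -0.55

# Note: the NaN for row c reflects an older pandas version. pandas 0.22 and
# later return 0.00 for an all-NaN row unless min_count=1 is passed.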
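How `sum` treats NaN can be controlled with the `skipna` and `min_count` parameters. The sketch below assumes pandas 0.22 or later, where the sum of an all-NaN row defaults to 0; the three result names are chosen here for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[1.4, np.nan], [7.1, -4.5], [np.nan, np.nan], [0.75, -1.3]],
                  index=['a', 'b', 'c', 'd'], columns=['one', 'two'])

# Default: NaN values are skipped, and an all-NaN row sums to 0.0.
row_sums = df.sum(axis=1)

# min_count=1 requires at least one non-NaN value, so row c becomes NaN.
row_sums_strict = df.sum(axis=1, min_count=1)

# skipna=False propagates NaN, so rows a and c both become NaN.
row_sums_noskip = df.sum(axis=1, skipna=False)

print(row_sums)
print(row_sums_strict)
print(row_sums_noskip)
```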

group

group refers to the following steps:

Splitting the data into groups based on some criteria
Applying a function to each group independently
Combining the results into a data structure

See the Grouping section
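The three steps can be sketched by hand and compared with the built-in result. This is only an illustration on a small made-up DataFrame, not how pandas implements groupby internally; `partial_sums`, `manual` and `builtin` are names chosen here.

```python
import pandas as pd

df = pd.DataFrame({'A': ['foo', 'bar', 'foo', 'bar'],
                   'C': [1.0, 2.0, 3.0, 4.0]})

# 1. Split: iterating over a GroupBy yields (key, sub-DataFrame) pairs.
# 2. Apply: sum column C within each group independently.
partial_sums = {key: group['C'].sum() for key, group in df.groupby('A')}

# 3. Combine: collect the per-group results into a single Series.
manual = pd.Series(partial_sums, name='C')

# The built-in equivalent performs all three steps in one call.
builtin = df.groupby('A')['C'].sum()

print(manual)
print(builtin)
```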

In [57]: df = pd.DataFrame({'A' : ['foo', 'bar', 'foo', 'bar',
   ....:                           'foo', 'bar', 'foo', 'foo'],
   ....:                    'B' : ['one', 'one', 'two', 'three',
   ....:                           'two', 'two', 'one', 'three'],
   ....:                    'C' : np.random.randn(8),
   ....:                    'D' : np.random.randn(8)})
   ....:

In [58]: df
Out[58]:
     A      B         C         D
0  foo    one -1.202872 -0.055224
1  bar    one -1.814470  2.395985
2  foo    two  1.018601  1.552825
3  bar  three -0.595447  0.166599
4  foo    two  1.395433  0.047609
5  bar    two -0.392670 -0.136473
6  foo    one  0.007207 -0.561757
7  foo  three  1.928123 -1.623033

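Group the data, then apply the sum function. The output below comes from an older pandas that silently dropped the non-numeric column B; on pandas 2.0 and later, df.groupby('A').sum(numeric_only=True) gives the same result.

In [59]: df.groupby('A').sum()
Out[59]:
            C        D
A
bar -2.802588  2.42611
foo  3.146492 -0.63958

In [60]: df.groupby(['A','B']).sum()
Out[60]:
                  C         D
A   B
bar one   -1.814470  2.395985
    three -0.595447  0.166599
    two   -0.392670 -0.136473
foo one   -1.195665 -0.616981
    three  1.928123 -1.623033
    two    2.414034  1.600434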
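Selecting the numeric columns explicitly before aggregating avoids depending on how a given pandas version treats the string column B. `agg` can also compute several statistics in one pass. The sketch uses fixed, made-up numbers instead of `np.random.randn` so that the results are reproducible; `sums` and `stats` are illustrative names.

```python
import pandas as pd

df = pd.DataFrame({'A': ['foo', 'bar', 'foo', 'bar', 'foo', 'bar', 'foo', 'foo'],
                   'B': ['one', 'one', 'two', 'three', 'two', 'two', 'one', 'three'],
                   'C': [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0],
                   'D': [10.0, 20.0, 30.0, 40.0, 50.0, 60.0, 70.0, 80.0]})

# Restrict the aggregation to the numeric columns C and D.
sums = df.groupby('A')[['C', 'D']].sum()

# Several aggregations of one column at once.
stats = df.groupby('A')['C'].agg(['sum', 'mean', 'count'])

print(sums)
print(stats)
```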

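Summary

That is everything on analyzing CDN logs with the pandas library in Python. I hope this article helps with your study or work. If you have questions, feel free to leave a comment. Thank you for supporting 武林網.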
