This article covers the data-processing part of the project; please read it carefully.
1. Data analysis

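The crawler produced text files whose names begin with a fixed set of prefixes (the ones used later in this article are correct, errTime, notexist, unkownsex and httperror), and these files now need to be processed.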

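2. Rollback
We need to reprocess the httperror data.
Because of how the crawler code was written (see part 2 of this series for details), the same id can appear several times in a row with an httperror record: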
// httperror265001_266001.txt
265002 httperror
265002 httperror
265002 httperror
265002 httperror
265003 httperror
265003 httperror
265003 httperror
265003 httperror
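The code therefore has to handle this case: instead of processing the id on every line, it first checks whether the id is a repeat of the previous one.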
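The duplicate check described above can be sketched in isolation. This is a minimal illustration written for Python 3 (the article's own listings target Python 2.7); the helper name `unique_consecutive_ids` is hypothetical, and unlike the article's listing it compares only the id field rather than the whole line.

```python
def unique_consecutive_ids(lines):
    """Yield each id once per run of consecutive repeats.

    Each line is expected to look like '265002 httperror'.
    """
    previous_id = None
    for line in lines:
        fields = line.split()
        if not fields:              # skip blank lines
            continue
        current_id = fields[0]
        if current_id != previous_id:
            previous_id = current_id
            yield int(current_id)


sample = [
    "265002 httperror\n",
    "265002 httperror\n",
    "265003 httperror\n",
    "265003 httperror\n",
]
print(list(unique_consecutive_ids(sample)))  # [265002, 265003]
```

Only adjacent repeats are collapsed, which is enough here because the repeated records for one id are written one after another.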
Java has caching facilities that avoid reading a file from disk too frequently. Python has an equivalent as well; see this article.
import os, re, sys

# searchWeb() is not shown in this listing; it belongs to the crawler code from the earlier parts of this series
def main():
    reload(sys)
    sys.setdefaultencoding('utf-8')
    global sexRe,timeRe,notexistRe,url1,url2,file1,file2,file3,file4,startNum,endNum,file5
    # \u6027\u522b is the page label for "sex"
    sexRe = re.compile(u'em>\u6027\u522b</em>(.*?)</li')
    # \u4e0a\u6b21\u6d3b\u52a8\u65f6\u95f4 is the page label for "last activity time"
    timeRe = re.compile(u'em>\u4e0a\u6b21\u6d3b\u52a8\u65f6\u95f4</em>(.*?)</li')
    # The message shown when the requested user space does not exist
    notexistRe = re.compile(u'(p>)\u62b1\u6b49\uff0c\u60a8\u6307\u5b9a\u7684\u7528\u6237\u7a7a\u95f4\u4e0d\u5b58\u5728<')
    url1 = 'http://rs.xidian.edu.cn/home.php?mod=space&uid=%s'
    url2 = 'http://rs.xidian.edu.cn/home.php?mod=space&uid=%s&do=profile'
    file1 = 'ruisi//correct_re.txt'
    file2 = 'ruisi//errTime_re.txt'
    file3 = 'ruisi//notexist_re.txt'
    file4 = 'ruisi//unkownsex_re.txt'
    file5 = 'ruisi//httperror_re.txt'
    # Walk through the files in the folder whose names start with httperror
    for filename in os.listdir(r'E:/pythonProject/ruisi'):
        if filename.startswith('httperror'):
            count = 0
            newName = 'E://pythonProject//ruisi//%s' % (filename)
            readFile = open(newName,'r')
            oldLine = '0'
            for line in readFile:
                # newLine is compared with the previous line to detect a repeated id
                newLine = line
                if (newLine != oldLine):
                    nu = newLine.split()[0]
                    oldLine = newLine
                    count += 1
                    searchWeb((int(nu),))
            readFile.close()
            print "%s deal %s lines" %(filename, count)
For simplicity, this code does not classify the httperror ids any further; the results are written directly to the following five files:
file1 = 'ruisi//correct_re.txt'
file2 = 'ruisi//errTime_re.txt'
file3 = 'ruisi//notexist_re.txt'
file4 = 'ruisi//unkownsex_re.txt'
file5 = 'ruisi//httperror_re.txt'
The output log shows how many httperror records were processed in total.
"D:/Program Files/Python27/python.exe" E:/pythonProject/webCrawler/reload.py
httperror132001-133001.txt deal 21 lines
httperror2001-3001.txt deal 4 lines
httperror251001-252001.txt deal 5 lines
httperror254001-255001.txt deal 1 lines
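3. Counting unkownsex data with a single thread
The code is simple: with a single thread we count the unkownsex users, that is, users whose sex could not be retrieved because of permission restrictions or who never filled it in. We also checked and found that users with no recorded sex have no activity time either.
The data format is as follows: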
253042 unkownsex
253087 unkownsex
253102 unkownsex
253118 unkownsex
253125 unkownsex
253136 unkownsex
253161 unkownsex
import os,time
sumCount = 0
startTime = time.clock()
for filename in os.listdir(r'E:/pythonProject/ruisi'):
    if filename.startswith('unkownsex'):
        count = 0
        newName = 'E://pythonProject//ruisi//%s' % (filename)
        readFile = open(newName,'r')
        # Iterate over the handle opened above instead of opening the file a second time
        for line in readFile:
            count += 1
            sumCount +=1
        readFile.close()
        print "%s deal %s lines" %(filename, count)
print '%s unkowns sex' %(sumCount)
endTime = time.clock()
print "cost time " + str(endTime - startTime) + " s"
Processing is very fast. The output is as follows:
unkownsex1-1001.txt deal 204 lines
unkownsex100001-101001.txt deal 50 lines
unkownsex10001-11001.txt deal 206 lines
#... intermediate output omitted
unkownsex99001-100001.txt deal 56 lines
unkownsex_re.txt deal 1085 lines
14223 unkowns sex
cost time 0.0813142301261 s
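The per-file line count in this section can be wrapped in a small reusable function. The following is a hedged sketch for Python 3, not the article's script: the function name `count_lines_by_prefix` is hypothetical, and the demo runs against a temporary directory with made-up contents instead of E:/pythonProject/ruisi.

```python
import os
import tempfile


def count_lines_by_prefix(directory, prefix):
    """Return {filename: line_count} for files in directory whose names start with prefix."""
    counts = {}
    for filename in sorted(os.listdir(directory)):
        if not filename.startswith(prefix):
            continue
        path = os.path.join(directory, filename)
        with open(path, "r") as handle:
            counts[filename] = sum(1 for _ in handle)
    return counts


# Demo on a throwaway directory; file contents are invented for illustration.
demo_dir = tempfile.mkdtemp()
with open(os.path.join(demo_dir, "unkownsex1-1001.txt"), "w") as f:
    f.write("253042 unkownsex\n253087 unkownsex\n")
with open(os.path.join(demo_dir, "correct1-1001.txt"), "w") as f:
    f.write("31024 x 2014-11-11 13:20\n")

per_file = count_lines_by_prefix(demo_dir, "unkownsex")
print(per_file, sum(per_file.values()))
```

Using `with open(...)` closes each file as soon as it has been counted, which matters when a directory holds hundreds of result files.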
4. Counting correct data with a single thread
The data format is as follows:
31024 男 2014-11-11 13:20
31283 男 2013-3-25 19:41
31340 保密 2015-2-2 15:17
31427 保密 2014-8-10 09:17
31475 保密 2013-7-2 08:59
31554 保密 2014-10-17 17:02
31621 男 2015-5-16 19:27
31872 保密 2015-1-11 16:49
31915 保密 2014-5-4 11:01
31997 保密 2015-5-16 20:14
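The code is below. The idea is to read the files line by line and use line.split() to extract the sex field. sumCount is the total number of users, while boycount, girlcount and secretcount count the male (男), female (女) and undisclosed (保密) users respectively. As before, we match against unicode literals; here this is a plain string comparison rather than a regular expression.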
import os,sys,time
reload(sys)
sys.setdefaultencoding('utf-8')
startTime = time.clock()
sumCount = 0
boycount = 0
girlcount = 0
secretcount = 0
for filename in os.listdir(r'E:/pythonProject/ruisi'):
    if filename.startswith('correct'):
        newName = 'E://pythonProject//ruisi//%s' % (filename)
        readFile = open(newName,'r')
        for line in readFile:
            # The second field is the sex
            sexInfo = line.split()[1]
            sumCount +=1
            if sexInfo == u'\u7537' :
                boycount += 1
            elif sexInfo == u'\u5973':
                girlcount +=1
            elif sexInfo == u'\u4fdd\u5bc6':
                secretcount +=1
        readFile.close()
        print "until %s, sum is %s boys; %s girls; %s secret;" %(filename, boycount,girlcount,secretcount)
print "total is %s; %s boys; %s girls; %s secret;" %(sumCount, boycount,girlcount,secretcount)
endTime = time.clock()
print "cost time " + str(endTime - startTime) + " s"
Note that each line printed is the running total up to and including that file, not the statistics of the single file. The output is as follows:
until correct1-1001.txt, sum is 110 boys; 7 girls; 414 secret;
until correct100001-101001.txt, sum is 125 boys; 13 girls; 542 secret;
#... omitted
until correct99001-100001.txt, sum is 11070 boys; 3113 girls; 26636 secret;
until correct_re.txt, sum is 13937 boys; 4007 girls; 28941 secret;
total is 46885; 13937 boys; 4007 girls; 28941 secret;
cost time 3.60047888495 s
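The same split-and-compare tally can be expressed more compactly with `collections.Counter`. This is a sketch for Python 3 under the assumption that every record has the form "id sex date time"; the function name `tally_sex` and the constant names are hypothetical and do not appear in the article's script.

```python
from collections import Counter

MALE = u"\u7537"          # the value stored for male users
FEMALE = u"\u5973"        # the value stored for female users
SECRET = u"\u4fdd\u5bc6"  # the value stored for undisclosed


def tally_sex(lines):
    """Count how often each value appears in the second whitespace-separated field."""
    counts = Counter()
    for line in lines:
        fields = line.split()
        if len(fields) >= 2:      # ignore blank or malformed lines
            counts[fields[1]] += 1
    return counts


sample = [
    u"31024 \u7537 2014-11-11 13:20",
    u"31283 \u7537 2013-3-25 19:41",
    u"31340 \u4fdd\u5bc6 2015-2-2 15:17",
]
counts = tally_sex(sample)
print(sum(counts.values()), counts[MALE], counts[FEMALE], counts[SECRET])  # 3 2 0 1
```

A `Counter` returns 0 for a key it has never seen, so no separate initialisation per category is needed.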
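5. Counting with multiple threads
To count faster, we can use multiple threads.
For comparison, keep in mind the time the single-threaded version needed.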
# encoding: UTF-8
import threading
import time,os,sys

# Global variables
SUM = 0
BOY = 0
GIRL = 0
SECRET = 0
NUM =0

# The class inherits from threading.Thread and overrides run(); start() launches the thread
# This is very similar to Java
class StaFileList(threading.Thread):
    # List of text file names
    fileList = []

    def __init__(self, fileList):
        threading.Thread.__init__(self)
        self.fileList = fileList

    def run(self):
        global SUM, BOY, GIRL, SECRET
        # A delay can be added here to make the multithreading more visible, rather than thread-1,2,3 running in order
        #time.sleep(1)
        # acquire() takes the lock
        if mutex.acquire(1):
            self.staFiles(self.fileList)
            # release() frees the lock
            mutex.release()

    # Process the given list of files and count the users by sex
    # Mind the data synchronization here; global gives access to the global variables
    def staFiles(self, files):
        global SUM, BOY, GIRL, SECRET
        for name in files:
            newName = 'E://pythonProject//ruisi//%s' % (name)
            readFile = open(newName,'r')
            for line in readFile:
                sexInfo = line.split()[1]
                SUM +=1
                if sexInfo == u'\u7537' :
                    BOY += 1
                elif sexInfo == u'\u5973':
                    GIRL +=1
                elif sexInfo == u'\u4fdd\u5bc6':
                    SECRET +=1
            readFile.close()
            # print "thread %s, until %s, total is %s; %s boys; %s girls;" \
            #     " %s secret;" %(self.name, name, SUM, BOY,GIRL,SECRET)

def test():
    # files holds several file names; this controls how many files one thread handles
    files = []
    # Keeps all threads so the main thread can wait for every child thread at the end
    staThreads = []
    i = 0
    for filename in os.listdir(r'E:/pythonProject/ruisi'):
        # Create one thread for every 20 files collected
        if filename.startswith('correct'):
            files.append(filename)
            i+=1
            # One thread handles 20 files
            if i == 20 :
                staThreads.append(StaFileList(files))
                files = []
                i = 0
    # The remaining files, most likely fewer than 20
    if files:
        staThreads.append(StaFileList(files))

    for t in staThreads:
        t.start()
    # Wait in the main thread for all child threads to exit; would it be faster without this?
    for t in staThreads:
        t.join()

if __name__ == '__main__':
    reload(sys)
    sys.setdefaultencoding('utf-8')
    startTime = time.clock()
    mutex = threading.Lock()
    test()
    print "Multi Thread, total is %s; %s boys; %s girls; %s secret;" %(SUM, BOY,GIRL,SECRET)
    endTime = time.clock()
    print "cost time " + str(endTime - startTime) + " s"
Output:
Multi Thread, total is 46885; 13937 boys; 4007 girls; 28941 secret;
cost time 0.132137192794 s
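We find that the time is about the same as with a single thread. Thread synchronization is involved here: acquiring and releasing the lock both cost time, and switching between threads (saving and restoring context) costs time as well. Note also that each thread holds the mutex for its entire run, so the threads effectively process their file lists one after another. In addition, the 3.60 s measured in section 4 includes printing a progress line for every file, so the two figures are not a like-for-like comparison.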
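One way to reduce the locking overhead discussed above is to let each thread count into its own local object and merge the partial results afterwards. The sketch below is an alternative to the article's design, not a description of it: it is written for Python 3, uses `concurrent.futures.ThreadPoolExecutor` from the standard library, and the function names and the sample records are invented for illustration. In CPython the global interpreter lock still prevents pure-Python counting from running in parallel, so any speed-up would come mainly from overlapping file I/O.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def tally_chunk(lines):
    """Count the second field of each line using only thread-local state."""
    local = Counter()
    for line in lines:
        fields = line.split()
        if len(fields) >= 2:
            local[fields[1]] += 1
    return local


def tally_in_threads(chunks, max_workers=4):
    """Run tally_chunk on every chunk in a thread pool and merge the partial counts."""
    total = Counter()
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        for partial in pool.map(tally_chunk, chunks):
            total.update(partial)   # merging happens in the calling thread, so no lock is needed
    return total


# Made-up records; 'boy', 'girl' and 'secret' stand in for the real field values.
chunks = [
    ["1 boy 2014-11-11 13:20", "2 secret 2015-2-2 15:17"],
    ["3 girl 2014-5-4 11:01", "4 secret 2013-7-2 08:59"],
]
totals = tally_in_threads(chunks)
print(totals["boy"], totals["girl"], totals["secret"])  # 1 1 2
```

Because no thread writes to shared variables, the result is deterministic without a mutex.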
6. Single thread versus multiple threads on more data
We can process the correct, errTime and unkownsex files together.
Single-threaded code
# coding=utf-8
import os,sys,time
reload(sys)
sys.setdefaultencoding('utf-8')
startTime = time.clock()
sumCount = 0
boycount = 0
girlcount = 0
secretcount = 0
unkowncount = 0
for filename in os.listdir(r'E:/pythonProject/ruisi'):
    # Has both sex and activity time
    if filename.startswith('correct') :
        newName = 'E://pythonProject//ruisi//%s' % (filename)
        readFile = open(newName,'r')
        for line in readFile:
            sexInfo =line.split()[1]
            sumCount +=1
            if sexInfo == u'\u7537' :
                boycount += 1
            elif sexInfo == u'\u5973':
                girlcount +=1
            elif sexInfo == u'\u4fdd\u5bc6':
                secretcount +=1
        readFile.close()
        # print "until %s, sum is %s boys; %s girls; %s secret;" %(filename, boycount,girlcount,secretcount)
    # No activity time, but has sex
    elif filename.startswith("errTime"):
        newName = 'E://pythonProject//ruisi//%s' % (filename)
        readFile = open(newName,'r')
        for line in readFile:
            sexInfo =line.split()[1]
            sumCount +=1
            if sexInfo == u'\u7537' :
                boycount += 1
            elif sexInfo == u'\u5973':
                girlcount +=1
            elif sexInfo == u'\u4fdd\u5bc6':
                secretcount +=1
        readFile.close()
        # print "until %s, sum is %s boys; %s girls; %s secret;" %(filename, boycount,girlcount,secretcount)
    # Neither sex nor time: just count the lines
    elif filename.startswith("unkownsex"):
        newName = 'E://pythonProject//ruisi//%s' % (filename)
        # count = len(open(newName,'rU').readlines())
        # For large files use a loop; count starts at -1 to handle an empty file, so that adding 1 at the end gives 0 lines
        count = -1
        for count, line in enumerate(open(newName, 'rU')):
            pass
        count += 1
        unkowncount += count
        sumCount += count
        # print "until %s, sum is %s unkownsex" %(filename, unkowncount)
print "Single Thread, total is %s; %s boys; %s girls; %s secret; %s unkownsex;" %(sumCount, boycount,girlcount,secretcount,unkowncount)
endTime = time.clock()
print "cost time " + str(endTime - startTime) + " s"
Output:
Single Thread, total is 61111; 13937 boys; 4009 girls; 28942 secret; 14223 unkownsex;
cost time 1.37444645628 s
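The line-counting idiom used for the unkownsex files in the single-threaded listing (enumerate with count initialised to -1) is easy to check on its own. A minimal sketch for Python 3 follows; the function name `count_lines` and the temporary files are assumptions made for the demonstration.

```python
import os
import tempfile


def count_lines(path):
    """Count lines without loading the whole file into memory."""
    count = -1                      # stays -1 if the loop body never runs (empty file)
    with open(path, "r") as handle:
        for count, _line in enumerate(handle):
            pass                    # enumerate leaves the last zero-based index in count
    return count + 1


demo_dir = tempfile.mkdtemp()
empty_path = os.path.join(demo_dir, "empty.txt")
three_path = os.path.join(demo_dir, "three.txt")
open(empty_path, "w").close()
with open(three_path, "w") as f:
    f.write("a\nb\nc\n")

print(count_lines(empty_path), count_lines(three_path))  # 0 3
```

Without the -1 initialisation, an empty file would leave `count` undefined, or worse, still holding the value from the previous file.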
Multi-threaded code
__author__ = 'admin'
# encoding: UTF-8
# Multi-threaded processing program
import threading
import time,os,sys

# Global variables
SUM = 0
BOY = 0
GIRL = 0
SECRET = 0
UNKOWN = 0

class StaFileList(threading.Thread):
    # List of text file names
    fileList = []

    def __init__(self, fileList):
        threading.Thread.__init__(self)
        self.fileList = fileList

    def run(self):
        global SUM, BOY, GIRL, SECRET
        if mutex.acquire(1):
            self.staManyFiles(self.fileList)
            mutex.release()

    # Process the given list of files and count the users by sex
    # Mind the data synchronization here
    def staCorrectFiles(self, files):
        global SUM, BOY, GIRL, SECRET
        for name in files:
            newName = 'E://pythonProject//ruisi//%s' % (name)
            readFile = open(newName,'r')
            for line in readFile:
                sexInfo = line.split()[1]
                SUM +=1
                if sexInfo == u'\u7537' :
                    BOY += 1
                elif sexInfo == u'\u5973':
                    GIRL +=1
                elif sexInfo == u'\u4fdd\u5bc6':
                    SECRET +=1
            readFile.close()
            # print "thread %s, until %s, total is %s; %s boys; %s girls;" \
            #     " %s secret;" %(self.name, name, SUM, BOY,GIRL,SECRET)

    def staManyFiles(self, files):
        global SUM, BOY, GIRL, SECRET,UNKOWN
        for name in files:
            if name.startswith('correct') :
                newName = 'E://pythonProject//ruisi//%s' % (name)
                readFile = open(newName,'r')
                for line in readFile:
                    sexInfo = line.split()[1]
                    SUM +=1
                    if sexInfo == u'\u7537' :
                        BOY += 1
                    elif sexInfo == u'\u5973':
                        GIRL +=1
                    elif sexInfo == u'\u4fdd\u5bc6':
                        SECRET +=1
                readFile.close()
                # print "thread %s, until %s, total is %s; %s boys; %s girls;" \
                #     " %s secret;" %(self.name, name, SUM, BOY,GIRL,SECRET)
            # No activity time, but has sex
            elif name.startswith("errTime"):
                newName = 'E://pythonProject//ruisi//%s' % (name)
                readFile = open(newName,'r')
                for line in readFile:
                    sexInfo = line.split()[1]
                    SUM +=1
                    if sexInfo == u'\u7537' :
                        BOY += 1
                    elif sexInfo == u'\u5973':
                        GIRL +=1
                    elif sexInfo == u'\u4fdd\u5bc6':
                        SECRET +=1
                readFile.close()
                # print "thread %s, until %s, total is %s; %s boys; %s girls;" \
                #     " %s secret;" %(self.name, name, SUM, BOY,GIRL,SECRET)
            # Neither sex nor time: just count the lines
            elif name.startswith("unkownsex"):
                newName = 'E://pythonProject//ruisi//%s' % (name)
                # count = len(open(newName,'rU').readlines())
                # For large files use a loop; count starts at -1 to handle an empty file, so that adding 1 at the end gives 0 lines
                count = -1
                for count, line in enumerate(open(newName, 'rU')):
                    pass
                count += 1
                UNKOWN += count
                SUM += count
                # print "thread %s, until %s, total is %s; %s unkownsex" %(self.name, name, SUM, UNKOWN)

def test():
    files = []
    # Keeps all threads so the main thread can wait for every child thread at the end
    staThreads = []
    i = 0
    for filename in os.listdir(r'E:/pythonProject/ruisi'):
        # Create one thread for every 20 files collected
        if filename.startswith("correct") or filename.startswith("errTime") or filename.startswith("unkownsex"):
            files.append(filename)
            i+=1
            if i == 20 :
                staThreads.append(StaFileList(files))
                files = []
                i = 0
    # The remaining files, most likely fewer than 20
    if files:
        staThreads.append(StaFileList(files))
    for t in staThreads:
        t.start()
    # Wait in the main thread for all child threads to exit
    for t in staThreads:
        t.join()

if __name__ == '__main__':
    reload(sys)
    sys.setdefaultencoding('utf-8')
    startTime = time.clock()
    mutex = threading.Lock()
    test()
    print "Multi Thread, total is %s; %s boys; %s girls; %s secret; %s unkownsex" %(SUM, BOY,GIRL,SECRET,UNKOWN)
    endTime = time.clock()
    print "cost time " + str(endTime - startTime) + " s"
Output:
Multi Thread, total is 61111; 13937 boys; 4009 girls; 28942 secret;
cost time 1.23049112201 s
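As the figures show, multiple threads are still slightly faster than a single thread here (1.23 s versus 1.37 s), and because synchronization is used, the counts are consistent with the single-threaded result.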
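The role of the lock in keeping the counts consistent can be shown with a shared counter. This is a small illustrative sketch for Python 3, not the article's code: the names `total`, `lock` and `add_many` are hypothetical, and the lock is taken per increment with a `with` statement instead of once per thread as in the listing above.

```python
import threading

total = 0
lock = threading.Lock()


def add_many(n):
    """Increment the shared counter n times, holding the lock for each update."""
    global total
    for _ in range(n):
        with lock:          # released automatically, even if an exception occurs
            total += 1


threads = [threading.Thread(target=add_many, args=(10000,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()                # wait for every worker before reading the result

print(total)  # 40000
```

Locking per increment allows the threads to interleave but pays the acquire and release cost on every update, whereas locking once per thread, as the article does, is cheaper but makes the threads run one after another.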
Note that inside a class Python usually requires an explicit self, which is quite different from Java.
def __init__(self, fileList):
    threading.Thread.__init__(self)
    self.fileList = fileList

def run(self):
    global SUM, BOY, GIRL, SECRET
    if mutex.acquire(1):
        # Calling another method of the same class requires the self prefix
        self.staFiles(self.fileList)
        mutex.release()
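The use of self can be seen in a complete, runnable class. The following is a minimal sketch for Python 3 with a hypothetical class `LineCounter`; it stores its result on the instance instead of in global variables, which is a design choice of this example rather than of the article.

```python
import threading


class LineCounter(threading.Thread):
    """Thread that counts the non-blank lines it was given."""

    def __init__(self, lines):
        threading.Thread.__init__(self)
        self.lines = lines      # instance attribute, always reached through self
        self.count = 0

    def run(self):
        # Methods of the same object are also called through self
        self.count = self.count_nonblank(self.lines)

    def count_nonblank(self, lines):
        return sum(1 for line in lines if line.strip())


worker = LineCounter(["a\n", "\n", "b\n"])
worker.start()
worker.join()
print(worker.count)  # 2
```

Writing `count_nonblank(self.lines)` without the `self.` prefix inside `run()` would raise a NameError, because Python does not look up bare names in the instance the way Java resolves an implicit `this`.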
total is 61111; 13937 boys; 4009 girls; 28942 secret; 14223 unkownsex;
cost time 1.25413238673 s
That is all for this article; we hope it is helpful for your studies.