Crawling a School BBS with Python to Count the Male/Female Ratio: Data Processing (Part 3)

2019-11-25 16:59:59

Source: reprinted

Contributed by: a reader

This article covers the data-processing stage of the project.

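1. Data Analysis

The crawler produced text files whose names start with a fixed set of prefixes, and these now need to be processed. The prefixes that appear in the code below are correct, errTime, notexist, unkownsex and httperror.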

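2. Rollback

The httperror records need to be processed again.

Because of how the code in part 2 of this series was written, the same id can appear several times in a row as httperror records in one file: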

// httperror265001_266001.txt
265002 httperror
265002 httperror
265002 httperror
265002 httperror
265003 httperror
265003 httperror
265003 httperror
265003 httperror

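The code therefore has to allow for this case: instead of processing the id on every line, it first checks whether the id repeats the previous one.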
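The repeated-id check can be sketched on its own. This is a minimal Python 3 sketch (the article's own listings target Python 2.7); the helper name `unique_consecutive_ids` is hypothetical, and it assumes every non-blank line has the form `<id> httperror`.

```python
from itertools import groupby


def unique_consecutive_ids(lines):
    """Yield each id once per run of consecutive lines that share it.

    Assumes every non-blank line looks like "<id> httperror".
    """
    # Skip blank lines, then take the first whitespace-separated field as the id.
    ids = (line.split()[0] for line in lines if line.strip())
    # groupby collapses each run of equal adjacent values into a single key.
    for user_id, _run in groupby(ids):
        yield int(user_id)


sample = [
    "265002 httperror\n",
    "265002 httperror\n",
    "265003 httperror\n",
    "265003 httperror\n",
]
print(list(unique_consecutive_ids(sample)))  # [265002, 265003]
```

Only adjacent duplicates are collapsed, which matches the situation described here; an id that reappears later in the file would be yielded again.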

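Java has buffered readers that avoid going to the disk for every read. Python has an equivalent as well (the original post linked to an article on this).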
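One thing the buffering remark may refer to: in Python 3 the built-in `open()` already returns a buffered file object, so iterating over it line by line reads from memory most of the time. The sketch below only illustrates that; the function name `iter_ids` and the 1 MiB buffer size are assumptions for the example, not something taken from the article.

```python
import os
import tempfile


def iter_ids(path, buffer_size=1024 * 1024):
    """Yield the first field of each non-blank line, reading through a large buffer."""
    # buffering=<n> asks open() for an n-byte buffer underneath the text layer.
    with open(path, "r", encoding="utf-8", buffering=buffer_size) as handle:
        for line in handle:
            fields = line.split()
            if fields:
                yield fields[0]


# Usage on a throwaway file.
with tempfile.TemporaryDirectory() as tmp:
    demo = os.path.join(tmp, "httperror_demo.txt")
    with open(demo, "w", encoding="utf-8") as out:
        out.write("265002 httperror\n265003 httperror\n")
    print(list(iter_ids(demo)))  # ['265002', '265003']
```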

# Python 2.7
import os, re, sys

def main():
    reload(sys)
    sys.setdefaultencoding('utf-8')
    global sexRe, timeRe, notexistRe, url1, url2, file1, file2, file3, file4, startNum, endNum, file5
    sexRe = re.compile(u'em>\u6027\u522b</em>(.*?)</li')
    timeRe = re.compile(u'em>\u4e0a\u6b21\u6d3b\u52a8\u65f6\u95f4</em>(.*?)</li')
    notexistRe = re.compile(u'(p>)\u62b1\u6b49\uff0c\u60a8\u6307\u5b9a\u7684\u7528\u6237\u7a7a\u95f4\u4e0d\u5b58\u5728<')
    url1 = 'http://rs.xidian.edu.cn/home.php?mod=space&uid=%s'
    url2 = 'http://rs.xidian.edu.cn/home.php?mod=space&uid=%s&do=profile'
    file1 = 'ruisi//correct_re.txt'
    file2 = 'ruisi//errTime_re.txt'
    file3 = 'ruisi//notexist_re.txt'
    file4 = 'ruisi//unkownsex_re.txt'
    file5 = 'ruisi//httperror_re.txt'
    # Walk every file in the folder whose name starts with httperror
    for filename in os.listdir(r'E:/pythonProject/ruisi'):
        if filename.startswith('httperror'):
            count = 0
            newName = 'E://pythonProject//ruisi//%s' % (filename)
            readFile = open(newName, 'r')
            oldLine = '0'
            for line in readFile:
                # newLine is compared with the previous line to detect a repeated id
                newLine = line
                if (newLine != oldLine):
                    nu = newLine.split()[0]
                    oldLine = newLine
                    count += 1
                    # searchWeb() is the fetch function from part 2 of this series
                    searchWeb((int(nu),))
            readFile.close()
            print "%s deal %s lines" % (filename, count)
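To make the gender pattern easier to follow, here is a small Python 3 sketch of what `sexRe` is meant to match: the text that follows the label 性别 ("gender") inside a list item. The HTML fragment is made up for the example, and `extract_sex` is a hypothetical helper, not part of the article's code.

```python
import re

# The sexRe pattern, written with \u escapes (\u6027\u522b is the label "gender").
SEX_RE = re.compile("em>\u6027\u522b</em>(.*?)</li")


def extract_sex(html):
    """Return the text after the gender label, or None if the label is absent."""
    match = SEX_RE.search(html)
    return match.group(1) if match else None


# Hypothetical profile fragment, invented for this example.
page = "<ul><li><em>\u6027\u522b</em>\u7537</li></ul>"
print(extract_sex(page) == "\u7537")  # True (\u7537 is "male")
print(extract_sex("<p>no profile here</p>"))  # None
```

The non-greedy `(.*?)` stops at the first `</li`, so only the value belonging to that list item is captured.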

To keep things simple, this code does not classify the httperror ids any further; the results are stored directly in the following five files:

file1 = 'ruisi//correct_re.txt'
file2 = 'ruisi//errTime_re.txt'
file3 = 'ruisi//notexist_re.txt'
file4 = 'ruisi//unkownsex_re.txt'
file5 = 'ruisi//httperror_re.txt'

The output log shows how many httperror records were processed in each file:

"D:/Program Files/Python27/python.exe" E:/pythonProject/webCrawler/reload.py
httperror132001-133001.txt deal 21 lines
httperror2001-3001.txt deal 4 lines
httperror251001-252001.txt deal 5 lines
httperror254001-255001.txt deal 1 lines

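3. Counting the unkownsex Data in a Single Thread

The code is simple: a single thread counts the unkownsex users, that is, users whose gender could not be read for permission reasons or who never filled it in. We also checked that users without a gender have no activity time either.

The data format is as follows: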

253042 unkownsex
253087 unkownsex
253102 unkownsex
253118 unkownsex
253125 unkownsex
253136 unkownsex
253161 unkownsex

The code is as follows:

# Python 2.7
import os, time

sumCount = 0
startTime = time.clock()
for filename in os.listdir(r'E:/pythonProject/ruisi'):
    if filename.startswith('unkownsex'):
        count = 0
        newName = 'E://pythonProject//ruisi//%s' % (filename)
        # Open the file once; the with block closes it afterwards
        with open(newName, 'r') as readFile:
            for line in readFile:
                count += 1
                sumCount += 1
        print "%s deal %s lines" % (filename, count)
print '%s unkowns sex' % (sumCount)
endTime = time.clock()
print "cost time " + str(endTime - startTime) + " s"

Processing is fast. The output is:

unkownsex1-1001.txt deal 204 lines
unkownsex100001-101001.txt deal 50 lines
unkownsex10001-11001.txt deal 206 lines
#... intermediate output omitted
unkownsex99001-100001.txt deal 56 lines
unkownsex_re.txt deal 1085 lines
14223 unkowns sex
cost time 0.0813142301261 s
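The same per-file line count can be written as a small reusable function. This is a Python 3 sketch under assumptions: `count_lines_by_prefix` is a hypothetical name, the files are assumed to be UTF-8 text, and the demo runs against a temporary directory rather than the article's E: drive folder.

```python
import os
import tempfile


def count_lines_by_prefix(directory, prefix):
    """Return {filename: line_count} for files in directory whose name starts with prefix."""
    counts = {}
    for name in sorted(os.listdir(directory)):
        if not name.startswith(prefix):
            continue
        with open(os.path.join(directory, name), "r", encoding="utf-8") as handle:
            # Summing 1 per line counts the lines without keeping them in memory.
            counts[name] = sum(1 for _ in handle)
    return counts


with tempfile.TemporaryDirectory() as tmp:
    with open(os.path.join(tmp, "unkownsex1-1001.txt"), "w", encoding="utf-8") as f:
        f.write("253042 unkownsex\n253087 unkownsex\n")
    with open(os.path.join(tmp, "correct1-1001.txt"), "w", encoding="utf-8") as f:
        f.write("31024 x 2014-11-11 13:20\n")
    result = count_lines_by_prefix(tmp, "unkownsex")
    print(result, sum(result.values()))  # {'unkownsex1-1001.txt': 2} 2
```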

4. Counting the correct Data in a Single Thread

The data format is as follows:

31024 男 2014-11-11 13:20
31283 男 2013-3-25 19:41
31340 保密 2015-2-2 15:17
31427 保密 2014-8-10 09:17
31475 保密 2013-7-2 08:59
31554 保密 2014-10-17 17:02
31621 男 2015-5-16 19:27
31872 保密 2015-1-11 16:49
31915 保密 2014-5-4 11:01
31997 保密 2015-5-16 20:14

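The code is below. The idea is to read the files line by line and use line.split() to pick out the gender field. sumCount holds the total number of users, while boycount, girlcount and secretcount hold the numbers of male (男), female (女) and undisclosed (保密) users. The gender field is again compared against unicode literals.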

# Python 2.7
import os, sys, time

reload(sys)
sys.setdefaultencoding('utf-8')
startTime = time.clock()
sumCount = 0
boycount = 0
girlcount = 0
secretcount = 0
for filename in os.listdir(r'E:/pythonProject/ruisi'):
    if filename.startswith('correct'):
        newName = 'E://pythonProject//ruisi//%s' % (filename)
        readFile = open(newName, 'r')
        for line in readFile:
            # Second field is the gender
            sexInfo = line.split()[1]
            sumCount += 1
            if sexInfo == u'\u7537':
                boycount += 1
            elif sexInfo == u'\u5973':
                girlcount += 1
            elif sexInfo == u'\u4fdd\u5bc6':
                secretcount += 1
        readFile.close()
        print "until %s, sum is %s boys; %s girls; %s secret;" % (filename, boycount, girlcount, secretcount)
print "total is %s; %s boys; %s girls; %s secret;" % (sumCount, boycount, girlcount, secretcount)
endTime = time.clock()
print "cost time " + str(endTime - startTime) + " s"

Note that each line of output is the running total up to that file, not the count for that file alone. The output is:

until correct1-1001.txt, sum is 110 boys; 7 girls; 414 secret;
until correct100001-101001.txt, sum is 125 boys; 13 girls; 542 secret;
#... omitted
until correct99001-100001.txt, sum is 11070 boys; 3113 girls; 26636 secret;
until correct_re.txt, sum is 13937 boys; 4007 girls; 28941 secret;
total is 46885; 13937 boys; 4007 girls; 28941 secret;
cost time 3.60047888495 s
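The three hand-maintained counters can also be replaced by a `collections.Counter`. This is a Python 3 sketch, not the article's code: `tally_sex` is a hypothetical helper, and skipping lines with fewer than two fields is an added assumption (the listing above would raise an IndexError on such a line).

```python
from collections import Counter

# Labels as they appear in the data: male, female, undisclosed.
MALE, FEMALE, SECRET = "\u7537", "\u5973", "\u4fdd\u5bc6"


def tally_sex(lines):
    """Count the second whitespace-separated field of each well-formed line."""
    counts = Counter()
    for line in lines:
        fields = line.split()
        if len(fields) >= 2:  # skip blank or truncated lines instead of raising
            counts[fields[1]] += 1
    return counts


sample = [
    "31024 \u7537 2014-11-11 13:20\n",
    "31283 \u7537 2013-3-25 19:41\n",
    "31340 \u4fdd\u5bc6 2015-2-2 15:17\n",
]
counts = tally_sex(sample)
print(counts[MALE], counts[FEMALE], counts[SECRET])  # 2 0 1
```

A Counter returns 0 for a label it has never seen, so no label needs to be initialised in advance.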

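5. Counting with Multiple Threads

To count faster, we can use multiple threads. The single-threaded time measured above serves as the baseline for comparison.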

# encoding: UTF-8
# Python 2.7
import threading
import time, os, sys

# Global counters
SUM = 0
BOY = 0
GIRL = 0
SECRET = 0
NUM = 0

# This class inherits from threading.Thread, overrides run(), and is launched with start()
# This is much like Java
class StaFileList(threading.Thread):
    # List of file names
    fileList = []

    def __init__(self, fileList):
        threading.Thread.__init__(self)
        self.fileList = fileList

    def run(self):
        global SUM, BOY, GIRL, SECRET
        # Adding a delay here makes the interleaving more visible, instead of thread-1,2,3 in order
        # time.sleep(1)
        # acquire the lock
        if mutex.acquire(1):
            self.staFiles(self.fileList)
            # release the lock
            mutex.release()

    # Process the given list of files and count males and females
    # Mind the data synchronization here; global gives access to the module-level counters
    def staFiles(self, files):
        global SUM, BOY, GIRL, SECRET
        for name in files:
            newName = 'E://pythonProject//ruisi//%s' % (name)
            readFile = open(newName, 'r')
            for line in readFile:
                sexInfo = line.split()[1]
                SUM += 1
                if sexInfo == u'\u7537':
                    BOY += 1
                elif sexInfo == u'\u5973':
                    GIRL += 1
                elif sexInfo == u'\u4fdd\u5bc6':
                    SECRET += 1
            readFile.close()
            # print "thread %s, until %s, total is %s; %s boys; %s girls;" \
            #       " %s secret;" % (self.name, name, SUM, BOY, GIRL, SECRET)

def test():
    # files holds one batch of file names; the batch size sets how many files one thread handles
    files = []
    # Keep every thread so the main thread can wait for all of them to finish
    staThreads = []
    i = 0
    for filename in os.listdir(r'E:/pythonProject/ruisi'):
        if filename.startswith('correct'):
            files.append(filename)
            i += 1
            # One thread handles 20 files
            if i == 20:
                staThreads.append(StaFileList(files))
                files = []
                i = 0
    # The leftover files, most likely fewer than 20
    if files:
        staThreads.append(StaFileList(files))
    for t in staThreads:
        t.start()
    # Wait in the main thread for all child threads to exit;
    # without this the totals could be printed before the threads finish
    for t in staThreads:
        t.join()

if __name__ == '__main__':
    reload(sys)
    sys.setdefaultencoding('utf-8')
    startTime = time.clock()
    mutex = threading.Lock()
    test()
    print "Multi Thread, total is %s; %s boys; %s girls; %s secret;" % (SUM, BOY, GIRL, SECRET)
    endTime = time.clock()
    print "cost time " + str(endTime - startTime) + " s"

Output:

Multi Thread, total is 46885; 13937 boys; 4007 girls; 28941 secret;
cost time 0.132137192794 s

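We found the running time to be about the same as the single-threaded version. (The 3.60 s measured in section 4 comes from a run that also prints one line per file, so it is not directly comparable with the figure here.) Synchronization is the reason threading does not help much: each thread holds the lock for its entire run, so the threads effectively execute one after another, and acquiring and releasing the lock as well as saving and restoring state on every thread switch all cost time.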
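One common way to shorten the time spent inside the lock is to let each thread count into a private Counter and take the lock only to merge its result. The sketch below is Python 3 and is not the article's code; `count_batch`, `totals` and the sample rows (with the placeholder labels m, f, s) are made up for illustration. It makes no claim about speed: in CPython the global interpreter lock still prevents the parsing itself from running in parallel.

```python
import threading
from collections import Counter

totals = Counter()
totals_lock = threading.Lock()


def count_batch(lines):
    """Tally one batch without the lock, then merge the result under the lock."""
    local = Counter()
    for line in lines:
        fields = line.split()
        if len(fields) >= 2:
            local[fields[1]] += 1
    # The lock is held only for this short merge, not for the whole batch.
    with totals_lock:
        totals.update(local)


batches = [
    ["1 m 2014-01-01 10:00", "2 f 2014-01-02 11:00"],
    ["3 m 2014-01-03 12:00"],
    ["4 s 2014-01-04 13:00", "5 m 2014-01-05 14:00"],
]
threads = [threading.Thread(target=count_batch, args=(batch,)) for batch in batches]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(sorted(totals.items()))  # [('f', 1), ('m', 3), ('s', 1)]
```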

6. Single-Threaded versus Multithreaded on More Data

We can process the correct, errTime and unkownsex files together.
Single-threaded code:

# coding=utf-8
# Python 2.7
import os, sys, time

reload(sys)
sys.setdefaultencoding('utf-8')
startTime = time.clock()
sumCount = 0
boycount = 0
girlcount = 0
secretcount = 0
unkowncount = 0
for filename in os.listdir(r'E:/pythonProject/ruisi'):
    newName = 'E://pythonProject//ruisi//%s' % (filename)
    # correct: gender and activity time; errTime: gender but no activity time.
    # Both are counted the same way.
    if filename.startswith('correct') or filename.startswith('errTime'):
        readFile = open(newName, 'r')
        for line in readFile:
            sexInfo = line.split()[1]
            sumCount += 1
            if sexInfo == u'\u7537':
                boycount += 1
            elif sexInfo == u'\u5973':
                girlcount += 1
            elif sexInfo == u'\u4fdd\u5bc6':
                secretcount += 1
        readFile.close()
        # print "until %s, sum is %s boys; %s girls; %s secret;" % (filename, boycount, girlcount, secretcount)
    # No gender and no time: just count the lines
    elif filename.startswith('unkownsex'):
        # count = len(open(newName, 'rU').readlines())
        # For large files use a loop. count starts at -1 so that an empty file
        # ends up with 0 lines after the +1 below
        count = -1
        for count, line in enumerate(open(newName, 'rU')):
            pass
        count += 1
        unkowncount += count
        sumCount += count
        # print "until %s, sum is %s unkownsex" % (filename, unkowncount)
print "Single Thread, total is %s; %s boys; %s girls; %s secret; %s unkownsex;" % (sumCount, boycount, girlcount, secretcount, unkowncount)
endTime = time.clock()
print "cost time " + str(endTime - startTime) + " s"

Output:

Single Thread, total is 61111; 13937 boys; 4009 girls; 28942 secret; 14223 unkownsex;
cost time 1.37444645628 s
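As a quick sanity check, the figures quoted in the outputs of this article add up. The sketch below only does arithmetic on those published numbers; reading the difference between the two runs as the errTime users assumes both runs saw the same correct files. The male share is computed among users who disclosed a gender.

```python
# Figures copied from the outputs quoted in this article.
correct_only = {"boys": 13937, "girls": 4007, "secret": 28941}  # section 4
all_files = {"boys": 13937, "girls": 4009, "secret": 28942, "unkownsex": 14223}  # section 6

total_correct = sum(correct_only.values())
total_all = sum(all_files.values())
# Users that have a gender but no activity time (the errTime files), by difference.
err_time_users = sum(all_files[key] - correct_only[key] for key in correct_only)
# Share of males among users who disclosed male or female.
male_share = all_files["boys"] / (all_files["boys"] + all_files["girls"])

print(total_correct)         # 46885
print(total_all)             # 61111
print(err_time_users)        # 3
print(round(male_share, 3))  # 0.777
```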

Multithreaded code:

# encoding: UTF-8
# Python 2.7
# Multithreaded version
__author__ = 'admin'
import threading
import time, os, sys

# Global counters
SUM = 0
BOY = 0
GIRL = 0
SECRET = 0
UNKOWN = 0

class StaFileList(threading.Thread):
    # List of file names
    fileList = []

    def __init__(self, fileList):
        threading.Thread.__init__(self)
        self.fileList = fileList

    def run(self):
        global SUM, BOY, GIRL, SECRET
        if mutex.acquire(1):
            self.staManyFiles(self.fileList)
            mutex.release()

    # Process the given list of files and count males and females
    # Mind the data synchronization here
    def staCorrectFiles(self, files):
        global SUM, BOY, GIRL, SECRET
        for name in files:
            newName = 'E://pythonProject//ruisi//%s' % (name)
            readFile = open(newName, 'r')
            for line in readFile:
                sexInfo = line.split()[1]
                SUM += 1
                if sexInfo == u'\u7537':
                    BOY += 1
                elif sexInfo == u'\u5973':
                    GIRL += 1
                elif sexInfo == u'\u4fdd\u5bc6':
                    SECRET += 1
            readFile.close()
            # print "thread %s, until %s, total is %s; %s boys; %s girls;" \
            #       " %s secret;" % (self.name, name, SUM, BOY, GIRL, SECRET)

    def staManyFiles(self, files):
        global SUM, UNKOWN
        for name in files:
            # correct: gender and activity time; errTime: gender but no activity time.
            # Both have the gender in the second field, so they share one code path.
            if name.startswith('correct') or name.startswith('errTime'):
                self.staCorrectFiles([name])
            # No gender and no time: just count the lines
            elif name.startswith('unkownsex'):
                newName = 'E://pythonProject//ruisi//%s' % (name)
                # count = len(open(newName, 'rU').readlines())
                # For large files use a loop. count starts at -1 so that an empty file
                # ends up with 0 lines after the +1 below
                count = -1
                for count, line in enumerate(open(newName, 'rU')):
                    pass
                count += 1
                UNKOWN += count
                SUM += count
                # print "thread %s, until %s, total is %s; %s unkownsex" % (self.name, name, SUM, UNKOWN)

def test():
    files = []
    # Keep every thread so the main thread can wait for all of them to finish
    staThreads = []
    i = 0
    for filename in os.listdir(r'E:/pythonProject/ruisi'):
        # Create one thread for every 20 files collected
        if filename.startswith("correct") or filename.startswith("errTime") or filename.startswith("unkownsex"):
            files.append(filename)
            i += 1
            if i == 20:
                staThreads.append(StaFileList(files))
                files = []
                i = 0
    # The leftover files, most likely fewer than 20
    if files:
        staThreads.append(StaFileList(files))
    for t in staThreads:
        t.start()
    # Wait in the main thread for all child threads to exit
    for t in staThreads:
        t.join()

if __name__ == '__main__':
    reload(sys)
    sys.setdefaultencoding('utf-8')
    startTime = time.clock()
    mutex = threading.Lock()
    test()
    print "Multi Thread, total is %s; %s boys; %s girls; %s secret; %s unkownsex" % (SUM, BOY, GIRL, SECRET, UNKOWN)
    endTime = time.clock()
    print "cost time " + str(endTime - startTime) + " s"

Output:

Multi Thread, total is 61111; 13937 boys; 4009 girls; 28942 secret;
cost time 1.23049112201 s
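The multithreaded version comes out slightly ahead of the single-threaded one here (about 1.23 s versus 1.37 s), and because access to the counters is synchronized, the totals are identical.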
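The batching that test() does by hand with the counter i (one thread per 20 files, plus one for the remainder) can be isolated into a small helper. This is a Python 3 sketch; `chunked` is a hypothetical name and the file names are invented.

```python
def chunked(items, size):
    """Split items into consecutive lists of at most `size` elements."""
    if size < 1:
        raise ValueError("size must be at least 1")
    return [items[i:i + size] for i in range(0, len(items), size)]


names = ["correct%d.txt" % n for n in range(45)]
batches = chunked(names, 20)
print([len(batch) for batch in batches])  # [20, 20, 5]
```

Each batch would then be handed to one worker thread, as StaFileList(files) does in the listing.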

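Note that inside a Python class, attributes and methods usually have to be accessed through self, which differs markedly from Java, where this is optional.

def __init__(self, fileList):
    threading.Thread.__init__(self)
    self.fileList = fileList

def run(self):
    global SUM, BOY, GIRL, SECRET
    if mutex.acquire(1):
        # Calling a method of the same class requires self
        self.staFiles(self.fileList)
        mutex.release()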
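A self-contained toy class shows the same point without the threading machinery. This is a Python 3 sketch; the class `FileBatch` and its methods are invented for the example.

```python
class FileBatch:
    """Toy class showing instance state and method calls through self."""

    def __init__(self, file_list):
        # Instance attribute: each object gets its own copy of the list.
        self.file_list = list(file_list)

    def count(self):
        return len(self.file_list)

    def describe(self):
        # Calling another method of the same object also goes through self.
        return "%d files, first: %s" % (self.count(), self.file_list[0])


batch = FileBatch(["correct1-1001.txt", "correct1001-2001.txt"])
print(batch.describe())  # 2 files, first: correct1-1001.txt
```

Writing `count()` instead of `self.count()` inside describe() would raise a NameError, because Python does not look up bare names on the instance.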

total is 61111; 13937 boys; 4009 girls; 28942 secret; 14223 unkownsex;
cost time 1.25413238673 s

That is all for this article; I hope it helps with your studies.