Implementing a Python Crawler to Count the Gender Ratio on a School BBS: Multithreaded Crawler (Part 2)

2019-11-25 17:00:02

Source: reposted

Contributed by: a site user

This post picks up where Part 1 left off.

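I. Data categories

Correct data: the id, gender, and last-active time are all present.

Stored in file1 = 'ruisi//correct%s-%s.txt' % (startNum, endNum)

Record format: 293001 男 2015-5-1 19:17 (男 = male)

No time: the id and gender are present, but there is no last-active time.

Stored in file2 = 'ruisi//errTime%s-%s.txt' % (startNum, endNum)

Record format: 2566 女 notime (女 = female)

User does not exist: no user corresponds to this id.

Stored in file3 = 'ruisi//notexist%s-%s.txt' % (startNum, endNum)

Record format: 29005 notexist

Unknown gender: the id exists, but the gender cannot be read from the page (on inspection, these users have no last-active time either).

Stored in file4 = 'ruisi//unkownsex%s-%s.txt' % (startNum, endNum)

Record format: 221794 unkownsex

Network error: the connection dropped or the server failed, so these ids need to be checked again.

Stored in file5 = 'ruisi//httperror%s-%s.txt' % (startNum, endNum)

Record format: 271004 httperror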
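The five categories above can be summarised as one decision function. This is only a sketch written for Python 3 (the article's own listings are Python 2); the function name `classify` and its parameters are hypothetical and do not appear in the original code. It takes the three values the crawler tries to extract and returns the category together with the line that would be written.

```python
def classify(user_id, sex, last_active, not_exist):
    """Return (category, line) for one crawled id.

    Category names mirror the five output files; each argument is the
    extracted text, None when nothing matched, or 'httperror' when the
    request failed. Hypothetical helper, not part of the original code.
    """
    if "httperror" in (sex, last_active, not_exist):
        return "httperror", "%d httperror\n" % user_id
    if sex:
        if last_active:
            return "correct", "%d %s %s\n" % (user_id, sex, last_active)
        return "errTime", "%d %s notime\n" % (user_id, sex)
    if not_exist:
        return "notexist", "%d notexist\n" % user_id
    return "unkownsex", "%d unkownsex\n" % user_id
```

Keeping the decision separate from the file writes makes each category easy to check without touching the network.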

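How to crawl without interruption

One design goal of this project is to keep crawling without stopping. If the crawler exited every time the network dropped or the BBS server failed, we would have to resume from the point of failure or, worse, start over from the beginning.
So my approach is to record the affected ids whenever a failure occurs, and to re-crawl the gender for those ids only after a full pass has finished.
Part 1 of this series introduced getInfo(myurl, seWord), which fetches a given link and extracts information with a given regular expression.
That function can be used to look up both the gender and the last-active time.
We now define a safe fetch function that never interrupts the program, using try/except exception handling.

The code calls getInfo(myurl, seWord) up to twice. If the second call also raises an exception, the id is saved to file5.
If the information can be fetched, it is returned.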

file5 = 'ruisi//httperror%s-%s.txt' % (startNum, endNum)

def safeGet(myid, myurl, seWord):
  # First attempt
  try:
    return getInfo(myurl, seWord)
  except Exception:
    # Second attempt
    try:
      return getInfo(myurl, seWord)
    except Exception:
      # Both attempts failed: record the id so it can be re-crawled later
      httperrorfile = open(file5, 'a')
      info = '%d %s\n' % (myid, 'httperror')
      httperrorfile.write(info)
      httperrorfile.close()
      return 'httperror'
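The same retry idea can be written with a loop instead of nested try blocks. The following is a minimal sketch for Python 3, not the author's code: `safe_get`, `fetch`, `failed_ids`, and `attempts` are hypothetical names, and `fetch` stands in for a call such as getInfo(myurl, seWord).

```python
def safe_get(fetch, user_id, failed_ids, attempts=2):
    """Call fetch() up to `attempts` times without ever raising.

    On total failure the id is added to the set `failed_ids` and the
    marker string 'httperror' is returned.
    """
    for _ in range(attempts):
        try:
            return fetch()
        except Exception:
            continue  # try again, or fall through after the last attempt
    failed_ids.add(user_id)
    return "httperror"
```

Collecting failures in a set, rather than appending to a file on every call, means an id is recorded only once however many requests for it fail.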

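Iterating over the ids in [1, 300,000] to fetch user information

We define a function whose logic is: fetch sex and time; if sex is present, go on to check whether time is present; if sex is missing, determine whether the user does not exist or the gender simply cannot be scraped.

The function also has to cope with a dropped connection or a BBS server failure.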

url1 = 'http://rs.xidian.edu.cn/home.php?mod=space&uid=%s'
url2 = 'http://rs.xidian.edu.cn/home.php?mod=space&uid=%s&do=profile'

def searchWeb(idArr):
  for id in idArr:
    sexUrl = url1 % (id)  # replace %s with the id
    timeUrl = url2 % (id)
    sex = safeGet(id, sexUrl, sexRe)
    if not sex:  # gender not found at sexUrl, so try timeUrl as well
      sex = safeGet(id, timeUrl, sexRe)
    time = safeGet(id, timeUrl, timeRe)
    # an httperror means this id has to be re-crawled later
    if (sex == 'httperror') or (time == 'httperror'):
      pass
    else:
      if sex:
        info = '%d %s' % (id, sex)
        if time:
          info = '%s %s\n' % (info, time)
          wfile = open(file1, 'a')
          wfile.write(info)
          wfile.close()
        else:
          info = '%s %s\n' % (info, 'notime')
          errtimefile = open(file2, 'a')
          errtimefile.write(info)
          errtimefile.close()
      else:
        # sex is None here, so check whether the user does not exist
        # when the network is down, this extra check leads to 4 duplicate httperror entries
        # the gender may simply be unknown because the user never filled it in
        notexist = safeGet(id, sexUrl, notexistRe)
        if notexist == 'httperror':
          pass
        else:
          if notexist:
            notexistfile = open(file3, 'a')
            info = '%d %s\n' % (id, 'notexist')
            notexistfile.write(info)
            notexistfile.close()
          else:
            unkownsexfile = open(file4, 'a')
            info = '%d %s\n' % (id, 'unkownsex')
            unkownsexfile.write(info)
            unkownsexfile.close()

A later check turned up a problem here.

sex = safeGet(id, sexUrl, sexRe)
if not sex:
  sex = safeGet(id, timeUrl, sexRe)
time = safeGet(id, timeUrl, timeRe)

When the network is down, this code calls safeGet up to three times, and every failed call appends another httperror line for the same id:

251538 httperror
251538 httperror
251538 httperror
251538 httperror
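Because the same id can appear several times in the httperror file, the list of ids to re-crawl should be de-duplicated before the second pass. A minimal sketch for Python 3, assuming one "<id> httperror" record per line; the helper name `ids_to_retry` is hypothetical.

```python
def ids_to_retry(lines):
    """Return the unique ids found in httperror records, in first-seen order."""
    seen = set()
    ordered = []
    for line in lines:
        parts = line.split()
        # Skip blank or malformed lines instead of failing on them
        if len(parts) == 2 and parts[0].isdigit() and parts[1] == "httperror":
            user_id = int(parts[0])
            if user_id not in seen:
                seen.add(user_id)
                ordered.append(user_id)
    return ordered
```

An open file object can be passed directly as `lines`, since iterating over a file yields one line at a time.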

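Crawling with multiple threads?

The data statistics step can use multiple threads, because the data lives in several independent text files.

1. Introduction to Popen

Popen lets you customize standard input, standard output, and standard error. When I interned at SAP, the project team used Popen a lot on Linux, probably because it makes redirecting output convenient.

The code below borrows from that team's approach. Popen can run system commands, and each call launches a separate child process (what this article loosely calls a "thread"). The three communicate() calls in a row wait for all three workers to finish.

A puzzle?
In my experiments, the three communicate() calls have to sit next to each other, after all three Popen calls, for the three workers to run at the same time and for the script to then wait for all of them. The reason is that Popen starts its process immediately, whereas communicate() blocks until that process exits; a communicate() placed between two Popen calls would delay the next launch until the previous worker had finished.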

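from subprocess import Popen
import subprocess

# s0..s3 are the boundaries of three consecutive id ranges
p1 = Popen(['python', 'ruisi.py', str(s0), str(s1)], bufsize=10000, stdout=subprocess.PIPE)
p2 = Popen(['python', 'ruisi.py', str(s1), str(s2)], bufsize=10000, stdout=subprocess.PIPE)
p3 = Popen(['python', 'ruisi.py', str(s2), str(s3)], bufsize=10000, stdout=subprocess.PIPE)
# all three workers are already running; now wait for each to finish
p1.communicate()
p2.communicate()
p3.communicate()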
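The start-everything-then-wait pattern generalises to any number of workers. The sketch below is written for Python 3 and is not the author's code; `run_in_parallel` is a hypothetical helper, and the demo commands merely print a number instead of running ruisi.py.

```python
import subprocess
import sys


def run_in_parallel(commands):
    """Start every command first, then wait for all of them to exit.

    Returns the captured standard output (bytes) of each command, in order.
    """
    procs = [subprocess.Popen(cmd, stdout=subprocess.PIPE) for cmd in commands]
    # communicate() blocks until its process exits, so it is only called
    # once every process has been launched.
    return [proc.communicate()[0] for proc in procs]


if __name__ == "__main__":
    # Stand-in workers: each one just prints its own number
    demo = [[sys.executable, "-c", "print(%d)" % n] for n in (1, 2, 3)]
    for output in run_in_parallel(demo):
        print(output.strip())
```

Using sys.executable rather than the bare string 'python' runs the workers under the same interpreter as the launcher.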

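2. Define a single-threaded crawler

Usage: python ruisi.py <startNum> <endNum>

This script crawls the ids in [startNum, endNum) and writes the results to the corresponding text files. It is a single-threaded program; to get parallelism, run several copies of it from an outer script.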

# ruisi.py
# coding=utf-8
import urllib2, re, sys, threading, time, thread

# myurl: the target link
# seWord: a compiled regular expression, written as unicode
# Returns the text matched by the regular expression, or None
def getInfo(myurl, seWord):
  headers = {
    'User-Agent': 'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.1.6) Gecko/20091201 Firefox/3.5.6'
  }
  req = urllib2.Request(
    url=myurl,
    headers=headers
  )
  time.sleep(0.3)
  response = urllib2.urlopen(req)
  html = response.read()
  html = unicode(html, 'utf-8')
  timeMatch = seWord.search(html)
  if timeMatch:
    s = timeMatch.groups()
    return s[0]
  else:
    return None

# Try getInfo() twice
# After the second failure, mark this id as httperror
def safeGet(myid, myurl, seWord):
  try:
    return getInfo(myurl, seWord)
  except Exception:
    try:
      return getInfo(myurl, seWord)
    except Exception:
      httperrorfile = open(file5, 'a')
      info = '%d %s\n' % (myid, 'httperror')
      httperrorfile.write(info)
      httperrorfile.close()
      return 'httperror'

# Takes an id range idArr, for example [1, 1001)
def searchWeb(idArr):
  for id in idArr:
    sexUrl = url1 % (id)
    timeUrl = url2 % (id)
    sex = safeGet(id, sexUrl, sexRe)
    if not sex:
      sex = safeGet(id, timeUrl, sexRe)
    time = safeGet(id, timeUrl, timeRe)
    if (sex == 'httperror') or (time == 'httperror'):
      pass
    else:
      if sex:
        info = '%d %s' % (id, sex)
        if time:
          info = '%s %s\n' % (info, time)
          wfile = open(file1, 'a')
          wfile.write(info)
          wfile.close()
        else:
          info = '%s %s\n' % (info, 'notime')
          errtimefile = open(file2, 'a')
          errtimefile.write(info)
          errtimefile.close()
      else:
        notexist = safeGet(id, sexUrl, notexistRe)
        if notexist == 'httperror':
          pass
        else:
          if notexist:
            notexistfile = open(file3, 'a')
            info = '%d %s\n' % (id, 'notexist')
            notexistfile.write(info)
            notexistfile.close()
          else:
            unkownsexfile = open(file4, 'a')
            info = '%d %s\n' % (id, 'unkownsex')
            unkownsexfile.write(info)
            unkownsexfile.close()

def main():
  reload(sys)
  sys.setdefaultencoding('utf-8')
  if len(sys.argv) != 3:
    print 'usage: python ruisi.py <startNum> <endNum>'
    sys.exit(-1)
  global sexRe,timeRe,notexistRe,url1,url2,file1,file2,file3,file4,startNum,endNum,file5
  startNum = int(sys.argv[1])
  endNum = int(sys.argv[2])
  sexRe = re.compile(u'em>\u6027\u522b</em>(.*?)</li')
  timeRe = re.compile(u'em>\u4e0a\u6b21\u6d3b\u52a8\u65f6\u95f4</em>(.*?)</li')
  notexistRe = re.compile(u'(p>)\u62b1\u6b49\uff0c\u60a8\u6307\u5b9a\u7684\u7528\u6237\u7a7a\u95f4\u4e0d\u5b58\u5728<')
  url1 = 'http://rs.xidian.edu.cn/home.php?mod=space&uid=%s'
  url2 = 'http://rs.xidian.edu.cn/home.php?mod=space&uid=%s&do=profile'
  file1 = '..//newRuisi//correct%s-%s.txt' % (startNum, endNum)
  file2 = '..//newRuisi//errTime%s-%s.txt' % (startNum, endNum)
  file3 = '..//newRuisi//notexist%s-%s.txt' % (startNum, endNum)
  file4 = '..//newRuisi//unkownsex%s-%s.txt' % (startNum, endNum)
  file5 = '..//newRuisi//httperror%s-%s.txt' % (startNum, endNum)
  searchWeb(xrange(startNum, endNum))
  # numThread = 10
  # searchWeb(xrange(endNum))
  # total = 0
  # for i in xrange(numThread):
  #   data = xrange(1+i, endNum, numThread)
  #   total += len(data)
  #   t = threading.Thread(target=searchWeb, args=(data,))
  #   t.start()
  # print total

main()
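To see what the compiled patterns do, the gender regex can be tried on a small piece of markup. This is a sketch for Python 3; the `sample` HTML is a made-up assumption about the page layout, not markup copied from the BBS, and `extract` is a hypothetical helper that mirrors the matching step of getInfo without any network access.

```python
import re

# Same pattern as sexRe in ruisi.py: the label for "gender", then the value
SEX_RE = re.compile(u"em>\u6027\u522b</em>(.*?)</li")


def extract(pattern, html):
    """Return the first captured group, or None when nothing matches."""
    match = pattern.search(html)
    return match.group(1) if match else None


# Hypothetical profile fragment; the real page may be structured differently
sample = u"<ul><li><em>\u6027\u522b</em>\u7537</li></ul>"
gender = extract(SEX_RE, sample)
```

The non-greedy group (.*?) stops at the first closing </li, so only the value next to the label is captured.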

Multithreaded crawler

Code

# coding=utf-8
from subprocess import Popen
import subprocess
import threading,time

startn = 1
endn = 300001
step =1000
total = (endn - startn + 1 ) /step
ISOTIMEFORMAT='%Y-%m-%d %X'
#hardcode 3 threads
#
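The listing above breaks off after its setup variables, and the rest of the author's driver is not available. Purely as an illustration of how startn, endn, and step could be turned into work units, here is a sketch for Python 3 that splits the id space into half-open ranges matching ruisi.py's [startNum, endNum) convention. The helper name `split_ranges` is hypothetical.

```python
def split_ranges(start, end, step):
    """Yield half-open (lo, hi) pairs that together cover [start, end)."""
    for lo in range(start, end, step):
        yield lo, min(lo + step, end)


# With the values from the listing: 300 chunks of 1000 ids each
chunks = list(split_ranges(1, 300001, 1000))
```

Each pair could then be passed as the two command-line arguments of one ruisi.py worker.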
