Python: Scraping Website Titles from Baidu Search Engine Results Pages (SERPs)

2019-11-14 17:45:32


Source: Reposted

Contributed by: Netizen

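For example, suppose you want to collect the SERP results whose titles contain “58同城”, and filter out any results whose titles contain “北京” (Beijing) or “廈門” (Xiamen).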

The Python script in this article does mainly that.
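The filtering rule described above can be sketched on its own, independent of any network access. This is a minimal Python 3 illustration, not the author's code: the function name keep_title, its parameters, and the sample titles are all made up for the example. The original script relies on Baidu's intitle: operator for the required keyword and only checks the excluded words locally, whereas this sketch checks both conditions locally.

```python
# Python 3 sketch; keep_title and its parameters are illustrative names.
def keep_title(title, required="58同城", excluded=("北京", "廈門")):
    """Return True if the title contains `required` and none of `excluded`."""
    if required not in title:
        return False
    return not any(word in title for word in excluded)


# Made-up sample titles, used only to exercise the filter.
titles = [
    "58同城 - 上海分類信息",
    "58同城 北京租房",
    "廈門58同城招聘",
    "趕集網",
]
kept = [t for t in titles if keep_title(t)]
print(kept)
```

Only the first sample title survives: the second and third contain an excluded word, and the fourth lacks the required keyword.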

The script uses BeautifulSoup to parse the HTML. For installation, see my other article: "Installing BeautifulSoup on Windows 8".
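To make concrete what "parsing the HTML" means here, namely pulling the text out of every h3 element, the following is a dependency-free Python 3 sketch that uses the standard library's html.parser in place of BeautifulSoup. It is a stand-in for illustration only: the class H3TitleParser, the helper extract_h3_titles, and the sample markup are assumptions, and real Baidu result pages may be structured differently.

```python
from html.parser import HTMLParser


class H3TitleParser(HTMLParser):
    """Collect the text content of every <h3> element (illustrative helper)."""

    def __init__(self):
        super().__init__()
        self.titles = []
        self._depth = 0   # greater than 0 while inside an <h3>
        self._parts = []  # text fragments of the current <h3>

    def handle_starttag(self, tag, attrs):
        if tag == "h3":
            self._depth += 1
            if self._depth == 1:
                self._parts = []

    def handle_endtag(self, tag):
        if tag == "h3" and self._depth > 0:
            self._depth -= 1
            if self._depth == 0:
                self.titles.append("".join(self._parts).strip())

    def handle_data(self, data):
        # Text inside nested tags such as <a> or <em> is included too.
        if self._depth > 0:
            self._parts.append(data)


def extract_h3_titles(html):
    parser = H3TitleParser()
    parser.feed(html)
    parser.close()
    return parser.titles


# Made-up markup that loosely imitates a result title wrapped in <h3>.
sample = ('<div><h3 class="t"><a href="#">58同城 <em>上海</em></a></h3>'
          '<p>x</p><h3>Second</h3></div>')
print(extract_h3_titles(sample))
```

With BeautifulSoup the same step is a loop over the h3 tags returned by the soup object; the sketch only spells out what that loop collects.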

The code is as follows (written for Python 2, since it uses urllib2 and the print statement):

# -*- coding: utf-8 -*-
# Collect the titles of SERP search results
__author__ = '曾是土木人'

import time
import urllib2

from bs4 import BeautifulSoup


# Append one line of text to a file
def WriteFile(fileName, content):
    try:
        fp = open(fileName, "a+")
        fp.write(content + "\n")
        fp.close()
    except IOError:
        pass


# Fetch the HTML source of a URL; returns an empty string on failure
def GetHtml(url):
    data = ""
    try:
        req = urllib2.Request(url)
        response = urllib2.urlopen(req, None, 3)  # 3-second timeout
        data = response.read().decode('utf-8', 'ignore')
    except Exception:
        pass
    return data


# Extract the titles of the SERP results
def FetchTitle(html):
    try:
        soup = BeautifulSoup(html, "html.parser")
        for i in soup.findAll("h3"):
            title = i.text.encode("utf-8")
            # Skip titles that contain any of the excluded words
            if any(str_ in title for str_ in ("北京", "廈門")):
                continue
            print title
            WriteFile("Result.txt", title)
    except Exception:
        pass


keyword = "58同城"

if __name__ == "__main__":
    start = time.time()
    # 8 pages of 100 results each: pn = 0, 100, ..., 700
    for i in range(0, 8):
        url = "http://www.baidu.com/s?wd=intitle:" + keyword + "&rn=100&pn=" + str(i * 100)
        html = GetHtml(url)
        FetchTitle(html)
        time.sleep(1)  # pause between requests
    c = time.time() - start
    print('Elapsed time: %0.2f s' % (c))
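The script builds each request URL by string concatenation, which leaves the Chinese keyword unescaped. As a hedged alternative, the Python 3 sketch below builds the same eight URLs with urllib.parse.urlencode so that the query is percent-encoded. The function name build_serp_urls and its parameters are illustrative; the wd, rn and pn parameters are taken from the script above, and whether Baidu still honors them as written is an assumption.

```python
from urllib.parse import urlencode


def build_serp_urls(keyword, pages=8, per_page=100):
    """Build one Baidu search URL per result page for an intitle: query."""
    urls = []
    for page in range(pages):
        query = urlencode({
            "wd": "intitle:" + keyword,  # restrict matches to page titles
            "rn": per_page,              # results per page, as in the script
            "pn": page * per_page,       # offset of the first result
        })
        urls.append("http://www.baidu.com/s?" + query)
    return urls


urls = build_serp_urls("58同城")
print(len(urls))
print(urls[0])
```

No request is sent here; the sketch only shows how the page offsets 0, 100, ..., 700 map onto the pn parameter.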