Contents
- 1 Researching the target website's background
  - 1.1 Checking robots.txt
  - 1.2 Checking the sitemap
  - 1.3 Estimating the size of the website
  - 1.4 Identifying the technology used by the website
  - 1.5 Finding the owner of the website
- 2 Your first web crawler
  - 2.1 Downloading a web page
    - Retrying downloads
    - Setting a user agent (user_agent)
  - 2.2 Crawling the sitemap
  - 2.3 Iterating over each page's database ID
  - 2.4 Following page links
- Advanced features
  - Parsing robots.txt
  - Proxy support (Proxy)
  - Throttling downloads
  - Avoiding spider traps
  - Final version
1 Researching the target website's background
1.1 Checking robots.txt
http://example.webscraping.com/robots.txt
```
# section 1
User-agent: BadCrawler
Disallow: /

# section 2
User-agent: *
Crawl-delay: 5
Disallow: /trap

# section 3
Sitemap: http://example.webscraping.com/sitemap.xml
```
- Section 1: forbids any crawler with the user agent BadCrawler from crawling the site, though a malicious crawler would simply ignore the rule.
- Section 2: requires a crawl delay of 5 seconds between two download requests, for all user agents. The /trap path is used to ban malicious crawlers; on this site the ban lasts at least one minute.
- Section 3: defines a Sitemap file, discussed in the next section.
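A minimal sketch for fetching robots.txt from a script rather than the browser, assuming only the standard urllib2 module used throughout this post:

```python
# Fetch and print the example site's robots.txt for inspection
import urllib2

print urllib2.urlopen('http://example.webscraping.com/robots.txt').read()
```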
所有網(wǎng)頁鏈接: http://example.webscraping.com/sitemap.xml
```xml
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>http://example.webscraping.com/view/Afghanistan-1</loc></url>
  <url><loc>http://example.webscraping.com/view/Aland-Islands-2</loc></url>
  ...
  <url><loc>http://example.webscraping.com/view/Zimbabwe-252</loc></url>
</urlset>
```
1.3 Estimating the size of the website
高級搜索參數(shù):http://www.google.com/advanced_search Google搜索:site:http://example.webscraping.com/ 有202個(gè)網(wǎng)頁 Google搜索:site:http://example.webscraping.com/view 有117個(gè)網(wǎng)頁
1.4 識(shí)別網(wǎng)站所有技術(shù)
The builtwith module can identify the type of technology a website is built with. Install it with: pip install builtwith
```python
>>> import builtwith
>>> builtwith.parse('http://example.webscraping.com')
{u'javascript-frameworks': [u'jQuery', u'Modernizr', u'jQuery UI'],
 u'web-frameworks': [u'Web2py', u'Twitter Bootstrap'],
 u'programming-languages': [u'Python'],
 u'web-servers': [u'Nginx']}
```
The example site is built with Python's Web2py framework and uses several JavaScript libraries, so the content is most likely embedded in the HTML and should be relatively easy to scrape. Other build types need different handling:
- AngularJS: content is loaded dynamically;
- ASP.NET: crawling requires session management and form submission (covered in Chapters 5 and 6).
1.5 尋找網(wǎng)站所有者
We can use the WHOIS protocol to look up the domain registrant. Documentation: https://pypi.python.org/pypi/python-whois Install: pip install python-whois
```python
>>> import whois
>>> print whois.whois('appspot.com')
{
  ......
  "name_servers": [
    "NS1.GOOGLE.COM",
    ...
    "ns2.google.com",
    "ns1.google.com"
  ],
  "org": "Google Inc.",
  "creation_date": [
    "2005-03-10 00:00:00",
    "2005-03-09T18:27:55-0800"
  ],
  "emails": [
    "abusecomplaints@markmonitor.com",
    "dns-admin@google.com"
  ]
}
```
The domain belongs to Google and is hosted on the Google App Engine service. Note: Google often blocks web crawlers!
2 第一個(gè)網(wǎng)絡(luò)爬蟲
There are many ways to crawl a website, and the most suitable method depends on the structure of the target site. This section first looks at how to download a web page safely (2.1), then introduces three ways to crawl a site:
- 2.2 crawling the sitemap;
- 2.3 iterating over each page's database ID;
- 2.4 following page links.
2.1 下載網(wǎng)頁
1.4.1download1.py
```python
# -*- coding: utf-8 -*-
import urllib2

def download1(url):
    """Simple downloader"""
    return urllib2.urlopen(url).read()

if __name__ == '__main__':
    print download1('https://www.baidu.com')
```
1.4.1download2.py
```python
def download2(url):
    """Download function that catches errors"""
    print 'Downloading:', url
    try:
        html = urllib2.urlopen(url).read()
    except urllib2.URLError as e:
        print 'Download error:', e.reason
        html = None
    return html
```
1. Retrying downloads
當(dāng)服務(wù)器過載返回503 Service Unavailable錯(cuò)誤,我們可以嘗試重新下載。如果是404 Not Found這種錯(cuò)誤,說明網(wǎng)頁目前并不存在,嘗試兩樣的請求也沒有。 1.4.1download3.py
```python
def download3(url, num_retries=2):
    """Download function that also retries 5XX errors"""
    print 'Downloading:', url
    try:
        html = urllib2.urlopen(url).read()
    except urllib2.URLError as e:
        print 'Download error:', e.reason
        html = None
        if num_retries > 0:
            if hasattr(e, 'code') and 500 <= e.code < 600:
                # retry 5XX HTTP errors
                html = download3(url, num_retries-1)
    return html

download = download3

if __name__ == '__main__':
    print download('http://httpstat.us/500')
```
The Internet Engineering Task Force defines the full list of HTTP status codes: https://tools.ietf.org/html/rfc7231#section-6
- 4xx errors indicate a problem with the request;
- 5xx errors indicate a problem on the server side.
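As a quick usage sketch (assuming the download3 defined above), httpstat.us simply echoes back whatever status code is requested, so a 500 exercises the retry branch while a 404 fails immediately:

```python
# download3 retries 5XX responses but gives up at once on 4XX responses
download3('http://httpstat.us/500')  # up to 2 retries after the first failure
download3('http://httpstat.us/404')  # client error: no retry, returns None
```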
2.設(shè)置用戶代理(user_agent)
默認(rèn)情況下,urllib2使用Python-urllib/2.7作為用戶代理下載網(wǎng)頁內(nèi)容的,其中2.7是Python的版本號(hào)。如果質(zhì)量不加的Python網(wǎng)絡(luò)的爬蟲(上面的代碼)有會(huì)造成服務(wù)器過載,一些網(wǎng)站還會(huì)封禁這個(gè)默認(rèn)用戶代理。比如,使用Python默認(rèn)用戶代理的情況下,訪問https://www.meetup.com/ ,會(huì)出現(xiàn):
```
wu_being@Ubuntukylin64:~/GitHub/WebScrapingWithPython/1.网络爬虫简介$ python 1.4.1download4.py 
Downloading: https://www.meetup.com/
Download error: [Errno 104] Connection reset by peer
None
wu_being@ubuntukylin64:~/GitHub/WebScrapingWithPython/1.网络爬虫简介$ python 1.4.1download4.py 
Downloading: https://www.meetup.com/
Download error: Forbidden
None
```
To make downloads more reliable, we need to set a user agent we control. The code below sets the user agent to Wu_Being.
```python
def download4(url, user_agent='Wu_Being', num_retries=2):
    """Download function that includes user agent support"""
    print 'Downloading:', url
    headers = {'User-agent': user_agent}
    request = urllib2.Request(url, headers=headers)
    try:
        html = urllib2.urlopen(request).read()
    except urllib2.URLError as e:
        print 'Download error:', e.reason
        html = None
        if num_retries > 0:
            if hasattr(e, 'code') and 500 <= e.code < 600:
                # retry 5XX HTTP errors
                html = download4(url, user_agent, num_retries-1)
    return html
```
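For reference, a minimal sketch for checking what user agent urllib2 advertises by default; the exact version suffix depends on your local Python installation:

```python
# The default opener carries a Python-urllib/2.x user agent header
import urllib2

print urllib2.build_opener().addheaders
# e.g. [('User-agent', 'Python-urllib/2.7')]
```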
我們從示例網(wǎng)址的robots.txt文件中發(fā)現(xiàn)的網(wǎng)站地圖sitemap.xml來下載所有網(wǎng)頁。為了解析網(wǎng)站地圖,我們用一個(gè)簡單的正則表達(dá)式從<loc>標(biāo)簽提取出URL。下一章介紹一種更加鍵壯的解析方法——CSS選擇器
```python
# -*- coding: utf-8 -*-
import re
from common import Download

def crawl_sitemap(url):
    # download the sitemap file
    sitemap = Download(url)
    #>Downloading: http://example.webscraping.com/sitemap.xml
    # extract the sitemap links
    links = re.findall('<loc>(.*?)</loc>', sitemap)
    # download each link
    for link in links:
        html = Download(link)
        # scrape html here
        # ...
        #>Downloading: http://example.webscraping.com/view/Afghanistan-1
        #>Downloading: http://example.webscraping.com/view/Aland-Islands-2
        #>Downloading: http://example.webscraping.com/view/Albania-3
        #>......

if __name__ == '__main__':
    crawl_sitemap('http://example.webscraping.com/sitemap.xml')
```
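If you would rather not parse XML with a regular expression, here is an alternative sketch using the standard library's XML parser; it assumes the sitemap has already been downloaded into a string, and the namespace URI is the one shown in the <urlset> tag in section 1.2:

```python
# Alternative: extract sitemap links with xml.etree instead of a regex
from xml.etree import ElementTree

def sitemap_links(sitemap_xml):
    """Return the URLs listed in a sitemap document given as a string."""
    ns = '{http://www.sitemaps.org/schemas/sitemap/0.9}'
    tree = ElementTree.fromstring(sitemap_xml)
    return [loc.text for loc in tree.iter(ns + 'loc')]
```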
2.3 Iterating over each page's database ID

```
http://example.webscraping.com/view/Afghanistan-1
http://example.webscraping.com/view/China-47
http://example.webscraping.com/view/Zimbabwe-252
```
These URLs differ only in their suffix. Entering http://example.webscraping.com/view/47 also displays the China page correctly, so we can download every country page just by iterating over the IDs.
```python
import itertools
from common import Download

def iteration():
    for page in itertools.count(1):
        url = 'http://example.webscraping.com/view/-%d' % page
        #url = 'http://example.webscraping.com/view/-{}'.format(page)
        html = Download(url)
        if html is None:
            # received an error trying to download this webpage
            # so assume have reached the last country ID and can stop downloading
            break
        else:
            # success - can scrape the result
            # ...
            pass
```
If some IDs are missing, the crawler will exit at the first gap. We can change it so that the iteration only stops after five consecutive download errors:
```python
def iteration():
    max_errors = 5  # maximum number of consecutive download errors allowed
    num_errors = 0  # current number of consecutive download errors
    for page in itertools.count(1):
        url = 'http://example.webscraping.com/view/-{}'.format(page)
        html = download(url)
        if html is None:
            # received an error trying to download this webpage
            num_errors += 1
            if num_errors == max_errors:
                # reached maximum amount of errors in a row so exit
                # assume have reached the last country ID and can stop downloading
                break
        else:
            # success - can scrape the result
            # ...
            num_errors = 0
```
If a site does not use sequential IDs, or its IDs are not numeric, this approach is of little use.
2.4 跟蹤網(wǎng)頁鏈接
我們需要讓爬蟲更像普通用戶,可以跟蹤鏈接,訪問感興趣的內(nèi)容。但容易下載大量我們不需要的網(wǎng)頁,如我們從一個(gè)論壇爬取用戶賬號(hào)詳情頁,不需要其他頁面,我們則需要用正則表達(dá)式來確定哪個(gè)頁面。
```
Downloading: http://example.webscraping.com
Downloading: /index/1
Traceback (most recent call last):
  File "1.4.4link_crawler1.py", line 29, in <module>
    link_crawler('http://example.webscraping.com', '/(index|view)')
  File "1.4.4link_crawler1.py", line 11, in link_crawler
    html = Download(url)
  ...
  File "/usr/lib/python2.7/urllib2.py", line 283, in get_type
    raise ValueError, "unknown url type: %s" % self.__original
ValueError: unknown url type: /index/1
```
The problem is that /index/1 is a relative link. A browser knows which page it is on and can resolve it, but urllib2 has no such context, so we use the urlparse module to convert it to an absolute link.
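A quick sketch of what urlparse.urljoin does with such a relative link (Python 2; in Python 3 the same function lives in urllib.parse):

```python
# urljoin resolves the relative link against the seed URL
import urlparse

print urlparse.urljoin('http://example.webscraping.com', '/index/1')
# http://example.webscraping.com/index/1
```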
```python
def link_crawler(seed_url, link_regex):
    crawl_queue = [seed_url]
    while crawl_queue:
        url = crawl_queue.pop()
        html = Download(url)
        for link in get_links(html):
            if re.match(link_regex, link):  # match against the regular expression
                link = urlparse.urljoin(seed_url, link)
                crawl_queue.append(link)
```
This code still has a problem: the pages link to each other, Australia links to Antarctica and Antarctica links back to Australia, so the crawler would keep downloading the same content in a loop. To avoid repeated downloads, we modify the function to remember which URLs it has already discovered.
```python
def link_crawler(seed_url, link_regex):
    """Crawl from the given seed URL following links matched by link_regex
    """
    crawl_queue = [seed_url]
    seen = set(crawl_queue)  # keep track which URL's have seen before
    while crawl_queue:
        url = crawl_queue.pop()
        html = Download(url)
        for link in get_links(html):
            # check if link matches expected regex
            if re.match(link_regex, link):
                # form absolute link
                link = urlparse.urljoin(seed_url, link)
                # check if have already seen this link
                if link not in seen:
                    seen.add(link)
                    crawl_queue.append(link)
```
Advanced features
1. Parsing robots.txt
robotparser模塊首先加載robots.txt文件,然后通過can_fetch()函數(shù)確定指定的用戶代理是否允許訪問網(wǎng)頁。
```python
>>> import robotparser
>>> rp = robotparser.RobotFileParser()
>>> rp.set_url('http://example.webscraping.com/robots.txt')
>>> rp.read()
>>> url = 'http://example.webscraping.com'
>>> user_agent = 'BadCrawler'
>>> rp.can_fetch(user_agent, url)
False
>>> user_agent = 'GoodCrawler'
>>> rp.can_fetch(user_agent, url)
True
>>> user_agent = 'Wu_Being'
>>> rp.can_fetch(user_agent, url)
True
```
To integrate this functionality into the crawler, we add the check inside the crawl loop.
```python
while crawl_queue:
    url = crawl_queue.pop()
    # check url passes robots.txt restrictions
    if rp.can_fetch(user_agent, url):
        ...
    else:
        print 'Blocked by robots.txt:', url
```
2. Proxy support (Proxy)
有時(shí)我們需要使用代理訪問某個(gè)網(wǎng)站。比如Netflix屏蔽美國以外的大多數(shù)國家。使用urllib2支持代理沒有想象中那么容易(可以嘗試用更好友的Python HTTP模塊requests來實(shí)現(xiàn)這個(gè)功能,文檔:http://docs.python-requests.org )。下面是使用urllib2支持代理的代碼。
```python
def download5(url, user_agent='wswp', proxy=None, num_retries=2):
    """Download function with support for proxies"""
    print 'Downloading:', url
    headers = {'User-agent': user_agent}
    request = urllib2.Request(url, headers=headers)
    opener = urllib2.build_opener()
    if proxy:
        proxy_params = {urlparse.urlparse(url).scheme: proxy}
        opener.add_handler(urllib2.ProxyHandler(proxy_params))
    try:
        html = opener.open(request).read()
    except urllib2.URLError as e:
        print 'Download error:', e.reason
        html = None
        if num_retries > 0:
            if hasattr(e, 'code') and 500 <= e.code < 600:
                # retry 5XX HTTP errors
                html = download5(url, user_agent, proxy, num_retries-1)
    return html
```
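For comparison, a rough sketch of the same idea with requests (assuming it has been installed with pip install requests; the function name and the proxy value are placeholders, not part of the original code):

```python
# Rough requests-based equivalent of download5: proxies are passed as a
# scheme -> proxy URL mapping instead of a ProxyHandler
import requests

def download_requests(url, user_agent='wswp', proxy=None):
    headers = {'User-agent': user_agent}
    proxies = {'http': proxy, 'https': proxy} if proxy else None
    try:
        response = requests.get(url, headers=headers, proxies=proxies)
        return response.text
    except requests.RequestException as e:
        print 'Download error:', e
        return None
```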
當(dāng)我們爬取的網(wǎng)站過快,可能會(huì)被封禁或造成服務(wù)器過載的風(fēng)險(xiǎn)。為了降低這些風(fēng)險(xiǎn),我們可以在兩次下載之間添加延時(shí),從而對爬蟲限速。
```python
class Throttle:
    """Throttle downloading by sleeping between requests to same domain
    """
    def __init__(self, delay):
        # amount of delay between downloads for each domain
        self.delay = delay
        # timestamp of when a domain was last accessed
        self.domains = {}

    def wait(self, url):
        domain = urlparse.urlparse(url).netloc
        last_accessed = self.domains.get(domain)
        if self.delay > 0 and last_accessed is not None:
            sleep_secs = self.delay - (datetime.now() - last_accessed).seconds
            if sleep_secs > 0:
                time.sleep(sleep_secs)
        self.domains[domain] = datetime.now()
```
The Throttle class records the last time each domain was accessed; if the time since that access is shorter than the specified delay, it sleeps for the difference. We call Throttle before every download to rate-limit the crawler.
```python
throttle = Throttle(delay)
...
throttle.wait(url)
html = download(url, headers, proxy=proxy, num_retries=num_retries)
```
4. Avoiding spider traps
A simple way to avoid getting stuck in a spider trap is to record how many links were followed to reach the current page, i.e. its depth. Once the maximum depth is reached, the crawler no longer adds links from that page to the queue. To do this, we change the seen variable from a set into a dictionary that also records the depth at which each page was found. To disable this feature, just set max_depth to a negative number.
```python
def link_crawler(..., max_depth=2):
    seen = {seed_url: 0}
    ...
    depth = seen[url]
    if depth != max_depth:
        for link in links:
            if link not in seen:
                seen[link] = depth + 1
                crawl_queue.append(link)
```
5. Final version
1.4.4link_crawler4_UltimateVersion.py
```python
# coding:utf-8
import re
import urlparse
import urllib2
import time
from datetime import datetime
import robotparser
import Queue

def link_crawler(seed_url, link_regex=None, delay=5, max_depth=-1, max_urls=-1,
                 headers=None, user_agent='wswp', proxy=None, num_retries=1):
    """Crawl from the given seed URL following links matched by link_regex
    """
    # the queue of URL's that still need to be crawled
    crawl_queue = Queue.deque([seed_url])
    # the URL's that have been seen and at what depth
    seen = {seed_url: 0}
    # track how many URL's have been downloaded
    num_urls = 0
    rp = get_robots(seed_url)
    throttle = Throttle(delay)
    headers = headers or {}
    if user_agent:
        headers['User-agent'] = user_agent

    while crawl_queue:
        url = crawl_queue.pop()
        # check url passes robots.txt restrictions
        if rp.can_fetch(user_agent, url):
            #if get_robots(seed_url):
            throttle.wait(url)
            html = download(url, headers, proxy=proxy, num_retries=num_retries)
            links = []

            depth = seen[url]
            if depth != max_depth:
                # can still crawl further
                if link_regex:
                    # filter for links matching our regular expression
                    links.extend(link for link in get_links(html) if re.match(link_regex, link))

                for link in links:
                    link = normalize(seed_url, link)
                    # check whether already crawled this link
                    if link not in seen:
                        seen[link] = depth + 1
                        # check link is within same domain
                        if same_domain(seed_url, link):
                            # success! add this new link to queue
                            crawl_queue.append(link)

            # check whether have reached downloaded maximum
            num_urls += 1
            if num_urls == max_urls:
                break
        else:
            print 'Blocked by robots.txt:', url


class Throttle:
    """Throttle downloading by sleeping between requests to same domain
    """
    def __init__(self, delay):
        # amount of delay between downloads for each domain
        self.delay = delay
        # timestamp of when a domain was last accessed
        self.domains = {}

    def wait(self, url):
        domain = urlparse.urlparse(url).netloc
        last_accessed = self.domains.get(domain)
        if self.delay > 0 and last_accessed is not None:
            sleep_secs = self.delay - (datetime.now() - last_accessed).seconds
            if sleep_secs > 0:
                time.sleep(sleep_secs)
        self.domains[domain] = datetime.now()


def download(url, headers, proxy, num_retries, data=None):
    print 'Downloading:', url
    request = urllib2.Request(url, data, headers)
    opener = urllib2.build_opener()
    if proxy:
        proxy_params = {urlparse.urlparse(url).scheme: proxy}
        opener.add_handler(urllib2.ProxyHandler(proxy_params))
    try:
        response = opener.open(request)
        html = response.read()
        code = response.code
    except urllib2.URLError as e:
        print 'Download error:', e.reason
        html = ''
        if hasattr(e, 'code'):
            code = e.code
            if num_retries > 0 and 500 <= code < 600:
                # retry 5XX HTTP errors
                return download(url, headers, proxy, num_retries-1, data)
        else:
            code = None
    return html


def normalize(seed_url, link):
    """Normalize this URL by removing hash and adding domain
    """
    link, _ = urlparse.urldefrag(link)  # remove hash to avoid duplicates; the fragment after # goes into _
    return urlparse.urljoin(seed_url, link)  # join the seed URL's scheme/host with the relative path


def same_domain(url1, url2):
    """Return True if both URL's belong to same domain
    """
    return urlparse.urlparse(url1).netloc == urlparse.urlparse(url2).netloc


def get_robots(url):
    """Initialize robots parser for this domain
    """
    rp = robotparser.RobotFileParser()
    rp.set_url(urlparse.urljoin(url, '/robots.txt'))
    rp.read()
    return rp


def get_links(html):
    """Return a list of links from html
    """
    # a regular expression to extract all links from the webpage
    webpage_regex = re.compile('<a[^>]+href=["\'](.*?)["\']', re.IGNORECASE)
    # list of all links from the webpage
    return webpage_regex.findall(html)


if __name__ == '__main__':
    #link_crawler('http://example.webscraping.com', '/(index|view)', delay=0, num_retries=1, user_agent='BadCrawler')
    #link_crawler('http://example.webscraping.com', '/(index|view)', delay=0, num_retries=1, max_depth=3, user_agent='GoodCrawler')
    link_crawler('http://127.0.0.1:8000/places', '/places/default/(index|view)', delay=0, num_retries=1, max_depth=2, user_agent='GoodCrawler')
    #http://127.0.0.1:8000/places/static/robots.txt
```

Wu_Being's blog notice: you are welcome to repost from this blog, but please credit the original post and link. Thank you!
[Python crawler series] "[Python crawler 1] Introduction to web crawlers": http://blog.csdn.net/u014134180/article/details/55506864
GitHub code files for the Python crawler series: https://github.com/1040003585/WebScrapingWithPython