[Python Crawler 8] The Scrapy Crawling Framework

2019-11-08 03:19:13


Contributed by: a web user

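Contents:
 - 1. Installing Scrapy
 - 2. Creating a project
   - 2.1 Defining the model
   - 2.2 Creating the spider
   - 2.3 Tuning the settings
   - 2.4 Testing the spider
   - 2.5 Extracting data with the shell command
   - 2.6 Saving the extracted data to a file
   - 2.7 Pausing and resuming a crawl
 - 3. Writing a visual spider with Portia
   - 3.1 Installation
   - 3.2 Annotation
   - 3.3 Tuning the spider
   - 3.4 Checking the results
 - 4. Automated extraction with Scrapely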

1. Installing Scrapy

Install Scrapy with pip: pip install Scrapy

wu_being@ubuntukylin64:~/GitHub/WebScrapingWithPython$ scrapy -h
Scrapy 1.3.0 - no active project

Usage:
  scrapy <command> [options] [args]

Available commands:
  bench         Run quick benchmark test
  commands
  fetch         Fetch a URL using the Scrapy downloader
  genspider     Generate new spider using pre-defined templates
  runspider     Run a self-contained spider (without creating a project)
  settings      Get settings values
  shell         Interactive scraping console
  startproject  Create new project
  version       Print Scrapy version
  view          Open URL in browser, as seen by Scrapy

  [ more ]      More commands available when run from project directory

Use "scrapy <command> -h" to see more info about a command
wu_being@ubuntukylin64:~/GitHub/WebScrapingWithPython$

This article uses the following commands:
 - startproject: create a new project
 - genspider: generate a new spider from a template
 - crawl: run a spider
 - shell: start the interactive scraping console

Documentation: http://doc.scrapy.org/latest/topics/commands.html

2. Creating a project

Run scrapy startproject <project_name> to create a new project. Here we use example_wu as the project name.

wu_being@ubuntukylin64:~/GitHub/WebScrapingWithPython/8.Scrapy爬蟲框架$ scrapy startproject
Usage
=====
  scrapy startproject <project_name> [project_dir]
...
wu_being@ubuntukylin64:~/GitHub/WebScrapingWithPython/8.Scrapy爬蟲框架$ scrapy startproject example_wu
New Scrapy project 'example_wu', using template directory '/usr/local/lib/python2.7/dist-packages/scrapy/templates/project', created in:
    /home/wu_being/GitHub/WebScrapingWithPython/8.Scrapy爬蟲框架/example_wu

You can start your first spider with:
    cd example_wu
    scrapy genspider example example.com
wu_being@ubuntukylin64:~/GitHub/WebScrapingWithPython/8.Scrapy爬蟲框架/example_wu$ ls
example_wu  scrapy.cfg
wu_being@ubuntukylin64:~/GitHub/WebScrapingWithPython/8.Scrapy爬蟲框架/example_wu$

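This is the default directory structure of a new project:

scrapy.cfg
example_wu/
    __init__.py
    items.py
    middlewares.py
    pipelines.py
    settings.py
    spiders/
        __init__.py

The most important files are:
 - scrapy.cfg: project configuration (no need to change it)
 - items.py: defines the model of the fields to be scraped
 - pipelines.py: processes the scraped fields (no need to change it)
 - settings.py: defines settings such as the user agent and the crawl delay
 - spiders/: the directory that holds the actual spider code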

2.1 Defining the model

The default code in example_wu/items.py is:

# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# http://doc.scrapy.org/en/latest/topics/items.html

import scrapy


class ExampleWuItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    pass

The ExampleWuItem class is a template. Replace its contents with the country details that the spider should store when it runs. Here we extract only the country name and the population, so change the default code to:

import scrapy


class ExampleWuItem(scrapy.Item):
    # define the fields for your item here like:
    name = scrapy.Field()
    population = scrapy.Field()

2.2 Creating the spider

Now we start writing the actual crawling code, known as a spider. Use the genspider command, passing the spider name, the domain, and an optional template argument:

scrapy genspider country 127.0.0.1:8000/places --template=crawl

wu_being@ubuntukylin64:~/GitHub/WebScrapingWithPython/8.Scrapy爬蟲框架/example_wu$ scrapy genspider
Usage
=====
  scrapy genspider [options] <name> <domain>

Generate new spider using pre-defined templates

Options
=======
--help, -h              show this help message and exit
--list, -l              List available templates
--edit, -e              Edit spider after creating it
--dump=TEMPLATE, -d TEMPLATE
                        Dump template to standard output
--template=TEMPLATE, -t TEMPLATE
                        Uses a custom template.
--force                 If the spider already exists, overwrite it with the
                        template
...
wu_being@ubuntukylin64:~/GitHub/WebScrapingWithPython/8.Scrapy爬蟲框架/example_wu$ scrapy genspider --list
Available templates:
  basic
  crawl
  csvfeed
  xmlfeed
wu_being@ubuntukylin64:~/GitHub/WebScrapingWithPython/8.Scrapy爬蟲框架/example_wu$ scrapy genspider country 127.0.0.1:8000/places --template=crawl
Created spider 'country' using template 'crawl' in module:
  example_wu.spiders.country
wu_being@ubuntukylin64:~/GitHub/WebScrapingWithPython/8.Scrapy爬蟲框架/example_wu$

Here we use the built-in crawl template, which generates an initial version closer to the country spider we want. Running genspider creates the file example_wu/spiders/country.py:

# -*- coding: utf-8 -*-
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class CountrySpider(CrawlSpider):
    name = 'country'
    #allowed_domains = ['127.0.0.1:8000/places']  # commented out: this is not a domain name
    start_urls = ['http://127.0.0.1:8000/places/']

    rules = (
        Rule(LinkExtractor(allow=r'Items/'), callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        i = {}
        #i['domain_id'] = response.xpath('//input[@id="sid"]/@value').extract()
        #i['name'] = response.xpath('//div[@id="name"]').extract()
        #i['description'] = response.xpath('//div[@id="description"]').extract()
        return i

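The attributes of this class are:
 - name: the name of the spider.
 - allowed_domains: the list of domains that may be crawled. If it is not defined, any domain may be crawled.
 - start_urls: the list of URLs where the crawl starts. These must be complete, fetchable URLs.
 - rules: a tuple of Rule objects whose link extractors use regular expressions to tell the spider which links to follow. A rule can also name a callback function that parses the downloaded response; the generated parse_item() method gives an example of extracting data from a response.

Documentation: http://doc.scrapy.org/en/latest/topics/spiders.html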
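The generated allowed_domains value was commented out because 127.0.0.1:8000/places carries a port and a path, so it is not a bare domain. The sketch below (Python 3, standard library only) splits the start URL into its parts to show the difference. Using the hostname part as the allowed_domains entry is an assumption here, not something the original project does.

```python
from urllib.parse import urlparse

start_url = 'http://127.0.0.1:8000/places/'
parts = urlparse(start_url)

# hostname is the bare host; netloc also carries the port.
hostname = parts.hostname
netloc = parts.netloc
path = parts.path

print('hostname:', hostname)
print('netloc:  ', netloc)
print('path:    ', path)

# Assumption: a value like this would be the closer fit for allowed_domains.
allowed_domains = [hostname]
```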

2.3 Tuning the settings

By default, Scrapy allows up to 8 concurrent downloads per domain (16 in total) with no delay between them. A crawler that behaves this way is easy for a server to detect and ban, so add a few lines to example_wu/settings.py:

# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32

# Configure a delay for requests for the same website (default: 0)
# See http://scrapy.readthedocs.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
DOWNLOAD_DELAY = 5
# The download delay setting will honor only one of:
CONCURRENT_REQUESTS_PER_DOMAIN = 1
#CONCURRENT_REQUESTS_PER_IP = 16

The delay is not exact. A perfectly regular delay can itself be detected by the server and lead to a ban, so Scrapy adds a random offset to the delay between two requests. Documentation: http://doc.scrapy.org/en/latest/topics/settings.html
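As a rough illustration of that random offset: with the RANDOMIZE_DOWNLOAD_DELAY setting enabled (the default), Scrapy waits between 0.5 and 1.5 times DOWNLOAD_DELAY. The sketch below (Python 3, standard library only) reproduces that range. The function name randomized_delay is hypothetical, and this is not Scrapy's own code.

```python
import random

DOWNLOAD_DELAY = 5  # seconds, the value used in settings.py


def randomized_delay(base_delay):
    """Draw a delay uniformly from [0.5 * base_delay, 1.5 * base_delay]."""
    return random.uniform(0.5 * base_delay, 1.5 * base_delay)


# Five sample delays; each falls between 2.5 and 7.5 seconds.
delays = [randomized_delay(DOWNLOAD_DELAY) for _ in range(5)]
print([round(d, 2) for d in delays])
```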

2.4 Testing the spider

Run the spider with the crawl command followed by the spider name.

wu_being@ubuntukylin64:~/GitHub/WebScrapingWithPython/8.Scrapy爬蟲框架/example_wu$ scrapy crawl country -s LOG_LEVEL=ERROR
wu_being@ubuntukylin64:~/GitHub/WebScrapingWithPython/8.Scrapy爬蟲框架/example_wu$

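The terminal log shows no error messages. The option -s LOG_LEVEL=ERROR is equivalent to adding the line LOG_LEVEL='ERROR' to settings.py; by default all log messages are shown in the terminal. The rule generated by the template does not match the links on this site, so replace rules in the spider with: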

rules = (
    Rule(LinkExtractor(allow='/index/'), follow=True),
    Rule(LinkExtractor(allow='/view/'), callback='parse_item'),
)

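We have added two rules above. The first crawls the index pages and follows the links in them (follow makes the crawl recursive; it defaults to True when a rule has no callback). The second crawls the country pages and passes each downloaded response to the callback function for data extraction.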

...
2017-01-30 00:12:47 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://127.0.0.1:8000/places/> (referer: None)
2017-01-30 00:12:52 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://127.0.0.1:8000/places/default/view/Afghanistan-1> (referer: http://127.0.0.1:8000/places/)
2017-01-30 00:12:57 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://127.0.0.1:8000/places/default/index/1> (referer: http://127.0.0.1:8000/places/)
2017-01-30 00:12:58 [scrapy.dupefilters] DEBUG: Filtered duplicate request: <GET http://127.0.0.1:8000/places/default/index/1> - no more duplicates will be shown (see DUPEFILTER_DEBUG to show all duplicates)
2017-01-30 00:13:03 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://127.0.0.1:8000/places/default/view/Antigua-and-Barbuda-10> (referer: http://127.0.0.1:8000/places/)
......
2017-01-30 00:14:11 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://127.0.0.1:8000/places/default/user/login?_next=%2Fplaces%2Fdefault%2Findex%2F1> (referer: http://127.0.0.1:8000/places/default/index/1)
2017-01-30 00:14:17 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://127.0.0.1:8000/places/default/user/register?_next=%2Fplaces%2Fdefault%2Findex%2F1> (referer: http://127.0.0.1:8000/places/default/index/1)
......

Duplicate links have been filtered out automatically, but the results include unwanted login and registration pages. We can exclude them with a regular expression.

rules = (
    Rule(LinkExtractor(allow='/index/', deny='/user/'), follow=True),  # False
    Rule(LinkExtractor(allow='/view/', deny='/user/'), callback='parse_item'),
)

Documentation for this class: http://doc.scrapy.org/en/latest/topics/linkextractors.html
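The effect of combining allow and deny can be sketched with the standard re module (Python 3): a link is kept only if it matches the allow pattern and does not match the deny pattern. The helper link_allowed is hypothetical and simplified, not LinkExtractor's implementation. The unencoded login URL is an assumption about how the link might appear in the page's href.

```python
import re


def link_allowed(url, allow, deny):
    """Keep a URL only if it matches `allow` and does not match `deny`."""
    return bool(re.search(allow, url)) and not re.search(deny, url)


urls = [
    'http://127.0.0.1:8000/places/default/index/1',
    'http://127.0.0.1:8000/places/default/view/Afghanistan-1',
    # Assumed unencoded form of the login link: it contains '/index/' too.
    'http://127.0.0.1:8000/places/default/user/login?_next=/places/default/index/1',
]

# Without deny, the login link would slip through the '/index/' rule.
without_deny = [u for u in urls if re.search('/index/', u)]
followed = [u for u in urls if link_allowed(u, allow='/index/', deny='/user/')]

print(len(without_deny), 'links match allow alone')
print(followed)
```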

2.5 Extracting data with the shell command

Scrapy provides a shell command that downloads a URL and exposes the resulting state in a Python interpreter.

wu_being@ubuntukylin64:~/GitHub/WebScrapingWithPython/8.Scrapy爬蟲框架/example_wu$ scrapy shell http://127.0.0.1:8000/places/default/view/47
...
2017-01-30 11:24:21 [scrapy.core.engine] INFO: Spider opened
2017-01-30 11:24:21 [scrapy.core.engine] DEBUG: Crawled (400) <GET http://127.0.0.1:8000/robots.txt> (referer: None)
2017-01-30 11:24:22 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://127.0.0.1:8000/places/default/view/47> (referer: None)
[s] Available Scrapy objects:
[s]   scrapy     scrapy module (contains scrapy.Request, scrapy.Selector, etc)
[s]   crawler    <scrapy.crawler.Crawler object at 0x7fd8e6d5cbd0>
[s]   item       {}
[s]   request    <GET http://127.0.0.1:8000/places/default/view/47>
[s]   response   <200 http://127.0.0.1:8000/places/default/view/47>
[s]   settings   <scrapy.settings.Settings object at 0x7fd8e6d5c5d0>
[s]   spider     <DefaultSpider 'default' at 0x7fd8e5b24c50>
[s] Useful shortcuts:
[s]   fetch(url[, redirect=True]) Fetch URL and update local objects (by default, redirects are followed)
[s]   fetch(req)                  Fetch a scrapy.Request and update local objects
[s]   shelp()           Shell help (print this help)
[s]   view(response)    View response in a browser
>>>

Let us try it out.

>>> response
<200 http://127.0.0.1:8000/places/default/view/47>
>>> response.url
'http://127.0.0.1:8000/places/default/view/47'
>>> response.status
200
>>> item
{}
>>>

Scrapy can use lxml to extract data. Here we use CSS selectors and call extract() to get the matched data.

>>> response.css('#places_country__row > td.w2p_fw::text')
[<Selector xpath=u"descendant-or-self::*[@id = 'places_country__row']/td[@class and contains(concat(' ', normalize-space(@class), ' '), ' w2p_fw ')]/text()" data=u'China'>]
>>> name_css='#places_country__row > td.w2p_fw::text'
>>> response.css(name_css)
[<Selector xpath=u"descendant-or-self::*[@id = 'places_country__row']/td[@class and contains(concat(' ', normalize-space(@class), ' '), ' w2p_fw ')]/text()" data=u'China'>]
>>> response.css(name_css).extract()
[u'China']
>>>
>>> pop_css='#places_population__row > td.w2p_fw::text'
>>> response.css(pop_css).extract()
[u'1330044000']
>>>
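To see what these selectors pick out without a running Scrapy shell, the sketch below (Python 3, standard library only) applies the same row-id and cell-class logic with xml.etree.ElementTree. The HTML fragment is a hypothetical, well-formed stand-in modeled on the selectors; the real page markup may differ. The helper field_text is also an assumption, not part of Scrapy.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment modeled on the selectors above (an assumption).
html = """
<table>
  <tr id="places_country__row">
    <td class="w2p_fl">Country:</td><td class="w2p_fw">China</td>
  </tr>
  <tr id="places_population__row">
    <td class="w2p_fl">Population:</td><td class="w2p_fw">1330044000</td>
  </tr>
</table>
"""

root = ET.fromstring(html.strip())


def field_text(row_id):
    """Return the text of the td with class w2p_fw in the row with this id."""
    cell = root.find(".//tr[@id='%s']/td[@class='w2p_fw']" % row_id)
    return cell.text


name = field_text('places_country__row')
population = field_text('places_population__row')
print(name, population)
```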

2.6 Saving the extracted data to a file

Below is the complete code of the spider.

# -*- coding: utf-8 -*-
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
from example_wu.items import ExampleWuItem  # added by wu


class CountrySpider(CrawlSpider):
    name = 'country'
    #allowed_domains = ['127.0.0.1:8000/places']  # commented out: this is not a domain name
    start_urls = ['http://127.0.0.1:8000/places/']

    rules = (
        Rule(LinkExtractor(allow='/index/', deny='/user/'), follow=True),  # False
        Rule(LinkExtractor(allow='/view/', deny='/user/'), callback='parse_item'),
    )

    def parse_item(self, response):
        item = ExampleWuItem()  # added by wu
        item['name'] = response.css('tr#places_country__row td.w2p_fw::text').extract()
        item['population'] = response.css('tr#places_population__row td.w2p_fw::text').extract()
        return item

To save the results, we could add code to parse_item() that stores the extracted data, or define a pipeline. However, Scrapy provides a convenient --output option that automatically saves the extracted data to a CSV, JSON, or XML file.

wu_being@ubuntukylin64:~/GitHub/WebScrapingWithPython/8.Scrapy爬蟲框架/example_wu$ scrapy crawl country -s LOG_LEVEL=DEBUG
2017-01-30 13:09:52 [scrapy.utils.log] INFO: Scrapy 1.3.0 started (bot: example_wu)
...
2017-01-30 13:09:52 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2017-01-30 13:09:52 [scrapy.core.engine] INFO: Spider opened
2017-01-30 13:09:52 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2017-01-30 13:09:52 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023
2017-01-30 13:09:52 [scrapy.core.engine] DEBUG: Crawled (400) <GET http://127.0.0.1:8000/robots.txt> (referer: None)
2017-01-30 13:09:53 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://127.0.0.1:8000/places/> (referer: None)
2017-01-30 13:09:53 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://127.0.0.1:8000/places/default/view/Afghanistan-1> (referer: http://127.0.0.1:8000/places/)
2017-01-30 13:09:53 [scrapy.core.scraper] DEBUG: Scraped from <200 http://127.0.0.1:8000/places/default/view/Afghanistan-1>
{'name': [u'Afghanistan'], 'population': [u'29121286']}
2017-01-30 13:09:53 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://127.0.0.1:8000/places/default/index/1> (referer: http://127.0.0.1:8000/places/)
2017-01-30 13:09:53 [scrapy.dupefilters] DEBUG: Filtered duplicate request: <GET http://127.0.0.1:8000/places/default/index/1> - no more duplicates will be shown (see DUPEFILTER_DEBUG to show all duplicates)
2017-01-30 13:09:53 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://127.0.0.1:8000/places/default/view/Antigua-and-Barbuda-10> (referer: http://127.0.0.1:8000/places/)
2017-01-30 13:09:54 [scrapy.core.scraper] DEBUG: Scraped from <200 http://127.0.0.1:8000/places/default/view/Antigua-and-Barbuda-10>
{'name': [u'Antigua and Barbuda'], 'population': [u'86754']}
2017-01-30 13:09:54 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://127.0.0.1:8000/places/default/view/Antarctica-9> (referer: http://127.0.0.1:8000/places/)
2017-01-30 13:09:54 [scrapy.core.scraper] DEBUG: Scraped from <200 http://127.0.0.1:8000/places/default/view/Antarctica-9>
{'name': [u'Antarctica'], 'population': [u'0']}
... ...
wu_being@ubuntukylin64:~/GitHub/WebScrapingWithPython/8.Scrapy爬蟲框架/example_wu$ scrapy crawl country -s LOG_LEVEL=INFO --output=countries.csv
...
2017-01-30 13:11:33 [scrapy.extensions.feedexport] INFO: Stored csv feed (252 items) in: countries.csv
2017-01-30 13:11:33 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 160417,
 'downloader/request_count': 280,
 'downloader/request_method_count/GET': 280,
 'downloader/response_bytes': 2844451,
 'downloader/response_count': 280,
 'downloader/response_status_count/200': 279,
 'downloader/response_status_count/400': 1,
 'dupefilter/filtered': 61,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2017, 1, 30, 5, 11, 33, 487258),
 'item_scraped_count': 252,
 'log_count/INFO': 8,
 'request_depth_max': 26,
 'response_received_count': 280,
 'scheduler/dequeued': 279,
 'scheduler/dequeued/memory': 279,
 'scheduler/enqueued': 279,
 'scheduler/enqueued/memory': 279,
 'start_time': datetime.datetime(2017, 1, 30, 5, 10, 34, 304933)}
2017-01-30 13:11:33 [scrapy.core.engine] INFO: Spider closed (finished)
wu_being@ubuntukylin64:~/GitHub/WebScrapingWithPython/8.Scrapy爬蟲框架/example_wu$
...

At the end of the crawl, Scrapy also prints some statistics. Looking at the output file countries.csv, the results are as expected.

name,population
Andorra,84000
American Samoa,57881
Algeria,34586184
Albania,2986952
Aland Islands,26711
Afghanistan,29121286
Antigua and Barbuda,86754
Antarctica,0
Anguilla,13254
... ...
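A quick way to check such a file is to read it back with the csv module. The sketch below (Python 3, standard library only) parses the first few rows shown above from an in-memory string; in practice one would open countries.csv instead. Converting population to int is an added assumption about how the data will be used.

```python
import csv
import io

# The first rows of countries.csv as listed above.
sample = """name,population
Andorra,84000
American Samoa,57881
Algeria,34586184
"""

rows = list(csv.DictReader(io.StringIO(sample)))

# Map each country name to its population as an integer.
populations = {row['name']: int(row['population']) for row in rows}

for name, population in populations.items():
    print(name, population)
```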

2.7 Pausing and resuming a crawl

All we need to do is define the JOBDIR setting, which names the directory where the spider's current state is saved.

wu_being@ubuntukylin64:~/GitHub/WebScrapingWithPython/8.Scrapy爬蟲框架/example_wu$ scrapy crawl country -s LOG_LEVEL=DEBUG -s JOBDIR=2.7crawls/country
...
^C2017-01-30 13:31:27 [scrapy.crawler] INFO: Received SIGINT, shutting down gracefully. Send again to force
2017-01-30 13:33:13 [scrapy.core.scraper] DEBUG: Scraped from <200 http://127.0.0.1:8000/places/default/view/Albania-3>
{'name': [u'Albania'], 'population': [u'2986952']}
2017-01-30 13:33:15 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://127.0.0.1:8000/places/default/view/Aland-Islands-2> (referer: http://127.0.0.1:8000/places/)
2017-01-30 13:33:16 [scrapy.core.scraper] DEBUG: Scraped from <200 http://127.0.0.1:8000/places/default/view/Aland-Islands-2>
{'name': [u'Aland Islands'], 'population': [u'26711']}
...

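We press Ctrl+C to send the termination signal, then wait for the spider to finish downloading a few more items and stop on its own. Do not press Ctrl+C a second time to force termination, or the spider will fail to save its state. Running the same command again resumes the crawl.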

wu_being@ubuntukylin64:~/GitHub/WebScrapingWithPython/8.Scrapy爬蟲框架/example_wu$
wu_being@ubuntukylin64:~/GitHub/WebScrapingWithPython/8.Scrapy爬蟲框架/example_wu$ scrapy crawl country -s LOG_LEVEL=DEBUG -s JOBDIR=2.7crawls/country
...
2017-01-30 13:33:21 [scrapy.core.engine] DEBUG: Crawled (400) <GET http://127.0.0.1:8000/robots.txt> (referer: None)
2017-01-30 13:33:23 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://127.0.0.1:8000/places/default/view/Barbados-20> (referer: http://127.0.0.1:8000/places/default/index/1)
2017-01-30 13:33:23 [scrapy.core.scraper] DEBUG: Scraped from <200 http://127.0.0.1:8000/places/default/view/Barbados-20>
{'name': [u'Barbados'], 'population': [u'285653']}
2017-01-30 13:33:25 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://127.0.0.1:8000/places/default/view/Bangladesh-19> (referer: http://127.0.0.1:8000/places/default/index/1)
2017-01-30 13:33:25 [scrapy.core.scraper] DEBUG: Scraped from <200 http://127.0.0.1:8000/places/default/view/Bangladesh-19>
{'name': [u'Bangladesh'], 'population': [u'156118464']}
...
^C2017-01-30 13:33:43 [scrapy.crawler] INFO: Received SIGINT, shutting down gracefully. Send again to force
2017-01-30 13:33:43 [scrapy.core.engine] INFO: Closing spider (shutdown)
^C2017-01-30 13:33:44 [scrapy.crawler] INFO: Received SIGINT twice, forcing unclean shutdown
wu_being@ubuntukylin64:~/GitHub/WebScrapingWithPython/8.Scrapy爬蟲框架/example_wu$

When resuming, watch out for expired cookies. Documentation: http://doc.scrapy.org/en/latest/topics/jobs.html
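Conceptually, resuming works because the pending requests and the set of already-seen requests are written to the job directory and read back on the next run. The sketch below (Python 3, standard library only) illustrates that idea with a single JSON file. The file name state.json and the functions save_state and load_state are hypothetical, and Scrapy's actual on-disk format is different.

```python
import json
import os
import tempfile


def save_state(jobdir, pending, seen):
    """Write the pending URL queue and the seen-URL set to jobdir."""
    os.makedirs(jobdir, exist_ok=True)
    with open(os.path.join(jobdir, 'state.json'), 'w') as f:
        json.dump({'pending': pending, 'seen': sorted(seen)}, f)


def load_state(jobdir):
    """Read the saved state back, or start empty if nothing was saved."""
    path = os.path.join(jobdir, 'state.json')
    if not os.path.exists(path):
        return [], set()
    with open(path) as f:
        state = json.load(f)
    return state['pending'], set(state['seen'])


# Simulate one interrupted run followed by a resumed run.
jobdir = os.path.join(tempfile.mkdtemp(), 'country')
save_state(
    jobdir,
    pending=['http://127.0.0.1:8000/places/default/index/2'],
    seen={'http://127.0.0.1:8000/places/'},
)
pending, seen = load_state(jobdir)
print(pending, seen)
```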

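3. Writing a visual spider with Portia

Portia is an open-source tool built on Scrapy. It lets you create a spider by clicking on the parts of a web page you want to extract, which is more convenient than writing CSS selectors by hand. Documentation: https://github.com/scrapinghub/portia#running-portia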

3.1 Installation

First install virtualenv and use it to create a virtual Python environment named portia_examle.

pip install virtualenv

wu_being@ubuntukylin64:~/GitHub/WebScrapingWithPython$ virtualenv portia_examle --no-site-packages
New python executable in /home/wu_being/GitHub/WebScrapingWithPython/portia_examle/bin/python
Installing setuptools, pip, wheel...done.
wu_being@ubuntukylin64:~/GitHub/WebScrapingWithPython$ source portia_examle/bin/activate
(portia_examle) wu_being@ubuntukylin64:~/GitHub/WebScrapingWithPython$
(portia_examle) wu_being@ubuntukylin64:~/GitHub/WebScrapingWithPython$ cd portia_examle/
(portia_examle) wu_being@ubuntukylin64:~/GitHub/WebScrapingWithPython/portia_examle$

Install Portia and its dependencies inside the virtualenv.

(portia_examle) wu_being@ubuntukylin64:~/GitHub/WebScrapingWithPython/portia_examle$ git clone https://github.com/scrapinghub/portia
cd portia
pip install -r requirements.txt
pip install -e ./slybot
cd slyd
twistd -n slyd

If the installation succeeded, the Portia tool can be opened in a browser at http://localhost:9001/static/main.html

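3.2 Annotation

The Portia start page has a text box for the URL of the page to scrape; enter http://127.0.0.1:8000/places/ . By default the project name is set to new_project and the spider name is set to the domain being crawled, 127.0.0.1:8000/places/. Both can be changed by clicking the corresponding label. Click the Annotate this page button, then click the country's population figure. Click the +field button to create a new field named population, then click Done to save. Do the same for the other fields. When annotation is finished, click the blue Continue Browsing button at the top.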

3.3 Tuning the spider

Once annotation is complete, Portia generates a Scrapy project and saves the resulting files in the data/projects directory. To run the spider, execute the portiacrawl command with the project name and the spider name.

(portia_examle) wu_being@ubuntukylin64:~/GitHub/WebScrapingWithPython/portia_examle$ portiacrawl portia/slyd/data/projects/new_project

If the spider runs too fast with the default settings and runs into server errors, pass the delay and concurrency settings on the command line:

portiacrawl portia/slyd/data/projects/new_project 127.0.0.1:8000/places/ -s DOWNLOAD_DELAY=2 -s CONCURRENT_REQUESTS_PER_DOMAIN=1

In the Crawling tab of the right-hand panel, add /index/ and /view/ as follow patterns and /user/ as an exclude pattern, then tick the Overlay blocked links checkbox.

3.4 Checking the results

(portia_examle) wu_being@ubuntukylin64:~/GitHub/WebScrapingWithPython/portia_examle$ portiacrawl portia/slyd/data/projects/new_project 127.0.0.1:8000/places/ --output=countries.csv -s DOWNLOAD_DELAY=2 -s CONCURRENT_REQUESTS_PER_DOMAIN=1

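Portia is a very convenient companion tool for Scrapy. For simple websites, developing a spider with Portia is usually faster. For complex websites (for example, interfaces that depend on JavaScript), you can instead develop the Scrapy spider directly in Python.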

4. Automated extraction with Scrapely

Portia uses the Scrapely library, which is trained on example data to build a model of what to extract from a web page, and then applies that model to other pages with the same structure. https://github.com/scrapy/scrapely

(portia_examle) wu_being@ubuntukylin64:~/GitHub/WebScrapingWithPython/portia_examle$ python
Python 2.7.12 (default, Nov 19 2016, 06:48:10)
[GCC 5.4.0 20160609] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> from scrapely import Scraper
>>> s=Scraper()
>>> train_url='http://127.0.0.1:8000/places/default/view/47'
>>> s.train(train_url,{'name':'China','population':'1330044000'})
>>> test_url='http://127.0.0.1:8000/places/default/view/239'
>>> s.scrape(test_url)
[{u'name':[u'United Kingdom'],u'population':[u'62,348,447']}]
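Note that the scraped population comes back as a string with thousands separators, inside a list. If a number is needed, a small post-processing step can normalize it. The sketch below (Python 3, standard library only) starts from the result shown above; the helper to_int is hypothetical and not part of Scrapely.

```python
def to_int(text):
    """Strip thousands separators and surrounding whitespace, then convert."""
    return int(text.replace(',', '').strip())


# The result returned by s.scrape(test_url) in the session above.
scraped = [{'name': ['United Kingdom'], 'population': ['62,348,447']}]

record = scraped[0]
name = record['name'][0]
population = to_int(record['population'][0])
print(name, population)
```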

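Wu_Being blog notice: reposting from this blog is welcome. Please credit the original post and include the original link. Thank you! [Python Crawler series] "[Python Crawler 8] The Scrapy Crawling Framework" http://blog.csdn.net/u014134180/article/details/55508259 GitHub code for the Python Crawler series: https://github.com/1040003585/WebScrapingWithPython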
