(4) How to Build a Distributed Crawler with Scrapy: Rule-Based Automatic Crawling and Passing Arguments on the Command Line

2019-11-14 17:00:39

Source: reposted

Contributed by: a reader

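This post covers rule-based crawling and passing custom arguments on the command line. In my view, a rule-driven crawler is a crawler in the true sense.

Let's first look at how this kind of crawler works, logically: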

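We start from a given URL, open the page, and extract every link on it. A rule, expressed as a regular expression, selects the links whose form we want. We crawl those pages, process each one (extract data or perform some other action), and repeat the cycle until the crawl stops. One potential problem here is crawling the same page more than once; the Scrapy framework already takes care of this. In general, the usual way to filter duplicates is to keep a table of visited addresses and check it before each request: if the URL has already been crawled, it is skipped. Another option is an off-the-shelf, general-purpose solution: the Bloom filter.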
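The loop described above can be sketched in a few lines of plain Python. This is only an illustration of the idea, not how Scrapy implements it: FAKE_WEB, crawl and fetch_links are hypothetical names, and the "web" is an in-memory dictionary so the example runs without any network access. The seen set plays the role of the address table; a Bloom filter could take its place when the table grows too large to hold in memory.

```python
import re
from collections import deque

# Hypothetical in-memory "web": each URL maps to the links found on that page.
FAKE_WEB = {
    "/list?page=1": ["/list?page=2", "/about", "/list?page=1"],
    "/list?page=2": ["/list?page=1", "/list?page=3", "/login"],
    "/list?page=3": ["/list?page=2"],
}


def crawl(start_url, allow_pattern, fetch_links):
    """Breadth-first crawl that follows only links matching allow_pattern."""
    rule = re.compile(allow_pattern)
    seen = {start_url}           # the "address table" of URLs already scheduled
    queue = deque([start_url])
    visited_order = []
    while queue:
        url = queue.popleft()
        visited_order.append(url)         # stand-in for "process the page"
        for link in fetch_links(url):
            if not rule.search(link):     # the rule rejects this link
                continue
            if link in seen:              # already crawled or queued: skip it
                continue
            seen.add(link)
            queue.append(link)
    return visited_order


order = crawl("/list?page=1", r"/list[?]page=\d+$", lambda u: FAKE_WEB.get(u, []))
print(order)
```

Even though the pages link back to each other, each one is visited exactly once, and /about and /login are never followed because they do not match the rule.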

This post shows how to use CrawlSpider to crawl information about every group listed under a Douban tag:

1. Create a new class that inherits from CrawlSpider

    from scrapy.spiders import CrawlSpider, Rule
    from scrapy.linkextractors import LinkExtractor
    from douban.items import GroupInfo

    class MySpider(CrawlSpider):

For more about CrawlSpider, see: http://doc.scrapy.org/en/latest/topics/spiders.html#crawlspider

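2. To pass arguments on the command line, the class constructor has to declare the parameters we want to accept:

    def __init__(self, target=None):
        if self.current != '':
            target = self.current
        if target is not None:
            self.current = target
        self.start_urls = [
            'http://www.douban.com/group/explore?tag=%s' % (target)
        ]
        self.rules = (
            Rule(LinkExtractor(allow=('/group/explore[?]start=.*?[&]tag=.*?$', ),
                               restrict_xpaths=('//span[@class="next"]')),
                 callback='parse_next_page', follow=True),
        )
        # call the base-class constructor
        super(MySpider, self).__init__()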

On the command line, use it like this:

scrapy crawl douban.xp --logfile=test.log -a target=%E6%96%87%E5%85%B7

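This passes the custom argument into the spider. The value %E6%96%87%E5%85%B7 is the URL-encoded form of the tag 文具 (stationery).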
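To make the mechanism concrete, here is a minimal stand-alone sketch of how a `-a NAME=VALUE` option can end up as a keyword argument of the constructor. It does not use Scrapy's own argument handling: parse_spider_args and DemoSpider are hypothetical names made up for this illustration.

```python
from urllib.parse import unquote


def parse_spider_args(argv):
    """Collect every `-a NAME=VALUE` pair from argv into a dict (hypothetical helper)."""
    args = {}
    i = 0
    while i < len(argv):
        if argv[i] == "-a" and i + 1 < len(argv):
            name, _, value = argv[i + 1].partition("=")
            args[name] = value
            i += 2
        else:
            i += 1
    return args


class DemoSpider:
    """Stand-in for a spider whose constructor declares the `target` parameter."""

    def __init__(self, target=None):
        self.target = target
        self.start_url = "http://www.douban.com/group/explore?tag=%s" % target


argv = ["crawl", "douban.xp", "--logfile=test.log", "-a", "target=%E6%96%87%E5%85%B7"]
kwargs = parse_spider_args(argv)
spider = DemoSpider(**kwargs)   # same as DemoSpider(target="%E6%96%87%E5%85%B7")

print(kwargs)
print(spider.start_url)
print(unquote(spider.target))   # decode the percent-encoded tag for display
```

Because the value is inserted into the URL as-is, the tag has to be percent-encoded before it is given on the command line, as in the command above.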

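The last line of the constructor deserves special attention: super(MySpider, self).__init__()

Going to the definition of CrawlSpider shows why. Its constructor calls a private method (_compile_rules) that compiles the rules attribute. If the constructor of our own spider never calls the base-class constructor, the spider fails with an error as soon as it runs.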
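The effect can be reproduced without Scrapy. The sketch below is a simplified stand-in, not the CrawlSpider source: BaseCrawler, ForgetfulSpider and CorrectSpider are hypothetical classes that only mimic the pattern of a base constructor preparing a private attribute that later code relies on.

```python
class BaseCrawler:
    """Stand-in for a base class that prepares its rules in the constructor."""

    rules = ()

    def __init__(self):
        self._compile_rules()

    def _compile_rules(self):
        # Copy the declared rules into the private attribute used while crawling.
        self._rules = list(self.rules)

    def crawl(self):
        return [rule.upper() for rule in self._rules]


class ForgetfulSpider(BaseCrawler):
    def __init__(self):
        self.rules = ("follow-next-page",)
        # Bug: the base constructor is never called, so _rules is never created.


class CorrectSpider(BaseCrawler):
    def __init__(self):
        self.rules = ("follow-next-page",)
        super().__init__()  # compiles self.rules into self._rules


print(CorrectSpider().crawl())
try:
    ForgetfulSpider().crawl()
except AttributeError as exc:
    print("failed:", exc)
```

Note the order in CorrectSpider: the rules are assigned first and the base constructor is called last, so that the rules already exist when they are compiled. The spider in this post follows the same order.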

3. Write the rules:

    self.rules = (
        Rule(LinkExtractor(allow=('/group/explore[?]start=.*?[&]tag=.*?$', ),
                           restrict_xpaths=('//span[@class="next"]')),
             callback='parse_next_page', follow=True),
    )

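allow defines the form of the links to extract, as regular expressions. restrict_xpaths confines extraction to links found inside the specified elements. callback names the method that is called for each page fetched from an extracted link, and follow=True keeps applying the rules to those pages.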
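A quick way to check what the allow pattern accepts is to try it with Python's re module. The sketch assumes the pattern is applied with search-style matching against the absolute URL; the three candidate URLs are made up for illustration.

```python
import re

# The allow pattern from the rule above.
ALLOW = r'/group/explore[?]start=.*?[&]tag=.*?$'

candidates = [
    # A "next page" link: has both start= and tag=.
    'http://www.douban.com/group/explore?start=20&tag=%E6%96%87%E5%85%B7',
    # The first page of a tag: no start= parameter.
    'http://www.douban.com/group/explore?tag=%E6%96%87%E5%85%B7',
    # An individual group page.
    'http://www.douban.com/group/12345/',
]

matched = [url for url in candidates if re.search(ALLOW, url)]
for url in matched:
    print(url)
```

Only the paginated link matches. Writing [?] and [&] as one-character classes is a way of matching those characters literally without backslash escapes.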

4. The complete code, for reference:

    from scrapy.spiders import CrawlSpider, Rule
    from scrapy.linkextractors import LinkExtractor
    from douban.items import GroupInfo


    class MySpider(CrawlSpider):
        name = 'douban.xp'
        current = ''
        allowed_domains = ['douban.com']

        def __init__(self, target=None):
            # Compare strings with != rather than "is not", which tests identity.
            if self.current != '':
                target = self.current
            if target is not None:
                self.current = target
            self.start_urls = [
                'http://www.douban.com/group/explore?tag=%s' % (target)
            ]
            self.rules = (
                Rule(LinkExtractor(allow=('/group/explore[?]start=.*?[&]tag=.*?$', ),
                                   restrict_xpaths=('//span[@class="next"]')),
                     callback='parse_next_page', follow=True),
            )
            # Call the base-class constructor last so that the rules get compiled.
            super(MySpider, self).__init__()

        def parse_next_page(self, response):
            self.logger.info('begin init the page %s', response.url)
            list_item = response.xpath('//a[@class="nbg"]')
            # xpath() returns a list-like object, never None, so test for emptiness.
            if not list_item:
                self.logger.info('cant select anything in selector')
                return
            for a_item in list_item:
                item = GroupInfo()
                item['group_url'] = ''.join(a_item.xpath('@href').extract())
                item['group_tag'] = self.current
                item['group_name'] = ''.join(a_item.xpath('@title').extract())
                yield item

        def parse_start_url(self, response):
            self.logger.info('begin init the start page %s', response.url)
            list_item = response.xpath('//a[@class="nbg"]')
            # xpath() returns a list-like object, never None, so test for emptiness.
            if not list_item:
                self.logger.info('cant select anything in selector')
                return
            for a_item in list_item:
                item = GroupInfo()
                item['group_url'] = ''.join(a_item.xpath('@href').extract())
                item['group_tag'] = self.current
                item['group_name'] = ''.join(a_item.xpath('@title').extract())
                yield item

        def parse_next_page_people(self, response):
            self.logger.info('Hi, this is the next page! %s', response.url)