
A Simple Spider Crawler Built on Scrapy

2020-01-04 19:28:34
Contributed by: a site user

This example shows a simple spider crawler built on Scrapy, shared here for reference. The code is as follows:

# Standard Python library imports

# 3rd party imports
# Note: these are the module paths of old Scrapy releases; later releases
# moved them to scrapy.spiders and scrapy.linkextractors (LinkExtractor).
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector

# My imports
from poetry_analysis.items import PoetryAnalysisItem

# Regex suffix matching any .html page below a start URL
HTML_FILE_NAME = r'.+\.html'


class PoetryParser(object):
    """
    Provides common parsing method for poems formatted this one specific way.
    """
    # Matches dates such as "04 January 2020"
    date_pattern = r'(\d{2} \w{3,9} \d{4})'

    def parse_poem(self, response):
        hxs = HtmlXPathSelector(response)
        item = PoetryAnalysisItem()
        # All poetry text is in pre tags
        text = hxs.select('//pre/text()').extract()
        item['text'] = ''.join(text)
        item['url'] = response.url
        # head/title contains title - a poem by author
        title_text = hxs.select('//head/title/text()').extract()[0]
        item['title'], item['author'] = title_text.split(' - ')
        item['author'] = item['author'].replace('a poem by', '')
        for key in ['title', 'author']:
            item[key] = item[key].strip()
        # date_pattern is a class attribute, so it is reached through self
        item['date'] = hxs.select("//p[@class='small']/text()").re(self.date_pattern)
        return item


class PoetrySpider(CrawlSpider, PoetryParser):
    name = 'example.com_poetry'
    allowed_domains = ['www.example.com']
    root_path = 'someuser/poetry/'
    start_urls = ['http://www.example.com/someuser/poetry/recent/',
                  'http://www.example.com/someuser/poetry/less_recent/']
    # Follow every .html link under each start URL and parse it as a poem
    rules = [Rule(SgmlLinkExtractor(allow=[start_urls[0] + HTML_FILE_NAME]),
                  callback='parse_poem'),
             Rule(SgmlLinkExtractor(allow=[start_urls[1] + HTML_FILE_NAME]),
                  callback='parse_poem')]
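
The title-splitting and date-matching steps of parse_poem can be tried without Scrapy installed. The following is a minimal sketch, not part of the original spider: it assumes the page title has the form "<title> - a poem by <author>" and that the date pattern is meant to use backslash escapes (`\d`, `\w`). The helper names `split_title` and `find_dates` and the sample strings are hypothetical.

```python
import re

# Assumed form of the parser's date pattern: "04 January 2020"-style dates.
DATE_PATTERN = r'(\d{2} \w{3,9} \d{4})'


def split_title(title_text):
    """Split '<title> - a poem by <author>' into (title, author).

    Hypothetical helper mirroring the string handling in parse_poem.
    """
    # Split only on the first ' - ' so a dash inside the author part
    # does not break the two-way unpacking.
    title, author = title_text.split(' - ', 1)
    author = author.replace('a poem by', '')
    return title.strip(), author.strip()


def find_dates(text):
    """Return every substring of text that looks like a date."""
    return re.findall(DATE_PATTERN, text)


if __name__ == '__main__':
    print(split_title('Autumn Rain - a poem by Jane Doe'))
    print(find_dates('Posted 04 January 2020 by someuser'))
```

Checking the string handling on sample input like this makes it easier to adjust the separator or the pattern before running a full crawl.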

I hope this article is helpful for your Python programming.
