

Python: Scrapy in Practice


1、Create a project

win+R --> cmd --> cd desktop --> scrapy startproject tutorial    # this step creates a project folder on the desktop
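For reference, scrapy startproject tutorial should generate roughly this layout (the exact files vary a little between Scrapy versions):

tutorial/
    scrapy.cfg            # deploy configuration file
    tutorial/             # the project's Python module
        __init__.py
        items.py          # item definitions (edited in step 2)
        pipelines.py      # item pipelines
        settings.py       # project settings
        spiders/          # put your spiders here (step 3)
            __init__.py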

2、Define an item

Open the items.py file and edit it as follows:

# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# http://doc.scrapy.org/en/latest/topics/items.html

import scrapy


class DmozItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    title = scrapy.Field()
    link = scrapy.Field()
    desc = scrapy.Field()
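A scrapy.Item behaves like a dict, which makes for a quick sanity check in a Python shell started from the project folder; the field values below are made up for illustration:

from tutorial.items import DmozItem

# Items support dict-style access; assigning to an
# undeclared field raises a KeyError.
item = DmozItem(title=['Example Book'])       # hypothetical value
item['link'] = ['http://www.example.com/']    # hypothetical value
print(item['title'], item['link'])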

3、Write a spider

Create a new file named dmoz_spider.py under tutorial/spiders and start coding:

import scrapy


class DmozSpider(scrapy.Spider):
    name = "dmoz"
    allowed_domains = ['dmoz.org']
    start_urls = [
        'http://www.dmoz.org/Computers/Programming/Languages/Python/Books/',
        'http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/'
    ]

    def parse(self, response):
        # save each page under the last segment of its URL
        filename = response.url.split("/")[-2]
        with open(filename, 'wb') as f:
            f.write(response.body)

Once this first version downloads the pages correctly, replace the body of parse() with XPath extraction:

import scrapy


class DmozSpider(scrapy.Spider):
    name = "dmoz"
    allowed_domains = ['dmoz.org']
    start_urls = [
        'http://www.dmoz.org/Computers/Programming/Languages/Python/Books/',
        'http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/'
    ]

    def parse(self, response):
        sel = scrapy.selector.Selector(response)
        sites = sel.xpath('//div[@class="title-and-desc"]')
        for site in sites:
            title = site.xpath('a/div[@class="site-title"]/text()').extract()
            link = site.xpath('a/@href').extract()
            desc = site.xpath('div[@class="site-descr "]/text()').extract()
            print(title, link, desc)
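As a side note, response.xpath() is a shortcut for building the Selector yourself, and in recent Scrapy versions (1.0+) extract_first() returns a single string instead of a list; a minimal sketch of the same loop in that style:

    def parse(self, response):
        # response.xpath() builds the Selector for you
        for site in response.xpath('//div[@class="title-and-desc"]'):
            # extract_first() returns the first match as a string (or None)
            title = site.xpath('a/div[@class="site-title"]/text()').extract_first()
            link = site.xpath('a/@href').extract_first()
            desc = site.xpath('div[@class="site-descr "]/text()').extract_first()
            print(title, link, desc)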

4、cmd:

-->cd desktop

-->cd tutorial

-->scrapy crawl dmoz

When the crawl finishes, you will find two new files named Books and Resources in the desktop/tutorial folder.

5、cmd (optional):

-->Desktop/tutorial>scrapy shell "http://www.dmoz.org/Computers/Programming/Languages/Python/Books"    # this step fetches the response

-->response.body

>>> response.headers
{'Cteonnt-Length': ['46147'], 'Content-Language': ['en'], 'Set-Cookie': ['JsessionID=CDE6228CA4B21EA2DE64C22A5578133C; Path=/; HttpOnly'], 'Server': ['Apache'], 'Date': ['Sun, 19 Feb 2017 12:50:19 GMT'], 'Content-Type': ['text/html;charset=UTF-8']}
>>> response.xpath('//title')
[<Selector xpath='//title' data=u'<title>DMOZ - Computers: Programming: La'>]
>>> response.xpath('//title').extract()
[u'<title>DMOZ - Computers: Programming: Languages: Python: Books</title>']
>>>
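To pull out just the text inside the tag rather than the whole element, add /text() to the path; continuing the session above, the result should look roughly like:

>>> response.xpath('//title/text()').extract()
[u'DMOZ - Computers: Programming: Languages: Python: Books']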

6、Updated dmoz_spider.py code, using the Item:

import scrapy
from tutorial.items import DmozItem


class DmozSpider(scrapy.Spider):
    name = "dmoz"
    allowed_domains = ['dmoz.org']
    start_urls = [
        'http://www.dmoz.org/Computers/Programming/Languages/Python/Books/',
        'http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/'
    ]

    def parse(self, response):
        sel = scrapy.selector.Selector(response)
        sites = sel.xpath('//div[@class="title-and-desc"]')
        items = []
        for site in sites:
            item = DmozItem()
            item['title'] = site.xpath('a/div[@class="site-title"]/text()').extract()
            item['link'] = site.xpath('a/@href').extract()
            item['desc'] = site.xpath('div[@class="site-descr "]/text()').extract()
            items.append(item)
        return items
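Returning a list works, but spiders more idiomatically yield items one at a time, which lets Scrapy process each item as it is scraped; a minimal sketch of the same parse() in that form:

    def parse(self, response):
        for site in response.xpath('//div[@class="title-and-desc"]'):
            item = DmozItem()
            item['title'] = site.xpath('a/div[@class="site-title"]/text()').extract()
            item['link'] = site.xpath('a/@href').extract()
            item['desc'] = site.xpath('div[@class="site-descr "]/text()').extract()
            yield item    # Scrapy collects yielded items just like a returned list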

7、cmd:

-->cd tutorial

-->scrapy crawl dmoz -o items.json -t json
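In newer Scrapy releases the feed format is inferred from the output file's extension, so the -t json flag can usually be dropped:

-->scrapy crawl dmoz -o items.json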

"items.json" file will be created

