An Example of Scraping Web Page Content with Scrapy in Python

2020-01-04 15:01:50

Source: reposted

Contributed by: a reader

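Last week I spent a full week learning Python and Scrapy and built a complete web crawler from scratch. The research was painful, but I enjoyed it; that is what working in tech is like.

First, install Python. There are plenty of pitfalls, and you have to climb out of them one at a time. I work on Windows (no money for a Mac), and I ran into all sorts of problems during installation, mostly missing dependencies of one kind or another.

I won't repeat the installation tutorial here. If the installation fails with an error saying that Windows C/C++ is required, the usual cause is a missing Windows build environment. Most tutorials online tell you to install Visual Studio, which is overkill; in fact, installing the Windows SDK is enough.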

Here is my crawler code.

The main spider:

# -*- coding: utf-8 -*-
import scrapy
from scrapy.http import Request
from scrapy.selector import Selector

from zjf.FsmzItems import FsmzItem


# Circle: relationships and lifestyle
class MySpider(scrapy.Spider):
    # Spider name
    name = "MySpider"
    # Allowed domains
    allowed_domains = ["nvsheng.com"]
    # URLs to crawl (filled in by __init__)
    start_urls = []
    # Flag: stays 0 until the first page has been parsed
    x = 0

    # Parse callback
    def parse(self, response):
        item = FsmzItem()
        sel = Selector(response)
        item['title'] = sel.xpath('//h1/text()').extract()
        item['text'] = sel.xpath('//*[@class="content"]/p/text()').extract()
        item['imags'] = sel.xpath('//div[@id="content"]/p/a/img/@src|//div[@id="content"]/p/img/@src').extract()
        # Only the first page schedules the remaining pages
        if MySpider.x == 0:
            page_list = self.getUrl(response)
            for page_single in page_list:
                yield Request(page_single)
        MySpider.x += 1
        yield item

    # init: accepts arguments passed in at run time
    # Command line usage: scrapy crawl MySpider -a start_url="http://some_url"
    def __init__(self, *args, **kwargs):
        super(MySpider, self).__init__(*args, **kwargs)
        self.start_urls = [kwargs.get('start_url')]

    # Collect the URLs of the other pages from the pagination links
    def getUrl(self, response):
        url_list = []
        select = Selector(response)
        page_list_tmp = select.xpath('//div[@class="viewnewpages"]/a[not(@class="next")]/@href').extract()
        for page_tmp in page_list_tmp:
            page_url = "http://www.nvsheng.com/emotion/px/" + page_tmp
            # Compare full URLs so that duplicates are actually skipped
            if page_url not in url_list:
                url_list.append(page_url)
        return url_list
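The getUrl method above builds page URLs by prefixing a hardcoded base path. As a possible alternative, here is a minimal sketch that resolves relative links against the URL of the current page with the standard library's urljoin and drops duplicates. The function name build_page_urls and the sample hrefs are hypothetical, chosen only for illustration.

```python
from urllib.parse import urljoin


def build_page_urls(base_url, hrefs):
    """Resolve relative pagination links against base_url, skipping duplicates."""
    seen = set()
    urls = []
    for href in hrefs:
        full_url = urljoin(base_url, href)
        if full_url not in seen:
            seen.add(full_url)
            urls.append(full_url)
    return urls


# Hypothetical hrefs, shaped like what the pagination XPath might return.
hrefs = ["1234_2.html", "1234_3.html", "1234_2.html"]
page_urls = build_page_urls("http://www.nvsheng.com/emotion/px/1234.html", hrefs)
print(page_urls)
```

Inside a spider callback, response.urljoin(href) performs the same resolution against the response's own URL, so the base path would not need to be hardcoded.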

The Pipelines class:

# -*- coding: utf-8 -*-
# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: http://doc.scrapy.org/en/latest/topics/item-pipeline.html
import json
import random
import urllib.request

import requests
from requests_toolbelt.multipart.encoder import MultipartEncoder

from zjf import settings


class MyPipeline(object):
    flag = 1
    post_title = ''
    post_text = []
    post_text_imageUrl_list = []
    cs = []
    user_id = ''

    def __init__(self):
        # Pick a random user_id to simulate posting
        MyPipeline.user_id = MyPipeline.getRandomUser('37619,18441390,18441391')

    # Process the data
    def process_item(self, item, spider):
        # Join the body text of this page
        text_str_tmp = "".join(item['text'])
        # Take the title from the first item only
        if MyPipeline.flag == 1:
            MyPipeline.post_title = MyPipeline.post_title + item['title'][0]
        # Save and upload the images
        text_insert_pic = ''
        text_insert_pic_w = ''
        text_insert_pic_h = ''
        for imag_url in item['imags']:
            img_name = imag_url.replace('/', '').replace('.', '').replace('|', '').replace(':', '')
            pic_dir = settings.IMAGES_STORE + '%s.jpg' % (img_name)
            urllib.request.urlretrieve(imag_url, pic_dir)
            # Upload the image; the response is JSON
            upload_img_result = MyPipeline.uploadImage(pic_dir, 'image/jpeg')
            # Read the stored image URL and size from the JSON
            text_insert_pic = upload_img_result['result']['image_url']
            text_insert_pic_w = upload_img_result['result']['w']
            text_insert_pic_h = upload_img_result['result']['h']
        # Build the JSON entry for this page
        if MyPipeline.flag == 1:
            cs_json = {"c": text_str_tmp, "i": "", "w": text_insert_pic_w, "h": text_insert_pic_h}
        else:
            cs_json = {"c": text_str_tmp, "i": text_insert_pic, "w": text_insert_pic_w, "h": text_insert_pic_h}
        MyPipeline.cs.append(cs_json)
        MyPipeline.flag += 1
        return item

    # Called when the spider is opened
    def open_spider(self, spider):
        pass

    # Called when the spider is closed
    def close_spider(self, spider):
        strcs = json.dumps(MyPipeline.cs)
        jsonData = {"apisign": "99ea3eda4b45549162c4a741d58baa60", "user_id": MyPipeline.user_id,
                    "gid": 30, "t": MyPipeline.post_title, "cs": strcs}
        MyPipeline.uploadPost(jsonData)

    # Upload an image
    @staticmethod
    def uploadImage(img_path, content_type):
        "uploadImage functions"
        # UPLOAD_IMG_URL = "http://api.qa.douguo.net/robot/uploadpostimage"
        UPLOAD_IMG_URL = "http://api.douguo.net/robot/uploadpostimage"
        # imgPath = 'D:/pics/http___img_nvsheng_com_uploads_allimg_170119_18-1f1191g440_jpg.jpg'
        with open(img_path, 'rb') as img_file:
            m = MultipartEncoder(
                fields={'user_id': MyPipeline.user_id,
                        'apisign': '99ea3eda4b45549162c4a741d58baa60',
                        'image': ('filename', img_file, content_type)}
            )
            r = requests.post(UPLOAD_IMG_URL, data=m, headers={'Content-Type': m.content_type})
        return r.json()

    # Create the post from the collected data
    @staticmethod
    def uploadPost(jsonData):
        CREATE_POST_URL = "http://api.douguo.net/robot/uploadimagespost"
        reqPost = requests.post(CREATE_POST_URL, data=jsonData)

    # Pick one user_id at random from a comma-separated string
    @staticmethod
    def getRandomUser(userStr):
        user_list = str(userStr).split(',')
        return random.choice(user_list)
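The pipeline reads settings.IMAGES_STORE and, as its header comment says, has to be registered in ITEM_PIPELINES, but the article does not show the settings file. The fragment below is only a sketch of what those two entries might look like: the dotted path zjf.pipelines.MyPipeline assumes the class is saved in zjf/pipelines.py, and D:/pics/ is an assumed example directory.

```python
# settings.py (sketch): the values below are assumptions, not the author's settings.

# Register the pipeline. The dotted path assumes the class lives in zjf/pipelines.py.
# The integer (conventionally 0-1000) sets the order in which pipelines run.
ITEM_PIPELINES = {
    "zjf.pipelines.MyPipeline": 300,
}

# Directory for downloaded images. The pipeline builds file paths by plain
# string concatenation, so the value needs a trailing slash.
IMAGES_STORE = "D:/pics/"
```

The directory has to exist before the crawl starts, because urlretrieve does not create missing folders.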

The Items class that holds the fields:

# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# http://doc.scrapy.org/en/latest/topics/items.html

import scrapy


class FsmzItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    # tutor = scrapy.Field()
    # strongText = scrapy.Field()
    # title must be declared because the spider assigns item['title']
    title = scrapy.Field()
    text = scrapy.Field()
    imags = scrapy.Field()

Type the following at the command line:

scrapy crawl MySpider -a start_url=http://www.aaa.com

This crawls the content under aaa.com.
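Scrapy needs an absolute URL, including the scheme, for every start request, so a bare value such as www.aaa.com would be rejected with a "Missing scheme" error. Here is a minimal sketch of a helper that a spider's __init__ could call on the start_url argument; the name ensure_scheme is hypothetical.

```python
def ensure_scheme(url, default_scheme="http"):
    """Return url unchanged if it starts with http:// or https://, else add a scheme."""
    if url.startswith(("http://", "https://")):
        return url
    return default_scheme + "://" + url


print(ensure_scheme("www.aaa.com"))
print(ensure_scheme("https://www.aaa.com"))
```

With this in place, self.start_urls = [ensure_scheme(kwargs.get('start_url'))] would accept both forms, as long as the argument is actually supplied.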
