Learning Python 3 web scraping, part 1

2019-11-06 07:15:30


Submitted by: a reader

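I have recently been learning to write web crawlers in Python.

The book I bought and most of the material online are still stuck on Python 2.

Some of the libraries have changed since then, so I am noting the differences in this blog.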

Crawler, version 1

import urllib.request

def download(url):
    # Fetch the URL and return the raw response body as bytes
    return urllib.request.urlopen(url).read()

txt = download('https://www.baidu.com')
print(txt.decode())  # the default encoding is 'utf-8'
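Since this is the first version, it does very little.
It fetches the page directly with urlopen() from urllib.request; here we request baidu.
The returned bytes are then decoded into the text txt.
Two points to note:
1. Python 2's urllib2 became urllib.request (with its exceptions in urllib.error) in Python 3.
2. The code above uses the default decoding, which only works for pages encoded in UTF-8. A page in a national-standard encoding such as gb2312 will probably fail to decode.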
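The second point can be reproduced without any network access. The following is a minimal sketch, assuming a made-up sample string that stands in for the body of a gb2312 page; the variable names are illustrative and not part of the crawler above.

```python
# Sample text standing in for a page body (an assumption for illustration)
text = "中文网页"
raw = text.encode("gb2312")  # the bytes a gb2312-encoded site would send

try:
    raw.decode()  # decode() defaults to 'utf-8'
    decoded_ok = True
except UnicodeDecodeError as err:
    decoded_ok = False
    print("utf-8 decode failed:", err.reason)

# Decoding with the encoding the bytes were written in recovers the text
recovered = raw.decode("gb2312")
print(recovered == text)
```

The default decode raises UnicodeDecodeError because the gb2312 byte pairs are not valid UTF-8 sequences, which is the failure the note above warns about.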
Version 2
import urllib.error
import urllib.request

def download(url, num_retries=2):
    print('downloading:', url)
    try:
        html = urllib.request.urlopen(url).read()
    except urllib.error.URLError as e:
        print('download error:', e.reason)
        html = None
        # Retry only on 5xx responses, which indicate a server-side problem
        if num_retries > 0 and hasattr(e, 'code') and 500 <= e.code < 600:
            return download(url, num_retries - 1)
    return html

page = download('http://httpstat.us/500')
if page is not None:
    print(page.decode())
else:
    print('Receive None')

Version 2 adds exception handling on top of version 1.
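When fetching pages, the most common error is 404 (the page does not currently exist).
4xx errors occur when there is a problem with the request.
5xx errors occur when there is a problem on the server side.
So for 5xx errors we can respond by retrying the download.
For a 5xx error, the code above makes up to three attempts (the first request plus two retries, since num_retries=2) and gives up if all of them fail.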
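The URL in the example, http://httpstat.us/500, is a test endpoint that answers with status 500, so the download essentially always ends in an error.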
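The retry rule can be checked offline by swapping the network call for a stand-in. This is only a sketch under stated assumptions: fetch_with_retries and make_failing_fetch are hypothetical helpers introduced here, not part of the crawler, and the URL is a placeholder that is never contacted.

```python
import urllib.error

def fetch_with_retries(fetch, url, num_retries=2):
    """Call fetch(url); retry only on HTTP 5xx, at most num_retries times."""
    try:
        return fetch(url)
    except urllib.error.URLError as e:
        code = getattr(e, 'code', None)  # only HTTPError carries a status code
        if num_retries > 0 and code is not None and 500 <= code < 600:
            return fetch_with_retries(fetch, url, num_retries - 1)
        return None

def make_failing_fetch(status, calls):
    """Build a fake fetch that records each call and always raises HTTPError."""
    def fetch(url):
        calls.append(url)
        raise urllib.error.HTTPError(url, status, 'simulated error', {}, None)
    return fetch

calls_500, calls_404 = [], []
result_500 = fetch_with_retries(make_failing_fetch(500, calls_500), 'http://example.invalid/')
result_404 = fetch_with_retries(make_failing_fetch(404, calls_404), 'http://example.invalid/')
print(len(calls_500), len(calls_404))  # 3 attempts for the 500, 1 for the 404
```

The 500 case is attempted three times before giving up, while the 404 case is attempted once, because repeating a request that is itself at fault would not help.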
Version 3
# Detect the page encoding automatically and decode with it
import chardet
import urllib.error
import urllib.request

def download(url, num_retries=2):
    print('downloading:', url)
    try:
        html = urllib.request.urlopen(url).read()
    except urllib.error.URLError as e:
        print('download error:', e.reason)
        html = None
        # Retry only on 5xx responses, which indicate a server-side problem
        if num_retries > 0 and hasattr(e, 'code') and 500 <= e.code < 600:
            return download(url, num_retries - 1)
    return html

page = download('http://www.sdu.edu.cn')
if page is not None:
    charset_info = chardet.detect(page)  # detect the text encoding
    print(charset_info)
    print(page.decode(charset_info['encoding'], 'ignore'))
else:
    print('Receive None')

This version adds automatic detection of the page encoding.
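It uses the detect function provided by the chardet module to work out the page encoding
and decodes the page with that encoding.
Note that detection can take a long time when a page is large.
In that case it is enough to run detection on part of the content:
chardet.detect(page[:500])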
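The sampling idea can be sketched as a small helper that detects on a prefix and then decodes the whole page. The helper name decode_with_sample, its parameters, and the stand-in detector are assumptions made so the example runs without chardet installed; in the crawler, chardet.detect would be passed as the detect argument, since it returns a dict with an 'encoding' key that may be None when detection fails.

```python
def decode_with_sample(page, detect, sample_size=500, fallback='utf-8'):
    """Detect the encoding from the first sample_size bytes, then decode all of page."""
    info = detect(page[:sample_size])
    encoding = info.get('encoding') or fallback  # fall back if detection gave None
    return page.decode(encoding, 'ignore')

seen = []

def fake_detect(sample):
    """Stand-in for chardet.detect that records how many bytes it was given."""
    seen.append(len(sample))
    return {'encoding': 'gb2312', 'confidence': 0.99}

page = ('中文网页' * 1000).encode('gb2312')  # 8000 bytes of sample data
text = decode_with_sample(page, fake_detect)
print(seen, len(text))

# When the detector cannot decide, the fallback encoding is used instead
plain = decode_with_sample(b'hello', lambda sample: {'encoding': None})
```

With chardet installed, the call would be decode_with_sample(page, chardet.detect). Only the first 500 bytes reach the detector, but the full page is decoded.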
Version 4
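# Send a custom user agent with the request
# Detect the page encoding automatically and decode with it
import chardet
import urllib.error
import urllib.request

def download(url, user_agent='wswp', num_retries=2):
    print('downloading:', url)
    headers = {'User-agent': user_agent}
    request = urllib.request.Request(url, headers=headers)
    try:
        response = urllib.request.urlopen(request)
        html = response.read()
    except urllib.error.URLError as e:
        print('download error:', e.reason)
        html = None
        # Retry only on 5xx responses; pass user_agent along so the
        # decremented counter is not mistaken for the user agent
        if num_retries > 0 and hasattr(e, 'code') and 500 <= e.code < 600:
            return download(url, user_agent, num_retries - 1)
    return html

def decode_page(page):
    if page is not None:
        charset_info = chardet.detect(page[:500])  # detect the text encoding
        # chardet reports None when it cannot tell, so fall back to utf-8
        charset = charset_info['encoding'] or 'utf-8'
        return page.decode(charset, 'ignore')
    else:
        return 'None Page'

page = download('http://www.nju.edu.cn')
txt = decode_page(page)
print(txt)

The improvement this time is a user agent. Many sites restrict crawlers, so a crawler often has to pose as a browser, usually with a Mozilla-style user agent string. This version is only a demonstration, of course.
This time the target is the Nanjing University website.
Later versions are covered in the next post.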
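To see what the user agent setting actually puts on the request, the Request object can be inspected before anything is sent. This is a minimal sketch: build_request is a hypothetical helper, and the Mozilla-style string is an illustrative value, not one taken from a real browser.

```python
import urllib.request

def build_request(url, user_agent='wswp'):
    """Create a Request that carries a custom User-Agent header (nothing is sent yet)."""
    return urllib.request.Request(url, headers={'User-Agent': user_agent})

request = build_request('http://www.nju.edu.cn',
                        user_agent='Mozilla/5.0 (compatible; wswp)')

# Request stores header names capitalized as 'User-agent'
print(request.get_header('User-agent'))
print(request.header_items())
```

Passing the resulting object to urllib.request.urlopen sends this header in place of urllib's default Python-urllib user agent, which some sites reject.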