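When writing a crawler, we inevitably end up calling either urlopen() from urllib or requests.get() to request a page and fetch its content. The difference is this: urlopen opens a URL, where the url argument can be either a URL string or a Request object, and for HTTP(S) URLs it returns an http.client.HTTPResponse object. That object provides, among others, the methods read(), readinto(), getheader(), getheaders() and fileno(), plus the attributes msg, version, status, reason, debuglevel and closed. In practice, read() is usually followed by decode(). One big advantage here is that the returned page content has not been decoded yet: read() gives you raw bytes, and the argument you pass to decode() selects the matching encoding. requests.get(), by contrast, requests the site's URL and returns a Response object, from which you can print the type of the result, the status code, the encoding, the cookies and so on.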
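The read()-then-decode() step described above can be sketched without touching a real site. The following is a minimal sketch, not the author's code: it starts a throwaway local server (DemoHandler, PAGE, fetch and run_demo are all hypothetical names introduced for illustration), fetches the page with urllib, and decodes the raw bytes using the charset announced in the Content-Type header, falling back to UTF-8 as an assumption.

```python
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.request import ProxyHandler, build_opener, urlopen

# Hypothetical page content used only for this demo.
PAGE = "<html><body>你好, crawler</body></html>"


class DemoHandler(BaseHTTPRequestHandler):
    """Stand-in for a real site: serves one UTF-8 encoded page."""

    def do_GET(self):
        body = PAGE.encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/html; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # silence the per-request log line


def fetch(url, open_url=urlopen):
    """Return (status, raw bytes, decoded text) for a URL."""
    with open_url(url) as response:      # http.client.HTTPResponse for HTTP URLs
        raw = response.read()            # bytes, not decoded yet
        # Charset taken from the Content-Type header; UTF-8 is an assumed fallback.
        charset = response.headers.get_content_charset() or "utf-8"
        return response.status, raw, raw.decode(charset)


def run_demo():
    """Serve PAGE on a free local port, fetch it once, then shut down."""
    server = HTTPServer(("127.0.0.1", 0), DemoHandler)  # port 0: any free port
    threading.Thread(target=server.serve_forever, daemon=True).start()
    # An opener with no proxies, so the local request is never routed elsewhere.
    local_open = build_opener(ProxyHandler({})).open
    try:
        return fetch(f"http://127.0.0.1:{server.server_port}/", local_open)
    finally:
        server.shutdown()
        server.server_close()


if __name__ == "__main__":
    status, raw, text = run_demo()
    print(status, type(raw).__name__)
    print(text)
```

Against a real site you would call fetch("https://...") directly with the default urlopen. If the header carries no charset and the page is not UTF-8, the encoding passed to decode() has to be chosen by hand, which is exactly the flexibility the paragraph above describes.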
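from urllib.request import urlopen
import requests

# requests: take the raw bytes and decode them explicitly as UTF-8
data_get = requests.get("https://www.baidu.com").content.decode("utf-8")

# urllib: read() returns the undecoded bytes
html_url = urlopen("https://www.baidu.com")
data_url = html_url.read()

# data_get is a str, so write it in text mode with an explicit encoding
with open("data_get.html", "w", encoding="utf-8") as f:
    f.write(data_get)
print(data_get)
print("------------------------\n")
print(data_url)
# data_url is bytes, so write it in binary mode
with open("data_url.html", "wb") as f:
    f.write(data_url)

The contents of data_get.html are as follows: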
<!DOCTYPE html><!--STATUS OK--><html><head> <meta http-equiv=content-type content=text/html;charset=utf-8> <meta http-equiv=X-UA-Compatible content=IE=Edge> <meta content=always name=referrer> <link rel=stylesheet type=text/CSS href=https://ss1.bdstatic.com/5eN1bjq8AAUYm2zgoY3K/r/www/cache/bdorz/baidu.min.css> <title>百度一下,你就知道</title> </head> <body link=#0000cc> <div id=wrapper> <div id=head> <div class=head_wrapper> <div class=s_form> <div class=s_form_wrapper> <div id=lg> <img hidefocus=true src=//www.baidu.com/img/bd_logo1.png width=270 height=129> </div> <form id=form name=f action=//www.baidu.com/s class=fm> <input type=hidden name=bdorz_come value=1> <input type=hidden name=ie value=utf-8> <input type=hidden name=f value=8> <input type=hidden name=rsv_bp value=1> <input type=hidden name=rsv_idx value=1> <input type=hidden name=tn value=baidu> <span class="bg s_

data_url is as follows:

<html><head> <script> location.replace(location.href.replace("https://","http://")); </script></head><body> <noscript><meta http-equiv="refresh" content="0;url=http://www.baidu.com/"></noscript></body></html>
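The two results above differ, and the post does not say why. One plausible cause, stated here purely as an assumption, is that the server reacts to the request headers: urllib identifies itself with a default User-Agent of the form Python-urllib/x.y. Since urlopen() also accepts a Request object, a minimal sketch of attaching a custom header looks like this. BROWSER_USER_AGENT, build_request and fetch_text are hypothetical names introduced for illustration, and the header value is an arbitrary browser-like string.

```python
from urllib.request import Request, urlopen

# Hypothetical browser-like value, chosen only for illustration.
BROWSER_USER_AGENT = "Mozilla/5.0 (X11; Linux x86_64)"


def build_request(url, user_agent=BROWSER_USER_AGENT):
    """Wrap a URL in a Request object that carries a custom User-Agent."""
    return Request(url, headers={"User-Agent": user_agent})


def fetch_text(url, encoding="utf-8"):
    """Pass the Request object to urlopen() and decode the raw bytes."""
    with urlopen(build_request(url)) as response:
        return response.read().decode(encoding)


# Building the Request sends nothing; it can be inspected offline.
request = build_request("https://www.baidu.com")
print(request.full_url)
print(request.get_method())
# Request stores header names capitalized, hence "User-agent" here.
print(request.get_header("User-agent"))
```

Calling fetch_text("https://www.baidu.com") would actually send the request. Whether the response then matches what requests.get() received is not something the original post verifies.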