Learning Python Selenium: An Automated Web Scraper

2020-01-04 16:11:26

Contributed by: a reader

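Straight to the point: this post uses Python Selenium to drive a browser automatically and scrape data from web pages. It covers clicking buttons, navigating between pages, typing into a search box, storing the useful data from a page, labelling records with auto-incrementing IDs in MongoDB, and so on.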

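1. First, a word about Python Selenium. It is a test-automation tool that controls a browser and operates on web pages. In a crawler it pairs seamlessly with BeautifulSoup, apart from some nasty verification pages on foreign sites. For image CAPTCHAs I have my own source code for cracking them, with a success rate of 85%.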

For details, ask in QQ group 607021567 (this does not count as an ad; the group shares plenty of Python resources, plus some big-data material [hadoop]).

2. BeautifulSoup needs no detailed introduction; here is the link: https://www.crummy.com/software/BeautifulSoup/bs4/doc/ (the official BeautifulSoup documentation).

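3. Generating auto-incrementing IDs in MongoDB. Every document stored in MongoDB already has a fixed ID (the _id field), but that ID is hard for a human to read, even though it is trivial for a machine. So whenever I store data, I habitually add a new ID of my own to keep track of each record.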

To use MongoDB from Python you need to import the module with: from pymongo import MongoClient, ASCENDING, DESCENDING. Installing that module is up to you!

Now for the program itself, straight to a worked example (one step at a time):

Import the modules:

from selenium import webdriver
from bs4 import BeautifulSoup
import requests
from pymongo import MongoClient, ASCENDING, DESCENDING
import time
import re

Each of these modules has been explained before; re and requests in particular were covered in earlier posts. They are all core pieces and none can be left out!

To begin, a small example: automating a simulated search on Taobao (source code below).

But first, Selenium's element-locating methods:

find_element_by_id
find_element_by_name
find_element_by_xpath
find_element_by_link_text
find_element_by_partial_link_text
find_element_by_tag_name
find_element_by_class_name
find_element_by_css_selector
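The helper names listed above belong to the Selenium 3 API. Selenium 4.3 and later removed them in favour of a single find_element(by, value) call. The following is a minimal sketch of how each legacy name maps to a locator strategy string; LEGACY_TO_BY and find are hypothetical names introduced only for this illustration, and the driver is whatever WebDriver object you already have.

```python
# Locator strategy strings accepted by find_element(by, value).
# They equal the constants on selenium.webdriver.common.by.By
# (for example By.XPATH == "xpath").
LEGACY_TO_BY = {
    "find_element_by_id": "id",
    "find_element_by_name": "name",
    "find_element_by_xpath": "xpath",
    "find_element_by_link_text": "link text",
    "find_element_by_partial_link_text": "partial link text",
    "find_element_by_tag_name": "tag name",
    "find_element_by_class_name": "class name",
    "find_element_by_css_selector": "css selector",
}


def find(driver, legacy_name, value):
    """Locate one element via find_element(by, value).

    `find` and LEGACY_TO_BY are hypothetical helpers for this sketch;
    `driver` is any object with a Selenium-style find_element method.
    """
    by = LEGACY_TO_BY[legacy_name]  # KeyError for an unknown helper name
    return driver.find_element(by, value)
```

With a real driver, find(driver, "find_element_by_xpath", '//input[@class="search-combobox-input"]') would be equivalent to driver.find_element(By.XPATH, '//input[@class="search-combobox-input"]').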

Source code:

# Written for Python 3 with the Selenium 3 API (find_element_by_xpath)
from selenium import webdriver
from bs4 import BeautifulSoup
import requests
from pymongo import MongoClient, ASCENDING, DESCENDING
import time
import re

def TaoBao():
    try:
        Taobaourl = 'https://www.taobao.com/'
        driver = webdriver.Chrome()
        driver.get(Taobaourl)
        time.sleep(5)  # pause here, otherwise the program is likely to be detected as a spider
        text = 'Strong Man'  # text to type into the search box
        # send_keys() returns None, so it cannot be chained with .click()
        driver.find_element_by_xpath('//input[@class="search-combobox-input"]').send_keys(text)
        driver.find_element_by_xpath('//button[@class="btn-search tb-bg"]').click()
        driver.quit()
    except Exception as e:
        print(e)

if __name__ == '__main__':
    TaoBao()

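To see it in action, copy the code and run it as is! I only used the xpath method because it is the most dependable. The text in orange (if I'm not colour-blind) is the element being located in the page, and you can find it in the page's HTML.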

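Next comes combining it with BeautifulSoup. So far we have only opened the page and do not have its HTML source yet. For that we need the page_source attribute, written as variable_name.page_source on the driver. It will make your dreams come true, you know?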

ht = driver.page_source
# print(ht)  # print it out if you want to take a look
soup = BeautifulSoup(ht, 'html.parser')

What follows is ordinary BeautifulSoup usage. The previous post covers the structure of the data and how to extract it in detail!

Oh, never mind! Here is the simplest possible locate-and-extract:

soup = BeautifulSoup(ht, 'html.parser')
a = soup.find('table', id="ctl00_ContentMain_SearchResultsGrid_grid")
if a:  # this check is required: if the page lacks this element, the program would otherwise stop
    pass  # work with the table here

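When filtering on a tag's class attribute, the keyword argument must be written class_ (class is a reserved word in Python). Be sure to remember that!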
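The two points above (check the result of find before using it, and spell the keyword class_) can be sketched on a small HTML string. This is only an illustration: cell_texts, the sample markup, and the "grid" and "odd" names are made up for the example, and BeautifulSoup is imported inside the function so the snippet still loads where bs4 is not installed.

```python
def cell_texts(html, table_id, row_class):
    """Return the stripped cell texts of every row with the given class.

    Hypothetical helper for illustration; needs beautifulsoup4 when called.
    """
    from bs4 import BeautifulSoup  # third-party: pip install beautifulsoup4

    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table", id=table_id)
    if table is None:  # find() returns None when nothing matches
        return []
    rows = table.find_all("tr", class_=row_class)  # class_, not class
    return [[td.get_text(strip=True) for td in tr.find_all("td")]
            for tr in rows]


# Made-up sample markup standing in for driver.page_source
SAMPLE = """
<table id="grid">
  <tr class="odd"><td> A </td><td>1</td></tr>
  <tr class="even"><td>B</td><td>2</td></tr>
</table>
"""
```

Calling cell_texts(SAMPLE, "grid", "odd") returns the cells of the first row only, while cell_texts(SAMPLE, "missing", "odd") returns an empty list instead of raising an error.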

Ha! Now for MongoDB, where the details matter. First we need the module: from pymongo import MongoClient, ASCENDING, DESCENDING

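MongoDB's own syntax carries over into Python almost unchanged, so all we need is to define the database handle and the collection name as globals and connect to the MongoDB server on your machine.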

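if __name__ == '__main__':
    # names assigned at module level are already global, so the functions can see them
    table = 'mouser_product'  # collection name
    mconn = MongoClient("mongodb://localhost")  # server address
    db = mconn.test  # database handle
    db.authenticate('test', 'test')  # username and password (PyMongo 3 API)
    TaoBao()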
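One caveat about the snippet above: db.authenticate was removed in PyMongo 4, where credentials are passed to MongoClient instead, for example inside the connection URI. A minimal sketch of building such a URI follows; mongo_uri is a hypothetical helper name, and the host, database and test/test credentials simply mirror the values used above.

```python
from urllib.parse import quote_plus


def mongo_uri(user, password, host="localhost", database="test"):
    """Build a mongodb:// URI with percent-escaped credentials.

    Hypothetical helper; escaping matters when the password contains
    characters such as '@', ':' or '/'.
    """
    return "mongodb://%s:%s@%s/%s" % (
        quote_plus(user), quote_plus(password), host, database)


uri = mongo_uri("test", "test")
print(uri)  # mongodb://test:test@localhost/test

# With PyMongo installed, the connection would then be opened like this:
#   from pymongo import MongoClient
#   db = MongoClient(uri).test
```

Keeping the credentials in one URI also means the rest of the script never has to call a separate login step.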

With those defined, we need our new ID to track and label the data:

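def getNewsn():
    # increment the counter document for this collection, creating it on first use
    db.sn.find_and_modify({"_id": table}, update={"$inc": {'currentIdValue': 1}}, upsert=True)
    # read the counter back from the same collection (db.sn) that was just updated
    dic = db.sn.find({"_id": table}).limit(1)
    return dic[0].get("currentIdValue")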

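This approach is generic, so you only need to remember the MongoDB syntax in it. Because it returns a value, it is written as a function. There is no need to agonise over how it works; a general understanding is enough, since the real focus is the process of storing the data.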
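For readers on PyMongo 3.0 or later, the same counter idea can be written as a single atomic call, which avoids the separate read after the update. This is a sketch rather than the author's code: next_sequence is a hypothetical name, counters stands for the db.sn collection, and the currentIdValue field is the one used above.

```python
def next_sequence(counters, name):
    """Atomically increment and return the counter stored under _id == name.

    `counters` is expected to be a pymongo Collection (db.sn in this post).
    The import sits inside the function so the sketch loads without pymongo.
    """
    from pymongo import ReturnDocument  # third-party: pip install pymongo

    doc = counters.find_one_and_update(
        {"_id": name},
        {"$inc": {"currentIdValue": 1}},
        upsert=True,                           # create the counter on first use
        return_document=ReturnDocument.AFTER,  # hand back the incremented document
    )
    return doc["currentIdValue"]
```

Called as next_sequence(db.sn, table), the first call for a new collection name would yield 1, the next 2, and so on, because $inc on a missing field starts from zero.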

count = db[table].find({'data': data}).count()  # look the data up in the database
if count <= 0:  # is it already stored?
    ids = getNewsn()  # ids is our newly defined ID, an incrementing ID that starts at 1
    db[table].insert({"ids": ids, "data": data})
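The check-then-insert step above relies on cursor.count() and insert(), which PyMongo 4 no longer provides. Below is a minimal sketch of the same logic with count_documents and insert_one. The name insert_if_new is hypothetical, and the url field is borrowed from the complete example later in this post rather than from the placeholder above.

```python
def insert_if_new(collection, url, new_id):
    """Insert {"ids": new_id(), "url": url} unless the url is already stored.

    Returns True when a document was inserted. `new_id` is a callable so
    that no ID is consumed for a duplicate. `collection` is expected to be
    a pymongo Collection such as db[table].
    """
    # limit=1 lets the server stop counting at the first match
    if collection.count_documents({"url": url}, limit=1) > 0:
        return False
    collection.insert_one({"ids": new_id(), "url": url})
    return True
```

Note that checking and then inserting is not atomic. If several scrapers write to the same collection, a unique index on the url field would probably be the more robust guard.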

That way our data goes straight into the MongoDB database. As for why MongoDB is so popular in big-data work: it is lightweight and fast!

Finally, a complete example:

# Written for Python 3 with the Selenium 3 and PyMongo 3 APIs
# (find_element_by_xpath, find_and_modify, insert, cursor.count, authenticate)
from selenium import webdriver
from bs4 import BeautifulSoup
import requests
from pymongo import MongoClient, ASCENDING, DESCENDING
import time
import re

def parser():
    try:
        with open('sitemap.txt', 'r') as f:  # one listing URL per line
            for line in f.readlines():
                sorturl = line.strip()
                driver = webdriver.Firefox()
                driver.get(sorturl)
                time.sleep(50)
                ht = driver.page_source
                # pageurl(ht)
                soup = BeautifulSoup(ht, 'html.parser')
                a = soup.find('a', class_="first-last")
                if a:
                    pagenum = int(a.get_text().strip())
                    print(pagenum)
                    for i in range(1, pagenum):
                        element = driver.find_element_by_xpath('//a[@id="ctl00_ContentMain_PagerTop_%s"]' % i)
                        element.click()
                        html = driver.page_source  # page_source belongs to the driver, not the element
                        pageurl(html)
                        time.sleep(50)
                driver.quit()  # close the browser only after every page of this URL is done
    except Exception as e:
        print(e)

def pageurl(ht):
    try:
        soup = BeautifulSoup(ht, 'html.parser')
        a = soup.find('table', id="ctl00_ContentMain_SearchResultsGrid_grid")
        if a:
            # odd-numbered result rows
            tr = a.find_all('tr', class_="SearchResultsRowOdd")
            if tr:
                for i in tr:
                    td = i.find_all('td')
                    if td:
                        url = td[2].find('a')
                        if url:
                            producturl = 'site-url' + url['href']  # 'site-url' stands in for the site's base URL
                            print(producturl)
                            count = db[table].find({"url": producturl}).count()
                            if count <= 0:
                                sn = getNewsn()
                                db[table].insert({"sn": sn, "url": producturl})
                                print(str(sn) + ' inserted successfully')
                                time.sleep(3)
                            else:
                                print('exists url')
            # even-numbered result rows
            tr1 = a.find_all('tr', class_="SearchResultsRowEven")
            if tr1:
                for i in tr1:
                    td = i.find_all('td')
                    if td:
                        url = td[2].find('a')
                        if url:
                            producturl = 'site-url' + url['href']  # same placeholder as above
                            print(producturl)
                            count = db[table].find({"url": producturl}).count()
                            if count <= 0:
                                sn = getNewsn()
                                db[table].insert({"sn": sn, "url": producturl})
                                print(str(sn) + ' inserted successfully')
                                time.sleep(3)
                            else:
                                print('exists url')
                            # time.sleep(5)
    except Exception as e:
        print(e)

def getNewsn():
    # increment the per-collection counter, then read the new value back
    db.sn.find_and_modify({"_id": table}, update={"$inc": {'currentIdValue': 1}}, upsert=True)
    dic = db.sn.find({"_id": table}).limit(1)
    return dic[0].get("currentIdValue")

if __name__ == '__main__':
    # db and table are module-level names, so the functions above see them as globals
    table = 'mous_product'
    mconn = MongoClient("mongodb://localhost")
    db = mconn.test
    db.authenticate('test', 'test')
    parser()
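The two near-identical loops over odd and even rows in pageurl could be merged, because BeautifulSoup accepts a list for class_ and matches a tag carrying any of the listed classes. The sketch below only collects the links and leaves storage to the caller. The name product_urls and the base_url parameter are assumptions for illustration (the original keeps the site's address as a placeholder), and bs4 is imported inside the function so the snippet loads without it.

```python
from urllib.parse import urljoin


def product_urls(html, base_url):
    """Collect absolute links from the third cell of every result row.

    Hypothetical helper; needs beautifulsoup4 when called.
    """
    from bs4 import BeautifulSoup  # third-party: pip install beautifulsoup4

    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table", id="ctl00_ContentMain_SearchResultsGrid_grid")
    if table is None:
        return []
    urls = []
    # A list passed to class_ matches rows that carry either class.
    row_classes = ["SearchResultsRowOdd", "SearchResultsRowEven"]
    for tr in table.find_all("tr", class_=row_classes):
        cells = tr.find_all("td")
        link = cells[2].find("a") if len(cells) > 2 else None
        if link is not None and link.get("href"):
            urls.append(urljoin(base_url, link["href"]))
    return urls
```

Unlike the original, which handles all odd rows before the even ones, this version visits the rows in page order. It also skips rows with fewer than three cells instead of raising an IndexError.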

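This code came about while I was getting past a tedious CAPTCHA page on a foreign site, which honestly left me speechless! The cracking method itself is still a work in progress. What you see here is the complete source, unabridged and written entirely by hand!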
