- 1 Adding cache support to the link crawler
- 2 Disk cache
  - 2.1 Implementing the disk cache
  - 2.2 Testing the cache
  - 2.3 Saving disk space
  - 2.4 Expiring stale data
  - 2.5 Drawbacks of the disk cache
- 3 Database cache
  - 3.1 What is NoSQL?
  - 3.2 Installing MongoDB
  - 3.3 MongoDB overview
  - 3.4 MongoDB cache implementation
  - 3.5 Compressed storage
  - 3.6 Testing the cache
  - 3.7 Full MongoDB cache code
In the previous article we learned how to extract data from web pages and save the results to a spreadsheet. If we later want to extract another field, we have to download the whole website again. That is not a big deal for our small example site, but for a site with millions of pages it could take weeks. So in this article we cache the downloaded pages, so that each page is downloaded only once.
1 Adding cache support to the link crawler
We refactor the downloader into a class so that its parameters are set once in the constructor and reused across many downloads; the class checks the cache before downloading a URL, and the throttling logic moves inside it. The `__call__` special method of the `Downloader` class checks the cache first: if the URL is already cached, it then checks whether that cached download hit a server error. If neither is a problem, the cached result is used; otherwise the URL is downloaded as usual and the result is saved to the cache. The `download` method now returns the HTTP status code along with the HTML, so that error codes can be stored in the cache and checked later. If you do not need throttling or caching, you can call `download` directly, bypassing `__call__`.

```python
class Downloader:
    def __init__(self, delay=5, user_agent='Wu_Being', proxies=None,
                 num_retries=1, cache=None):
        self.throttle = Throttle(delay)
        self.user_agent = user_agent
        self.proxies = proxies
        self.num_retries = num_retries
        self.cache = cache

    def __call__(self, url):
        result = None
        if self.cache:
            try:
                result = self.cache[url]
            except KeyError:
                # url is not available in cache
                pass
            else:
                if self.num_retries > 0 and 500 <= result['code'] < 600:
                    # server error so ignore result from cache and re-download
                    result = None
        if result is None:
            # result was not loaded from cache so still need to download
            self.throttle.wait(url)
            proxy = random.choice(self.proxies) if self.proxies else None
            headers = {'User-agent': self.user_agent}
            result = self.download(url, headers, proxy=proxy,
                                   num_retries=self.num_retries)
            if self.cache:
                # save result to cache
                self.cache[url] = result
        return result['html']

    def download(self, url, headers, proxy, num_retries, data=None):
        print 'Downloading:', url
        ...
        return {'html': html, 'code': code}


class Throttle:
    def __init__(self, delay):
        ...

    def wait(self, url):
        ...
```

To support caching, the link crawler also needs a few tweaks: adding a `cache` parameter, removing the throttling code, and replacing the `download` function with the new class.
```python
from downloader import Downloader

def link_crawler(... cache=None):
    crawl_queue = [seed_url]
    seen = {seed_url: 0}
    # track how many URL's have been downloaded
    num_urls = 0
    rp = get_robots(seed_url)
    #cache.clear() ###############################
    D = Downloader(delay=delay, user_agent=user_agent, proxies=proxies,
                   num_retries=num_retries, cache=cache)

    while crawl_queue:
        url = crawl_queue.pop()
        depth = seen[url]
        # check url passes robots.txt restrictions
        if rp.can_fetch(user_agent, url):
            html = D(url)  # invokes Downloader.__call__(url)
            links = []
            ...

def normalize(seed_url, link):
    ...

def same_domain(url1, url2):
    ...

def get_robots(url):
    ...

def get_links(html):
    ...

"""
if __name__ == '__main__':
    link_crawler('http://example.webscraping.com', '/(index|view)', delay=0, num_retries=1, user_agent='BadCrawler')
    link_crawler('http://example.webscraping.com', '/(index|view)', delay=0, num_retries=1, max_depth=1, user_agent='GoodCrawler')
"""
```

Now that the basic structure of the cache-aware crawler is in place, we can start building the cache itself.
2 Disk cache

To cache downloads on disk, we will save each page in a file whose path is derived from its URL. File systems restrict what those paths can look like:
| Operating system | File system | Invalid filename characters | Maximum filename length |
| --- | --- | --- | --- |
| Linux | Ext3/Ext4 | / and \0 | 255 bytes |
| OS X | HFS Plus | : and \0 | 255 UTF-16 code units |
| Windows | NTFS | \, /, ?, :, *, ", >, < and \| | 255 characters |
To keep our file paths safe across these file systems, we restrict them to digits, letters, and a few basic punctuation characters, and replace everything else with an underscore.
```python
>>> import re
>>> url = "http://example.webscraping.com/default/view/australia-1"
>>> re.sub('[^/0-9a-zA-Z\-,.;_ ]', '_', url)
'http_//example.webscraping.com/default/view/australia-1'
```

In addition, the filename and each directory name need to be limited to 255 characters.
```python
>>> filename = re.sub('[^/0-9a-zA-Z\-,.;_ ]', '_', url)
>>> filename = '/'.join(segment[:255] for segment in filename.split('/'))
>>> print filename
http_//example.webscraping.com/default/view/australia-1
>>> print '#'.join(segment[:5] for segment in filename.split('/'))
http_##examp#defau#view#austr
>>>
```

There is one more edge case: a URL whose path ends with a slash. Splitting such a URL leaves an empty final segment, which would be an invalid filename. For example:

- http://example.webscraping.com/index/
- http://example.webscraping.com/index/1
For the first URL we can append index.html as the filename, so that index becomes a directory containing index.html; the second URL then maps to a file named 1 inside that same index directory, and the two paths no longer conflict.
```python
>>> import urlparse
>>> components = urlparse.urlsplit('http://exmaple.scraping.com/index/')
>>> print components
SplitResult(scheme='http', netloc='exmaple.scraping.com', path='/index/', query='', fragment='')
>>> print components.path
/index/
>>> path = components.path
>>> if not path:
...     path = '/index.html'
... elif path.endswith('/'):
...     path += 'index.html'
...
>>> filename = components.netloc + path + components.query
>>> filename
'exmaple.scraping.com/index/index.html'
>>>
```

2.1 Implementing the disk cache
We can now combine the full URL-to-directory-and-filename mapping logic into the main part of the disk cache. The constructor takes a parameter that sets the cache location, and the url_to_path method applies the filename restrictions discussed above.
```python
from link_crawler import link_crawler

class DiskCache:
    def __init__(self, cache_dir='cache', ...):
        """
        cache_dir: the root level folder for the cache
        """
        self.cache_dir = cache_dir
        ...

    def url_to_path(self, url):
        """Create file system path for this URL
        """
        components = urlparse.urlsplit(url)
        # when empty path set to /index.html
        path = components.path
        if not path:
            path = '/index.html'
        elif path.endswith('/'):
            path += 'index.html'
        filename = components.netloc + path + components.query
        # replace invalid characters
        filename = re.sub('[^/0-9a-zA-Z\-.,;_ ]', '_', filename)
        # restrict maximum number of characters
        filename = '/'.join(segment[:255] for segment in filename.split('/'))
        # join the cache directory and the filename into a full path
        return os.path.join(self.cache_dir, filename)

    def __getitem__(self, url):
        ...

    def __setitem__(self, url, result):
        ...

    def __delitem__(self, url):
        ...

    def has_expired(self, timestamp):
        ...

    def clear(self):
        ...


if __name__ == '__main__':
    link_crawler('http://example.webscraping.com/', '/(index|view)', cache=DiskCache())
```

We are still missing the methods that load and store data by filename — the interface the Downloader class relies on in result = cache[url] and cache[url] = result, that is, the __getitem__() and __setitem__() special methods.
```python
import pickle

class DiskCache:
    def __init__(self, cache_dir='cache', expires=timedelta(days=30), compress=True):
        ...

    def url_to_path(self, url):
        ...

    def __getitem__(self, url):
        ...

    def __setitem__(self, url, result):
        """Save data to disk for this url
        """
        path = self.url_to_path(url)
        folder = os.path.dirname(path)
        if not os.path.exists(folder):
            os.makedirs(folder)
        with open(path, 'wb') as fp:
            fp.write(pickle.dumps(result))
```

In __setitem__(), we use url_to_path() to map the URL to a safe filename and create the parent directory if necessary. The pickle module serializes the input into a string, which is then written to disk.
```python
import pickle

class DiskCache:
    def __init__(self, cache_dir='cache', expires=timedelta(days=30), compress=True):
        ...

    def url_to_path(self, url):
        ...

    def __getitem__(self, url):
        """Load data from disk for this URL
        """
        path = self.url_to_path(url)
        if os.path.exists(path):
            with open(path, 'rb') as fp:
                return pickle.loads(fp.read())
        else:
            # URL has not yet been cached
            raise KeyError(url + ' does not exist')

    def __setitem__(self, url, result):
        ...
```

In __getitem__(), we again use url_to_path() to map the URL to a safe filename. If the file exists, its contents are loaded and deserialized back to the original data; if not, the URL has not been cached yet and a KeyError is raised.
2.2 Testing the cache
We can time the runs by prefixing the python command with `time`. For a website served locally, the crawl takes 0m58.710s with an empty cache, while the second run, served entirely from the cache, takes only 0m0.221s — more than 265 times faster. Crawling a website hosted on a remote server would take even longer.
```
wu_being@Ubuntukylin64:~/GitHub/WebScrapingWithPython/3.下載緩存$ time python 2disk_cache_Nozip127.py
Downloading: http://127.0.0.1:8000/places/
Downloading: http://127.0.0.1:8000/places/default/index/1
...
Downloading: http://127.0.0.1:8000/places/default/view/Afghanistan-1

real    0m58.710s
user    0m0.684s
sys     0m0.120s
wu_being@ubuntukylin64:~/GitHub/WebScrapingWithPython/3.下載緩存$ time python 2disk_cache_Nozip127.py

real    0m0.221s
user    0m0.204s
sys     0m0.012s
```

2.3 Saving disk space
To reduce the space the cache takes up, we can compress the downloaded HTML before saving it, simply by compressing the serialized string with zlib:
```python
fp.write(zlib.compress(pickle.dumps(result)))
```

and decompressing when loading from disk:
```python
return pickle.loads(zlib.decompress(fp.read()))
```

With all pages compressed, the cache shrinks from 2.8 MB to 821.2 KB, at the cost of a small increase in crawl time.
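If you want to check the on-disk size of the cache yourself, a minimal sketch follows, assuming the cache lives in a local `cache` directory (running `du -sh cache` from the shell gives a similar answer):

```python
import os

# sum the sizes of every file below the cache directory
total = 0
for dirpath, dirnames, filenames in os.walk('cache'):
    for name in filenames:
        total += os.path.getsize(os.path.join(dirpath, name))
print '%.1f KB' % (total / 1024.0)
```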
Here is the timed run with compression enabled:

```
wu_being@ubuntukylin64:~/GitHub/WebScrapingWithPython/3.下載緩存$ time python 2disk_cache.py
Downloading: http://127.0.0.1:8000/places/
Downloading: http://127.0.0.1:8000/places/default/index/1
...
Downloading: http://127.0.0.1:8000/places/default/view/Afghanistan-1

real    1m0.011s
user    0m0.800s
sys     0m0.104s
wu_being@ubuntukylin64:~/GitHub/WebScrapingWithPython/3.下載緩存$
wu_being@ubuntukylin64:~/GitHub/WebScrapingWithPython/3.下載緩存$ time python 2disk_cache.py

real    0m0.252s
user    0m0.228s
sys     0m0.020s
wu_being@ubuntukylin64:~/GitHub/WebScrapingWithPython/3.下載緩存$
```

2.4 Expiring stale data
In this section we add an expiry time to cached data, so the crawler knows when a cached page needs to be re-downloaded. In the constructor we use a timedelta object to set the default expiry to 30 days; in __setitem__() we store the current timestamp alongside the serialized data; and in __getitem__() we compare the current time with the cached timestamp to check whether the entry has expired.
```python
from datetime import datetime, timedelta

class DiskCache:
    def __init__(self, cache_dir='cache', expires=timedelta(days=30), compress=True):
        """
        cache_dir: the root level folder for the cache
        expires: timedelta of amount of time before a cache entry is considered expired
        compress: whether to compress data in the cache
        """
        self.cache_dir = cache_dir
        self.expires = expires
        self.compress = compress

    def __getitem__(self, url):
        """Load data from disk for this URL
        """
        path = self.url_to_path(url)
        if os.path.exists(path):
            with open(path, 'rb') as fp:
                data = fp.read()
                if self.compress:
                    data = zlib.decompress(data)
                result, timestamp = pickle.loads(data)
                if self.has_expired(timestamp):
                    raise KeyError(url + ' has expired')
                return result
        else:
            # URL has not yet been cached
            raise KeyError(url + ' does not exist')

    def __setitem__(self, url, result):
        """Save data to disk for this url
        """
        path = self.url_to_path(url)
        folder = os.path.dirname(path)
        if not os.path.exists(folder):
            os.makedirs(folder)
        data = pickle.dumps((result, datetime.utcnow()))
        if self.compress:
            data = zlib.compress(data)
        with open(path, 'wb') as fp:
            fp.write(data)

    ...

    def has_expired(self, timestamp):
        """Return whether this timestamp has expired
        """
        return datetime.utcnow() > timestamp + self.expires
```

To test the expiry logic, we can shorten the timeout to 5 seconds, as follows:
""" Dictionary interface that stores cached values in the file system rather than in memory. The file path is formed from an md5 hash of the key. """>>> from disk_cache import DiskCache>>> cache=DiskCache()>>> url='http://www.baidu.com'>>> result={'html':'<html>...','code':200}>>> cache[url]=result>>> cache[url]{'code': 200, 'html': '<html>...'}>>> cache[url]['html']==result['html']True>>> >>> from datetime import timedelta>>> cache2=DiskCache(expires=timedelta(seconds=5))>>> url2='https://www.baidu.sss'>>> result2={'html':'<html>..ss.','code':500}>>> cache2[url2]=result2>>> cache2[url2]{'code': 200, 'html': '<html>...'}>>> cache2[url2]{'code': 200, 'html': '<html>...'}>>> cache2[url2]{'code': 200, 'html': '<html>...'}>>> cache2[url2]{'code': 200, 'html': '<html>...'}>>> cache2[url2]Traceback (most recent call last): File "<stdin>", line 1, in <module> File "disk_cache.py", line 57, in __getitem__ raise KeyError(url + ' has expired')KeyError: 'http://www.baidu.com has expired'>>> cache2.clear()2.5用磁盤緩存的缺點
Because we had to work around file system restrictions by mapping URLs to safe filenames, some problems arise (a short demonstration of the first one follows the list):

- Different URLs can be mapped to the same filename, for example .../count.asp?a+b and .../count.asp?a*b.
- Filenames truncated to 255 characters can also collide, since URLs can be more than 2,000 characters long.
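Here is a minimal sketch of the first collision, reusing the same character-replacement rule as url_to_path(); the host in the example URLs is made up for illustration:

```python
import re

def safe_name(url):
    # same replacement rule used by the disk cache
    return re.sub('[^/0-9a-zA-Z\-.,;_ ]', '_', url)

url_a = 'http://example.com/count.asp?a+b'
url_b = 'http://example.com/count.asp?a*b'
# '?', '+' and '*' are all replaced by '_', so the two URLs collide
print safe_name(url_a) == safe_name(url_b)  # True
```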
Using a hash of the URL as the filename would improve on this (a sketch of the idea follows the list), but some problems remain:

- Each volume and each directory can hold only a limited number of files. The FAT32 file system allows at most 65,535 files per directory, though the cache could be split across multiple directories.
- The total number of files a file system can store is also limited. An ext4 partition currently supports a little over 15 million files, whereas a large website can easily have more than 100 million pages.
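A minimal sketch of the hashed-filename idea — note this is not the approach the DiskCache in this chapter uses, and the helper name is just illustrative:

```python
import hashlib
import os

def url_to_hashed_path(url, cache_dir='cache'):
    # md5 gives a fixed-length, file-system-safe name for any URL
    return os.path.join(cache_dir, hashlib.md5(url).hexdigest())

print url_to_hashed_path('http://example.webscraping.com/default/view/Australia-1')
```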
To avoid these problems we would need to combine multiple cached pages into a single file, indexed with a B+ tree or similar structure. Rather than implementing this ourselves, in the next section we will use a database that already implements it.
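For a flavour of the single-file approach, here is a minimal sketch using Python's standard shelve module, a dbm-backed, dict-like store that keeps many records in one file. It is only an illustration of the idea, not what the next section uses:

```python
import shelve

# one store on disk holds every cached page, keyed by URL
db = shelve.open('cache_single_file')
db['http://example.webscraping.com/'] = {'html': '<html>...', 'code': 200}
print db['http://example.webscraping.com/']['code']  # 200
db.close()
```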
3 Database cache
When crawling, we may need to cache a very large amount of data, but we never need complex joins across tables, so we will use a NoSQL database, which is easier to scale than a traditional relational database. In this section we use the currently very popular MongoDB as our cache database.
3.1 What is NoSQL?
NoSQL stands for Not Only SQL and is a relatively new approach to database design. The traditional relational model uses a fixed schema and splits data across tables. For large datasets, the data becomes too big for a single server and needs to be scaled across multiple servers. The relational model does not handle this well, because queries that join multiple tables may need data that lives on different servers. NoSQL databases, by contrast, are generally schemaless and are designed from the start to shard seamlessly across servers. Several families of NoSQL databases achieve this in different ways (a hypothetical document example follows the list):

- column data stores, such as HBase;
- key-value stores, such as Redis;
- graph databases, such as Neo4j;
- document-oriented databases, such as MongoDB.
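To make the document-oriented flavour concrete, here is a hypothetical cached page expressed as a single MongoDB-style document: everything about the page lives in one record, with no schema to declare and no joins to run. The field names are illustrative only.

```python
# a hypothetical cached-page document
cached_page = {
    '_id': 'http://example.webscraping.com/view/China-47',  # the URL as the key
    'html': '<html>...</html>',
    'code': 200,
    'timestamp': '2017-01-17T21:20:46Z',
}
```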
3.2 Installing MongoDB
MongoDB can be downloaded from https://www.mongodb.org/downloads. Then install its Python wrapper library:
```
pip install pymongo
```

To check that the installation works, start a MongoDB server locally:
```
wu_being@ubuntukylin64:~/GitHub/WebScrapingWithPython/3.下載緩存$ mongod -dbpath MongoD
2017-01-17T21:20:46.224+0800 [initandlisten] MongoDB starting : pid=1978 port=27017 dbpath=MongoD 64-bit host=ubuntukylin64
2017-01-17T21:20:46.224+0800 [initandlisten] db version v2.6.10
2017-01-17T21:20:46.224+0800 [initandlisten] git version: nogitversion
2017-01-17T21:20:46.225+0800 [initandlisten] OpenSSL version: OpenSSL 1.0.2g  1 Mar 2016
2017-01-17T21:20:46.225+0800 [initandlisten] build info: Linux lgw01-12 3.19.0-25-generic #26~14.04.1-Ubuntu SMP Fri Jul 24 21:16:20 UTC 2015 x86_64 BOOST_LIB_VERSION=1_58
2017-01-17T21:20:46.225+0800 [initandlisten] allocator: tcmalloc
2017-01-17T21:20:46.225+0800 [initandlisten] options: { storage: { dbPath: "MongoD" } }
2017-01-17T21:20:46.269+0800 [initandlisten] journal dir=MongoD/journal
2017-01-17T21:20:46.270+0800 [initandlisten] recover : no journal files present, no recovery needed
2017-01-17T21:20:49.126+0800 [initandlisten] preallocateIsFaster=true 33.72
2017-01-17T21:20:51.932+0800 [initandlisten] preallocateIsFaster=true 32.7
2017-01-17T21:20:55.729+0800 [initandlisten] preallocateIsFaster=true 32.36
2017-01-17T21:20:55.730+0800 [initandlisten] preallocateIsFaster check took 9.459 secs
2017-01-17T21:20:55.730+0800 [initandlisten] preallocating a journal file MongoD/journal/prealloc.0
2017-01-17T21:20:58.042+0800 [initandlisten] File Preallocator Progress: 608174080/1073741824 56%
2017-01-17T21:21:03.290+0800 [initandlisten] File Preallocator Progress: 744488960/1073741824 69%
2017-01-17T21:21:08.043+0800 [initandlisten] File Preallocator Progress: 954204160/1073741824 88%
2017-01-17T21:21:18.347+0800 [initandlisten] preallocating a journal file MongoD/journal/prealloc.1
2017-01-17T21:21:21.166+0800 [initandlisten] File Preallocator Progress: 639631360/1073741824 59%
2017-01-17T21:21:26.328+0800 [initandlisten] File Preallocator Progress: 754974720/1073741824 70%
...
```

Then, from Python, try to connect to MongoDB on its default port:
```python
>>> from pymongo import MongoClient
>>> client = MongoClient('localhost', 27017)
```

3.3 MongoDB overview
Here is some example MongoDB code:
```python
>>> from pymongo import MongoClient
>>> client = MongoClient('localhost', 27017)
>>> url = 'http://www.baidu.com/view/China-47'
>>> html = '...<html>...'
>>> db = client.cache
>>> db.webpage.insert({'url': url, 'html': html})
ObjectId('587e2cb26b00c10b956e0be9')
>>> db.webpage.find_one({'url': url})
{u'url': u'http://www.baidu.com/view/China-47', u'_id': ObjectId('587e2cb26b00c10b956e0be9'), u'html': u'...<html>...'}
>>> db.webpage.find({'url': url})
<pymongo.cursor.Cursor object at 0x7fcde0ca60d0>
>>> db.webpage.find({'url': url}).count()
1
```

If we insert the same record again, MongoDB happily accepts and executes the operation, but a query shows that the existing record was not updated — we simply end up with a duplicate:
```python
>>> db.webpage.insert({'url': url, 'html': html})
ObjectId('587e2d546b00c10b956e0bea')
>>> db.webpage.find({'url': url}).count()
2
>>> db.webpage.find_one({'url': url})
{u'url': u'http://www.baidu.com/view/China-47', u'_id': ObjectId('587e2cb26b00c10b956e0be9'), u'html': u'...<html>...'}
```

To store only the latest record and avoid duplicates, we set the record ID to the URL and perform an upsert, which updates the record if it exists and inserts a new one otherwise:
```python
>>> new_html = '<...>...'
>>> db.webpage.update({'_id': url}, {'$set': {'html': new_html}}, upsert=True)
{'updatedExisting': True, u'nModified': 1, u'ok': 1, u'n': 1}
>>> db.webpage.find_one({'_id': url})
{u'_id': u'http://www.baidu.com/view/China-47', u'html': u'<...>...'}
>>> db.webpage.find({'_id': url}).count()
1
>>> db.webpage.update({'_id': url}, {'$set': {'html': new_html}}, upsert=True)
{'updatedExisting': True, u'nModified': 0, u'ok': 1, u'n': 1}
>>> db.webpage.find({'_id': url}).count()
1
>>>
```

MongoDB official documentation: http://docs.mongodb.org/manual/
3.4 MongoDB cache implementation
Now we are ready to build the MongoDB-backed cache, using the same interface as the earlier DiskCache class. In the constructor below we create an index on the timestamp field; with this convenient MongoDB feature, records are automatically expired and deleted once the given timestamp has passed.
```python
import pickle
from datetime import datetime, timedelta
from pymongo import MongoClient

class MongoCache:
    def __init__(self, client=None, expires=timedelta(days=30)):
        """
        client: mongo database client
        expires: timedelta of amount of time before a cache entry is considered expired
        """
        # if a client object is not passed
        # then try connecting to mongodb at the default localhost port
        self.client = MongoClient('localhost', 27017) if client is None else client
        # create collection to store cached webpages,
        # which is the equivalent of a table in a relational database
        self.db = self.client.cache
        self.db.webpage.create_index('timestamp',
            expireAfterSeconds=expires.total_seconds())

    def __getitem__(self, url):
        """Load value at this URL
        """
        record = self.db.webpage.find_one({'_id': url})
        if record:
            return record['result']
        else:
            raise KeyError(url + ' does not exist')

    def __setitem__(self, url, result):
        """Save value for this URL
        """
        record = {'result': result, 'timestamp': datetime.utcnow()}
        self.db.webpage.update({'_id': url}, {'$set': record}, upsert=True)
```

Let's test this MongoCache class. We use a default (zero-interval) timedelta() object, so records should be deleted as soon as they are created — yet they are not. This is down to how MongoDB works: a background task checks for expired records once every minute. If we wait another minute, we can see that the cache expiry has indeed taken effect.
```python
>>> from mongo_cache import MongoCache
>>> from datetime import timedelta
>>> cache = MongoCache(expires=timedelta())
>>> url = 'http://www.baidu.com/view/China-47'
>>> result = {'html': '.....'}
>>> cache[url] = result
>>> cache[url]
{'html': '.....'}
>>> cache[url]
{'html': '.....'}
>>> import time; time.sleep(60)
>>> cache[url]
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "mongo_cache.py", line 62, in __getitem__
    raise KeyError(url + ' does not exist')
KeyError: 'http://www.baidu.com/view/China-47 does not exist'
>>>
```

3.5 Compressed storage
To keep the cache small, we again serialize with pickle and compress with zlib, storing the result as a bson Binary, just as we did for the disk cache:

```python
import pickle
import zlib
from bson.binary import Binary

class MongoCache:

    def __getitem__(self, url):
        """Load value at this URL
        """
        record = self.db.webpage.find_one({'_id': url})
        if record:
            #return record['result']
            return pickle.loads(zlib.decompress(record['result']))
        else:
            raise KeyError(url + ' does not exist')

    def __setitem__(self, url, result):
        """Save value for this URL
        """
        #record = {'result': result, 'timestamp': datetime.utcnow()}
        record = {'result': Binary(zlib.compress(pickle.dumps(result))),
                  'timestamp': datetime.utcnow()}
        self.db.webpage.update({'_id': url}, {'$set': record}, upsert=True)
```

3.6 Testing the cache
Timing the MongoDB-backed crawler the same way as before:

```
wu_being@ubuntukylin64:~/GitHub/WebScrapingWithPython/3.下載緩存$ time python 3mongo_cache.py
Downloading: http://127.0.0.1:8000/places/
Downloading: http://127.0.0.1:8000/places/default/index/1
Downloading: http://127.0.0.1:8000/places/default/index/2
...
Downloading: http://127.0.0.1:8000/places/default/view/Algeria-4
Downloading: http://127.0.0.1:8000/places/default/view/Albania-3
Downloading: http://127.0.0.1:8000/places/default/view/Aland-Islands-2
Downloading: http://127.0.0.1:8000/places/default/view/Afghanistan-1

real    0m59.239s
user    0m1.164s
sys     0m0.108s
wu_being@ubuntukylin64:~/GitHub/WebScrapingWithPython/3.下載緩存$ time python 3mongo_cache.py

real    0m0.695s
user    0m0.408s
sys     0m0.044s
```

As you can see, reading from the database cache takes roughly twice as long as the disk cache, but it successfully avoids the disk cache's drawbacks.
3.7 Full MongoDB cache code
```python
try:
    import cPickle as pickle
except ImportError:
    import pickle
import zlib
from datetime import datetime, timedelta
from pymongo import MongoClient
from bson.binary import Binary
from link_crawler import link_crawler


class MongoCache:
    """ Wrapper around MongoDB to cache downloads

        >>> cache = MongoCache()
        >>> cache.clear()
        >>> url = 'http://example.webscraping.com'
        >>> result = {'html': '...'}
        >>> cache[url] = result
        >>> cache[url]['html'] == result['html']
        True
        >>> cache = MongoCache(expires=timedelta())
        >>> cache[url] = result
        >>> # every 60 seconds is purged http://docs.mongodb.org/manual/core/index-ttl/
        >>> import time; time.sleep(60)
        >>> cache[url]
        Traceback (most recent call last):
         ...
        KeyError: 'http://example.webscraping.com does not exist'
    """

    def __init__(self, client=None, expires=timedelta(days=30)):
        """
        client: mongo database client
        expires: timedelta of amount of time before a cache entry is considered expired
        """
        # if a client object is not passed
        # then try connecting to mongodb at the default localhost port
        self.client = MongoClient('localhost', 27017) if client is None else client
        # create collection to store cached webpages,
        # which is the equivalent of a table in a relational database
        self.db = self.client.cache
        # the TTL field is named 'timestamp100s' here (earlier sections used 'timestamp')
        self.db.webpage.create_index('timestamp100s',
            expireAfterSeconds=expires.total_seconds())

    def __contains__(self, url):
        try:
            self[url]
        except KeyError:
            return False
        else:
            return True

    def __getitem__(self, url):
        """Load value at this URL
        """
        record = self.db.webpage.find_one({'_id': url})
        if record:
            #return record['result']
            return pickle.loads(zlib.decompress(record['result']))
        else:
            raise KeyError(url + ' does not exist')

    def __setitem__(self, url, result):
        """Save value for this URL
        """
        #record = {'result': result, 'timestamp': datetime.utcnow()}
        record = {'result': Binary(zlib.compress(pickle.dumps(result))),
                  'timestamp100s': datetime.utcnow()}
        self.db.webpage.update({'_id': url}, {'$set': record}, upsert=True)

    def clear(self):
        self.db.webpage.drop()
        print 'drop() successful'


if __name__ == '__main__':
    #link_crawler('http://example.webscraping.com/', '/(index|view)', cache=MongoCache())
    #link_crawler('http://127.0.0.1:8000/places/', '/places/default/(index|view)/', cache=MongoCache())
    link_crawler('http://127.0.0.1:8000/places/', '/places/default/(index|view)/',
                 cache=MongoCache(expires=timedelta(seconds=100)))
```

Wu_Being's blog notice: you are welcome to repost this article, but please credit the original post and link. Thanks!
Python crawler series: 《【Python爬蟲3】在下載的本地緩存做爬蟲》 http://blog.csdn.net/u014134180/article/details/55506984
Code files for the Python crawler series on GitHub: https://github.com/1040003585/WebScrapingWithPython