- 1 Adding cache support to the link crawler
- 2 Disk cache
  - 2.1 Implementing the disk cache
  - 2.2 Testing the cache
  - 2.3 Saving disk space
  - 2.4 Expiring stale data
  - 2.5 Drawbacks of the disk cache
- 3 Database cache
  - 3.1 What is NoSQL?
  - 3.2 Installing MongoDB
  - 3.3 MongoDB overview
  - 3.4 MongoDB cache implementation
  - 3.5 Compressed storage
  - 3.6 Testing the cache
  - 3.7 Full MongoDB cache code
In the previous article we learned how to extract data from web pages and save the results to a spreadsheet. If we later want to extract another field, we have to download the whole website again. That is not a big deal for our small example site, but for a site with millions of pages it could take weeks. So in this article we cache the downloaded pages, so that each page is downloaded only once.
1 Adding cache support to the link crawler
We refactor the downloader into a class so that its parameters are set once in the constructor and reused across many downloads; the class checks the cache before downloading a URL, and the throttling logic moves inside it. The `__call__` special method of the `Downloader` class checks the cache first: if the URL is already cached, it then checks whether that cached download hit a server error. If neither is a problem, the cached result is used; otherwise the URL is downloaded as usual and the result is saved to the cache. The `download` method now returns the HTTP status code along with the HTML, so that error codes can be stored in the cache and checked later. If you do not need throttling or caching, you can call `download` directly, bypassing `__call__`.

```python
class Downloader:
    def __init__(self, delay=5, user_agent='Wu_Being', proxies=None,
                 num_retries=1, cache=None):
        self.throttle = Throttle(delay)
        self.user_agent = user_agent
        self.proxies = proxies
        self.num_retries = num_retries
        self.cache = cache

    def __call__(self, url):
        result = None
        if self.cache:
            try:
                result = self.cache[url]
            except KeyError:
                # url is not available in cache
                pass
            else:
                if self.num_retries > 0 and 500 <= result['code'] < 600:
                    # server error so ignore result from cache and re-download
                    result = None
        if result is None:
            # result was not loaded from cache so still need to download
            self.throttle.wait(url)
            proxy = random.choice(self.proxies) if self.proxies else None
            headers = {'User-agent': self.user_agent}
            result = self.download(url, headers, proxy=proxy,
                                   num_retries=self.num_retries)
            if self.cache:
                # save result to cache
                self.cache[url] = result
        return result['html']

    def download(self, url, headers, proxy, num_retries, data=None):
        print 'Downloading:', url
        ...
        return {'html': html, 'code': code}


class Throttle:
    def __init__(self, delay):
        ...

    def wait(self, url):
        ...
```

To support caching, the link crawler also needs a few tweaks: adding a `cache` parameter, removing the throttling code, and replacing the `download` function with the new class.
```python
from downloader import Downloader

def link_crawler(... cache=None):
    crawl_queue = [seed_url]
    seen = {seed_url: 0}
    # track how many URL's have been downloaded
    num_urls = 0
    rp = get_robots(seed_url)
    #cache.clear() ###############################
    D = Downloader(delay=delay, user_agent=user_agent, proxies=proxies,
                   num_retries=num_retries, cache=cache)

    while crawl_queue:
        url = crawl_queue.pop()
        depth = seen[url]
        # check url passes robots.txt restrictions
        if rp.can_fetch(user_agent, url):
            html = D(url)  # invokes Downloader.__call__(url)
            links = []
            ...

def normalize(seed_url, link):
    ...

def same_domain(url1, url2):
    ...

def get_robots(url):
    ...

def get_links(html):
    ...

"""
if __name__ == '__main__':
    link_crawler('http://example.webscraping.com', '/(index|view)', delay=0, num_retries=1, user_agent='BadCrawler')
    link_crawler('http://example.webscraping.com', '/(index|view)', delay=0, num_retries=1, max_depth=1, user_agent='GoodCrawler')
"""
```

Now that the basic structure of the cache-aware crawler is in place, we can start building the cache itself.
2 Disk cache

To cache downloads on disk, we will save each page in a file whose path is derived from its URL. File systems restrict what those paths can look like:
| Operating system | File system | Invalid filename characters | Maximum filename length |
| --- | --- | --- | --- |
| Linux | Ext3/Ext4 | / and \0 | 255 bytes |
| OS X | HFS Plus | : and \0 | 255 UTF-16 code units |
| Windows | NTFS | \, /, ?, :, *, ", >, < and \| | 255 characters |
To keep our file paths safe across these file systems, we restrict them to digits, letters, and a few basic punctuation characters, and replace everything else with an underscore.
```python
>>> import re
>>> url = "http://example.webscraping.com/default/view/australia-1"
>>> re.sub('[^/0-9a-zA-Z\-,.;_ ]', '_', url)
'http_//example.webscraping.com/default/view/australia-1'
```

In addition, the filename and each directory name need to be limited to 255 characters.
```python
>>> filename = re.sub('[^/0-9a-zA-Z\-,.;_ ]', '_', url)
>>> filename = '/'.join(segment[:255] for segment in filename.split('/'))
>>> print filename
http_//example.webscraping.com/default/view/australia-1
>>> print '#'.join(segment[:5] for segment in filename.split('/'))
http_##examp#defau#view#austr
>>>
```

There is one more edge case: a URL whose path ends with a slash. Splitting such a URL leaves an empty final segment, which would be an invalid filename. For example:

- http://example.webscraping.com/index/
- http://example.webscraping.com/index/1
For the first URL we can append index.html as the filename, so that index becomes a directory containing index.html; the second URL then maps to a file named 1 inside that same index directory, and the two paths no longer conflict.
```python
>>> import urlparse
>>> components = urlparse.urlsplit('http://exmaple.scraping.com/index/')
>>> print components
SplitResult(scheme='http', netloc='exmaple.scraping.com', path='/index/', query='', fragment='')
>>> print components.path
/index/
>>> path = components.path
>>> if not path:
...     path = '/index.html'
... elif path.endswith('/'):
...     path += 'index.html'
...
>>> filename = components.netloc + path + components.query
>>> filename
'exmaple.scraping.com/index/index.html'
>>>
```

2.1 Implementing the disk cache
We can now combine the full URL-to-directory-and-filename mapping logic into the main part of the disk cache. The constructor takes a parameter that sets the cache location, and the url_to_path method applies the filename restrictions discussed above.
```python
from link_crawler import link_crawler

class DiskCache:
    def __init__(self, cache_dir='cache', ...):
        """
        cache_dir: the root level folder for the cache
        """
        self.cache_dir = cache_dir
        ...

    def url_to_path(self, url):
        """Create file system path for this URL
        """
        components = urlparse.urlsplit(url)
        # when empty path set to /index.html
        path = components.path
        if not path:
            path = '/index.html'
        elif path.endswith('/'):
            path += 'index.html'
        filename = components.netloc + path + components.query
        # replace invalid characters
        filename = re.sub('[^/0-9a-zA-Z\-.,;_ ]', '_', filename)
        # restrict maximum number of characters
        filename = '/'.join(segment[:255] for segment in filename.split('/'))
        # join the cache directory and the filename into a full path
        return os.path.join(self.cache_dir, filename)

    def __getitem__(self, url):
        ...

    def __setitem__(self, url, result):
        ...

    def __delitem__(self, url):
        ...

    def has_expired(self, timestamp):
        ...

    def clear(self):
        ...


if __name__ == '__main__':
    link_crawler('http://example.webscraping.com/', '/(index|view)', cache=DiskCache())
```

We are still missing the methods that load and store data by filename — the interface the Downloader class relies on in result = cache[url] and cache[url] = result, that is, the __getitem__() and __setitem__() special methods.
```python
import pickle

class DiskCache:
    def __init__(self, cache_dir='cache', expires=timedelta(days=30), compress=True):
        ...

    def url_to_path(self, url):
        ...

    def __getitem__(self, url):
        ...

    def __setitem__(self, url, result):
        """Save data to disk for this url
        """
        path = self.url_to_path(url)
        folder = os.path.dirname(path)
        if not os.path.exists(folder):
            os.makedirs(folder)
        with open(path, 'wb') as fp:
            fp.write(pickle.dumps(result))
```

In __setitem__(), we use url_to_path() to map the URL to a safe filename and create the parent directory if necessary. The pickle module serializes the input into a string, which is then written to disk.
```python
import pickle

class DiskCache:
    def __init__(self, cache_dir='cache', expires=timedelta(days=30), compress=True):
        ...

    def url_to_path(self, url):
        ...

    def __getitem__(self, url):
        """Load data from disk for this URL
        """
        path = self.url_to_path(url)
        if os.path.exists(path):
            with open(path, 'rb') as fp:
                return pickle.loads(fp.read())
        else:
            # URL has not yet been cached
            raise KeyError(url + ' does not exist')

    def __setitem__(self, url, result):
        ...
```

In __getitem__(), we again use url_to_path() to map the URL to a safe filename. If the file exists, its contents are loaded and deserialized back to the original data; if not, the URL has not been cached yet and a KeyError is raised.
2.2 Testing the cache
We can time the runs by prefixing the python command with `time`. For a website served locally, the crawl takes 0m58.710s with an empty cache, while the second run, served entirely from the cache, takes only 0m0.221s — more than 265 times faster. Crawling a website hosted on a remote server would take even longer.
```
wu_being@Ubuntukylin64:~/GitHub/WebScrapingWithPython/3.下載緩存$ time python 2disk_cache_Nozip127.py
Downloading: http://127.0.0.1:8000/places/
Downloading: http://127.0.0.1:8000/places/default/index/1
...
Downloading: http://127.0.0.1:8000/places/default/view/Afghanistan-1

real    0m58.710s
user    0m0.684s
sys     0m0.120s
wu_being@ubuntukylin64:~/GitHub/WebScrapingWithPython/3.下載緩存$ time python 2disk_cache_Nozip127.py

real    0m0.221s
user    0m0.204s
sys     0m0.012s
```

2.3 Saving disk space
To reduce the space the cache takes up, we can compress the downloaded HTML before saving it, simply by compressing the serialized string with zlib:
```python
fp.write(zlib.compress(pickle.dumps(result)))
```

and decompressing when loading from disk:
```python
return pickle.loads(zlib.decompress(fp.read()))
```

With all pages compressed, the cache shrinks from 2.8 MB to 821.2 KB, at the cost of a small increase in crawl time.
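If you want to check the on-disk size of the cache yourself, a minimal sketch follows, assuming the cache lives in a local `cache` directory (running `du -sh cache` from the shell gives a similar answer):

```python
import os

# sum the sizes of every file below the cache directory
total = 0
for dirpath, dirnames, filenames in os.walk('cache'):
    for name in filenames:
        total += os.path.getsize(os.path.join(dirpath, name))
print '%.1f KB' % (total / 1024.0)
```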
Here is the timed run with compression enabled:

```
wu_being@ubuntukylin64:~/GitHub/WebScrapingWithPython/3.下載緩存$ time python 2disk_cache.py
Downloading: http://127.0.0.1:8000/places/
Downloading: http://127.0.0.1:8000/places/default/index/1
...
Downloading: http://127.0.0.1:8000/places/default/view/Afghanistan-1

real    1m0.011s
user    0m0.800s
sys     0m0.104s
wu_being@ubuntukylin64:~/GitHub/WebScrapingWithPython/3.下載緩存$
wu_being@ubuntukylin64:~/GitHub/WebScrapingWithPython/3.下載緩存$ time python 2disk_cache.py

real    0m0.252s
user    0m0.228s
sys     0m0.020s
wu_being@ubuntukylin64:~/GitHub/WebScrapingWithPython/3.下載緩存$
```

2.4 Expiring stale data
In this section we add an expiry time to cached data, so the crawler knows when a cached page needs to be re-downloaded. In the constructor we use a timedelta object to set the default expiry to 30 days; in __setitem__() we store the current timestamp alongside the serialized data; and in __getitem__() we compare the current time with the cached timestamp to check whether the entry has expired.
```python
from datetime import datetime, timedelta

class DiskCache:
    def __init__(self, cache_dir='cache', expires=timedelta(days=30), compress=True):
        """
        cache_dir: the root level folder for the cache
        expires: timedelta of amount of time before a cache entry is considered expired
        compress: whether to compress data in the cache
        """
        self.cache_dir = cache_dir
        self.expires = expires
        self.compress = compress

    def __getitem__(self, url):
        """Load data from disk for this URL
        """
        path = self.url_to_path(url)
        if os.path.exists(path):
            with open(path, 'rb') as fp:
                data = fp.read()
                if self.compress:
                    data = zlib.decompress(data)
                result, timestamp = pickle.loads(data)
                if self.has_expired(timestamp):
                    raise KeyError(url + ' has expired')
                return result
        else:
            # URL has not yet been cached
            raise KeyError(url + ' does not exist')

    def __setitem__(self, url, result):
        """Save data to disk for this url
        """
        path = self.url_to_path(url)
        folder = os.path.dirname(path)
        if not os.path.exists(folder):
            os.makedirs(folder)
        data = pickle.dumps((result, datetime.utcnow()))
        if self.compress:
            data = zlib.compress(data)
        with open(path, 'wb') as fp:
            fp.write(data)

    ...

    def has_expired(self, timestamp):
        """Return whether this timestamp has expired
        """
        return datetime.utcnow() > timestamp + self.expires
```

To test the expiry logic, we can shorten the timeout to 5 seconds, as follows:
""" Dictionary interface that stores cached values in the file system rather than in memory. The file path is formed from an md5 hash of the key. """>>> from disk_cache import DiskCache>>> cache=DiskCache()>>> url='http://www.baidu.com'>>> result={'html':'<html>...','code':200}>>> cache[url]=result>>> cache[url]{'code': 200, 'html': '<html>...'}>>> cache[url]['html']==result['html']True>>> >>> from datetime import timedelta>>> cache2=DiskCache(expires=timedelta(seconds=5))>>> url2='https://www.baidu.sss'>>> result2={'html':'<html>..ss.','code':500}>>> cache2[url2]=result2>>> cache2[url2]{'code': 200, 'html': '<html>...'}>>> cache2[url2]{'code': 200, 'html': '<html>...'}>>> cache2[url2]{'code': 200, 'html': '<html>...'}>>> cache2[url2]{'code': 200, 'html': '<html>...'}>>> cache2[url2]Traceback (most recent call last): File "<stdin>", line 1, in <module> File "disk_cache.py", line 57, in __getitem__ raise KeyError(url + ' has expired')KeyError: 'http://www.baidu.com has expired'>>> cache2.clear()2.5用磁盤緩存的缺點
Because we had to work around file system restrictions by mapping URLs to safe filenames, some problems arise (a short demonstration of the first one follows the list):

- Different URLs can be mapped to the same filename, for example .../count.asp?a+b and .../count.asp?a*b.
- Filenames truncated to 255 characters can also collide, since URLs can be more than 2,000 characters long.
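Here is a minimal sketch of the first collision, reusing the same character-replacement rule as url_to_path(); the host in the example URLs is made up for illustration:

```python
import re

def safe_name(url):
    # same replacement rule used by the disk cache
    return re.sub('[^/0-9a-zA-Z\-.,;_ ]', '_', url)

url_a = 'http://example.com/count.asp?a+b'
url_b = 'http://example.com/count.asp?a*b'
# '?', '+' and '*' are all replaced by '_', so the two URLs collide
print safe_name(url_a) == safe_name(url_b)  # True
```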
Using a hash of the URL as the filename would improve on this (a sketch of the idea follows the list), but some problems remain:

- Each volume and each directory can hold only a limited number of files. The FAT32 file system allows at most 65,535 files per directory, though the cache could be split across multiple directories.
- The total number of files a file system can store is also limited. An ext4 partition currently supports a little over 15 million files, whereas a large website can easily have more than 100 million pages.
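A minimal sketch of the hashed-filename idea — note this is not the approach the DiskCache in this chapter uses, and the helper name is just illustrative:

```python
import hashlib
import os

def url_to_hashed_path(url, cache_dir='cache'):
    # md5 gives a fixed-length, file-system-safe name for any URL
    return os.path.join(cache_dir, hashlib.md5(url).hexdigest())

print url_to_hashed_path('http://example.webscraping.com/default/view/Australia-1')
```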
To avoid these problems we would need to combine multiple cached pages into a single file, indexed with a B+ tree or similar structure. Rather than implementing this ourselves, in the next section we will use a database that already implements it.
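For a flavour of the single-file approach, here is a minimal sketch using Python's standard shelve module, a dbm-backed, dict-like store that keeps many records in one file. It is only an illustration of the idea, not what the next section uses:

```python
import shelve

# one store on disk holds every cached page, keyed by URL
db = shelve.open('cache_single_file')
db['http://example.webscraping.com/'] = {'html': '<html>...', 'code': 200}
print db['http://example.webscraping.com/']['code']  # 200
db.close()
```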
3 Database cache
When crawling, we may need to cache a very large amount of data, but we never need complex joins across tables, so we will use a NoSQL database, which is easier to scale than a traditional relational database. In this section we use the currently very popular MongoDB as our cache database.
3.1 What is NoSQL?
NoSQL stands for Not Only SQL and is a relatively new approach to database design. The traditional relational model uses a fixed schema and splits data across tables. For large datasets, the data becomes too big for a single server and needs to be scaled across multiple servers. The relational model does not handle this well, because queries that join multiple tables may need data that lives on different servers. NoSQL databases, by contrast, are generally schemaless and are designed from the start to shard seamlessly across servers. Several families of NoSQL databases achieve this in different ways (a hypothetical document example follows the list):

- column data stores, such as HBase;
- key-value stores, such as Redis;
- graph databases, such as Neo4j;
- document-oriented databases, such as MongoDB.
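To make the document-oriented flavour concrete, here is a hypothetical cached page expressed as a single MongoDB-style document: everything about the page lives in one record, with no schema to declare and no joins to run. The field names are illustrative only.

```python
# a hypothetical cached-page document
cached_page = {
    '_id': 'http://example.webscraping.com/view/China-47',  # the URL as the key
    'html': '<html>...</html>',
    'code': 200,
    'timestamp': '2017-01-17T21:20:46Z',
}
```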
3.2 Installing MongoDB
MongoDB can be downloaded from https://www.mongodb.org/downloads. Then install its Python wrapper library:
```
pip install pymongo
```

To check that the installation works, start a MongoDB server locally:
```
wu_being@ubuntukylin64:~/GitHub/WebScrapingWithPython/3.下載緩存$ mongod -dbpath MongoD
2017-01-17T21:20:46.224+0800 [initandlisten] MongoDB starting : pid=1978 port=27017 dbpath=MongoD 64-bit host=ubuntukylin64
2017-01-17T21:20:46.224+0800 [initandlisten] db version v2.6.10
2017-01-17T21:20:46.224+0800 [initandlisten] git version: nogitversion
2017-01-17T21:20:46.225+0800 [initandlisten] OpenSSL version: OpenSSL 1.0.2g  1 Mar 2016
2017-01-17T21:20:46.225+0800 [initandlisten] build info: Linux lgw01-12 3.19.0-25-generic #26~14.04.1-Ubuntu SMP Fri Jul 24 21:16:20 UTC 2015 x86_64 BOOST_LIB_VERSION=1_58
2017-01-17T21:20:46.225+0800 [initandlisten] allocator: tcmalloc
2017-01-17T21:20:46.225+0800 [initandlisten] options: { storage: { dbPath: "MongoD" } }
2017-01-17T21:20:46.269+0800 [initandlisten] journal dir=MongoD/journal
2017-01-17T21:20:46.270+0800 [initandlisten] recover : no journal files present, no recovery needed
2017-01-17T21:20:49.126+0800 [initandlisten] preallocateIsFaster=true 33.72
2017-01-17T21:20:51.932+0800 [initandlisten] preallocateIsFaster=true 32.7
2017-01-17T21:20:55.729+0800 [initandlisten] preallocateIsFaster=true 32.36
2017-01-17T21:20:55.730+0800 [initandlisten] preallocateIsFaster check took 9.459 secs
2017-01-17T21:20:55.730+0800 [initandlisten] preallocating a journal file MongoD/journal/prealloc.0
2017-01-17T21:20:58.042+0800 [initandlisten] File Preallocator Progress: 608174080/1073741824 56%
2017-01-17T21:21:03.290+0800 [initandlisten] File Preallocator Progress: 744488960/1073741824 69%
2017-01-17T21:21:08.043+0800 [initandlisten] File Preallocator Progress: 954204160/1073741824 88%
2017-01-17T21:21:18.347+0800 [initandlisten] preallocating a journal file MongoD/journal/prealloc.1
2017-01-17T21:21:21.166+0800 [initandlisten] File Preallocator Progress: 639631360/1073741824 59%
2017-01-17T21:21:26.328+0800 [initandlisten] File Preallocator Progress: 754974720/1073741824 70%
...
```

Then, from Python, try to connect to MongoDB on its default port:
```python
>>> from pymongo import MongoClient
>>> client = MongoClient('localhost', 27017)
```

3.3 MongoDB overview
Here is some example MongoDB code:
```python
>>> from pymongo import MongoClient
>>> client = MongoClient('localhost', 27017)
>>> url = 'http://www.baidu.com/view/China-47'
>>> html = '...<html>...'
>>> db = client.cache
>>> db.webpage.insert({'url': url, 'html': html})
ObjectId('587e2cb26b00c10b956e0be9')
>>> db.webpage.find_one({'url': url})
{u'url': u'http://www.baidu.com/view/China-47', u'_id': ObjectId('587e2cb26b00c10b956e0be9'), u'html': u'...<html>...'}
>>> db.webpage.find({'url': url})
<pymongo.cursor.Cursor object at 0x7fcde0ca60d0>
>>> db.webpage.find({'url': url}).count()
1
```

If we insert the same record again, MongoDB happily accepts and executes the operation, but a query shows that the existing record was not updated — we simply end up with a duplicate:
```python
>>> db.webpage.insert({'url': url, 'html': html})
ObjectId('587e2d546b00c10b956e0bea')
>>> db.webpage.find({'url': url}).count()
2
>>> db.webpage.find_one({'url': url})
{u'url': u'http://www.baidu.com/view/China-47', u'_id': ObjectId('587e2cb26b00c10b956e0be9'), u'html': u'...<html>...'}
```

To store only the latest record and avoid duplicates, we set the record ID to the URL and perform an upsert, which updates the record if it exists and inserts a new one otherwise:
```python
>>> new_html = '<...>...'
>>> db.webpage.update({'_id': url}, {'$set': {'html': new_html}}, upsert=True)
{'updatedExisting': True, u'nModified': 1, u'ok': 1, u'n': 1}
>>> db.webpage.find_one({'_id': url})
{u'_id': u'http://www.baidu.com/view/China-47', u'html': u'<...>...'}
>>> db.webpage.find({'_id': url}).count()
1
>>> db.webpage.update({'_id': url}, {'$set': {'html': new_html}}, upsert=True)
{'updatedExisting': True, u'nModified': 0, u'ok': 1, u'n': 1}
>>> db.webpage.find({'_id': url}).count()
1
>>>
```

MongoDB official documentation: http://docs.mongodb.org/manual/
3.4 MongoDB cache implementation
Now we are ready to build the MongoDB-backed cache, using the same interface as the earlier DiskCache class. In the constructor below we create an index on the timestamp field; with this convenient MongoDB feature, records are automatically expired and deleted once the given timestamp has passed.
```python
import pickle
from datetime import datetime, timedelta
from pymongo import MongoClient

class MongoCache:
    def __init__(self, client=None, expires=timedelta(days=30)):
        """
        client: mongo database client
        expires: timedelta of amount of time before a cache entry is considered expired
        """
        # if a client object is not passed
        # then try connecting to mongodb at the default localhost port
        self.client = MongoClient('localhost', 27017) if client is None else client
        # create collection to store cached webpages,
        # which is the equivalent of a table in a relational database
        self.db = self.client.cache
        self.db.webpage.create_index('timestamp',
            expireAfterSeconds=expires.total_seconds())

    def __getitem__(self, url):
        """Load value at this URL
        """
        record = self.db.webpage.find_one({'_id': url})
        if record:
            return record['result']
        else:
            raise KeyError(url + ' does not exist')

    def __setitem__(self, url, result):
        """Save value for this URL
        """
        record = {'result': result, 'timestamp': datetime.utcnow()}
        self.db.webpage.update({'_id': url}, {'$set': record}, upsert=True)
```

Let's test this MongoCache class. We use a default (zero-interval) timedelta() object, so records should be deleted as soon as they are created — yet they are not. This is down to how MongoDB works: a background task checks for expired records once every minute. If we wait another minute, we can see that the cache expiry has indeed taken effect.
```python
>>> from mongo_cache import MongoCache
>>> from datetime import timedelta
>>> cache = MongoCache(expires=timedelta())
>>> url = 'http://www.baidu.com/view/China-47'
>>> result = {'html': '.....'}
>>> cache[url] = result
>>> cache[url]
{'html': '.....'}
>>> cache[url]
{'html': '.....'}
>>> import time; time.sleep(60)
>>> cache[url]
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "mongo_cache.py", line 62, in __getitem__
    raise KeyError(url + ' does not exist')
KeyError: 'http://www.baidu.com/view/China-47 does not exist'
>>>
```

3.5 Compressed storage
To keep the cache small, we again serialize with pickle and compress with zlib, storing the result as a bson Binary, just as we did for the disk cache:

```python
import pickle
import zlib
from bson.binary import Binary

class MongoCache:

    def __getitem__(self, url):
        """Load value at this URL
        """
        record = self.db.webpage.find_one({'_id': url})
        if record:
            #return record['result']
            return pickle.loads(zlib.decompress(record['result']))
        else:
            raise KeyError(url + ' does not exist')

    def __setitem__(self, url, result):
        """Save value for this URL
        """
        #record = {'result': result, 'timestamp': datetime.utcnow()}
        record = {'result': Binary(zlib.compress(pickle.dumps(result))),
                  'timestamp': datetime.utcnow()}
        self.db.webpage.update({'_id': url}, {'$set': record}, upsert=True)
```

3.6 Testing the cache
Timing the MongoDB-backed crawler the same way as before:

```
wu_being@ubuntukylin64:~/GitHub/WebScrapingWithPython/3.下載緩存$ time python 3mongo_cache.py
Downloading: http://127.0.0.1:8000/places/
Downloading: http://127.0.0.1:8000/places/default/index/1
Downloading: http://127.0.0.1:8000/places/default/index/2
...
Downloading: http://127.0.0.1:8000/places/default/view/Algeria-4
Downloading: http://127.0.0.1:8000/places/default/view/Albania-3
Downloading: http://127.0.0.1:8000/places/default/view/Aland-Islands-2
Downloading: http://127.0.0.1:8000/places/default/view/Afghanistan-1

real    0m59.239s
user    0m1.164s
sys     0m0.108s
wu_being@ubuntukylin64:~/GitHub/WebScrapingWithPython/3.下載緩存$ time python 3mongo_cache.py

real    0m0.695s
user    0m0.408s
sys     0m0.044s
```

As you can see, reading from the database cache takes roughly twice as long as the disk cache, but it successfully avoids the disk cache's drawbacks.
3.7 Full MongoDB cache code
```python
try:
    import cPickle as pickle
except ImportError:
    import pickle
import zlib
from datetime import datetime, timedelta
from pymongo import MongoClient
from bson.binary import Binary
from link_crawler import link_crawler


class MongoCache:
    """ Wrapper around MongoDB to cache downloads

        >>> cache = MongoCache()
        >>> cache.clear()
        >>> url = 'http://example.webscraping.com'
        >>> result = {'html': '...'}
        >>> cache[url] = result
        >>> cache[url]['html'] == result['html']
        True
        >>> cache = MongoCache(expires=timedelta())
        >>> cache[url] = result
        >>> # every 60 seconds is purged http://docs.mongodb.org/manual/core/index-ttl/
        >>> import time; time.sleep(60)
        >>> cache[url]
        Traceback (most recent call last):
         ...
        KeyError: 'http://example.webscraping.com does not exist'
    """

    def __init__(self, client=None, expires=timedelta(days=30)):
        """
        client: mongo database client
        expires: timedelta of amount of time before a cache entry is considered expired
        """
        # if a client object is not passed
        # then try connecting to mongodb at the default localhost port
        self.client = MongoClient('localhost', 27017) if client is None else client
        # create collection to store cached webpages,
        # which is the equivalent of a table in a relational database
        self.db = self.client.cache
        # the TTL field is named 'timestamp100s' here (earlier sections used 'timestamp')
        self.db.webpage.create_index('timestamp100s',
            expireAfterSeconds=expires.total_seconds())

    def __contains__(self, url):
        try:
            self[url]
        except KeyError:
            return False
        else:
            return True

    def __getitem__(self, url):
        """Load value at this URL
        """
        record = self.db.webpage.find_one({'_id': url})
        if record:
            #return record['result']
            return pickle.loads(zlib.decompress(record['result']))
        else:
            raise KeyError(url + ' does not exist')

    def __setitem__(self, url, result):
        """Save value for this URL
        """
        #record = {'result': result, 'timestamp': datetime.utcnow()}
        record = {'result': Binary(zlib.compress(pickle.dumps(result))),
                  'timestamp100s': datetime.utcnow()}
        self.db.webpage.update({'_id': url}, {'$set': record}, upsert=True)

    def clear(self):
        self.db.webpage.drop()
        print 'drop() successful'


if __name__ == '__main__':
    #link_crawler('http://example.webscraping.com/', '/(index|view)', cache=MongoCache())
    #link_crawler('http://127.0.0.1:8000/places/', '/places/default/(index|view)/', cache=MongoCache())
    link_crawler('http://127.0.0.1:8000/places/', '/places/default/(index|view)/',
                 cache=MongoCache(expires=timedelta(seconds=100)))
```

Wu_Being's blog notice: you are welcome to repost this article, but please credit the original post and link. Thanks!
Python crawler series: 《【Python爬蟲3】在下載的本地緩存做爬蟲》 http://blog.csdn.net/u014134180/article/details/55506984
Code files for the Python crawler series on GitHub: https://github.com/1040003585/WebScrapingWithPython