

A Python book-information crawler example


This post shares an example of a Python crawler for book information, for your reference; the details are as follows.

Background

We need to collect book information, using Douban book entries as the source, extract the useful fields and save them to a local database.

獲取書籍分類標簽

The full list of category tags can be found at this link:
https://book.douban.com/tag/?view=type

Then save these category tag links to a local file, one link per line, with content like the following:

https://book.douban.com/tag/小說
https://book.douban.com/tag/外國文學
https://book.douban.com/tag/文學
https://book.douban.com/tag/隨筆
https://book.douban.com/tag/中國文學
https://book.douban.com/tag/經典
https://book.douban.com/tag/日本文學
https://book.douban.com/tag/散文
https://book.douban.com/tag/村上春樹
https://book.douban.com/tag/詩歌
https://book.douban.com/tag/童話
......
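The article does not show how this tag file is produced. Below is a minimal sketch of one way to build book_tags.txt; it assumes that on the tag-overview page every book tag appears as an anchor whose href starts with "/tag/", which may need adjusting against the actual page structure.

#!/usr/bin/python
# coding: utf-8
import requests
from bs4 import BeautifulSoup

def collect_tags(outfile="book_tags.txt"):
    # fetch the tag overview page and pull out every /tag/... link
    resp = requests.get("https://book.douban.com/tag/?view=type", timeout=10)
    soup = BeautifulSoup(resp.text, "lxml")
    links = []
    for a in soup.find_all("a", href=True):
        href = a["href"]
        if href.startswith("/tag/") and href != "/tag/":
            links.append("https://book.douban.com" + href)
    # one tag URL per line, as expected by the crawler's main() loop
    with open(outfile, "w") as fd:
        fd.write("\n".join(links))

if __name__ == "__main__":
    collect_tags()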

Fetching the book information and saving it to the local database

Assume the MySQL table has already been created, as follows:

CREATE TABLE `book_info` (
  `id` int(11) NOT NULL AUTO_INCREMENT,
  `bookid` varchar(64) NOT NULL COMMENT 'book ID',
  `tag` varchar(32) DEFAULT '' COMMENT 'category tag',
  `bookname` varchar(256) NOT NULL COMMENT 'book title',
  `subname` varchar(256) NOT NULL COMMENT 'subtitle',
  `author` varchar(256) DEFAULT '' COMMENT 'author',
  `translator` varchar(256) DEFAULT '' COMMENT 'translator',
  `press` varchar(128) DEFAULT '' COMMENT 'publisher',
  `publishAt` date DEFAULT '0000-00-00' COMMENT 'publication date',
  `stars` float DEFAULT '0' COMMENT 'rating',
  `price_str` varchar(32) DEFAULT '' COMMENT 'price string',
  `hotcnt` int(11) DEFAULT '0' COMMENT 'number of ratings',
  `bookdesc` varchar(8192) DEFAULT NULL COMMENT 'description',
  `updateAt` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP COMMENT 'last modified',
  PRIMARY KEY (`id`),
  UNIQUE KEY `idx_bookid` (`bookid`),
  KEY `idx_bookname` (`bookname`),
  KEY `hotcnt` (`hotcnt`),
  KEY `stars` (`stars`),
  KEY `idx_tag` (`tag`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8 COMMENT='book information';

The crawler logic has already been implemented, mainly using the BeautifulSoup package, as follows:

#!/usr/bin/python
# coding: utf-8
import re
import logging
import requests
import pymysql
import random
import time
import datetime
from hashlib import md5
from bs4 import BeautifulSoup

logging.basicConfig(level=logging.INFO,
    format='[%(levelname)s][%(name)s][%(asctime)s]%(message)s',
    datefmt='%Y-%m-%d %H:%M:%S')

class DestDB:
    Host = "192.168.1.10"
    DB = "spider"
    Table = "book_info"
    User = "test"
    Pwd = "123456"

def connect_db(host, db, user, pwd):
    conn = pymysql.connect(
        host=host,
        user=user,
        passwd=pwd,
        db=db,
        charset='utf8',
        connect_timeout=3600)
    # cursorclass=pymysql.cursors.DictCursor
    conn.autocommit(True)
    return conn

def disconnect_db(conn, cursor):
    cursor.close()
    conn.close()

# Extract the number of ratings; treat fewer than 10 raters as 10
def hotratings(person):
    try:
        ptext = person.get_text().split()[0]
        pc = int(ptext[1:len(ptext)-4])
    except ValueError:
        pc = int(10)
    return pc

# Persist the results to the database
def save_to_db(tag, book_reslist):
    dest_conn = connect_db(DestDB.Host, DestDB.DB, DestDB.User, DestDB.Pwd)
    dest_cursor = dest_conn.cursor()

    isql = "insert ignore into book_info "
    isql += "(`bookid`,`tag`,`author`,`translator`,`bookname`,`subname`,`press`,"
    isql += "`publishAt`,`price_str`,`stars`,`hotcnt`,`bookdesc`) values "
    isql += ",".join(["(%s)" % ",".join(['%s']*12)]*len(book_reslist))

    values = []
    for row in book_reslist:
        # For now, use md5(bookname+author) as the unique bookid
        bookid = md5(("%s_%s" % (row[0], row[2])).encode('utf-8')).hexdigest()
        values.extend([bookid, tag] + row[:10])

    dest_cursor.execute(isql, tuple(values))
    disconnect_db(dest_conn, dest_cursor)

# Parse one result page
def do_parse(tag, url):
    page_data = requests.get(url)
    soup = BeautifulSoup(page_data.text.encode("utf-8"), "lxml")
    # Extract the tag from the URL
    tag = url.split("?")[0].split("/")[-1]
    # Author / publisher information
    details = soup.select("#subject_list > ul > li > div.info > div.pub")
    # Ratings
    scores = soup.select("#subject_list > ul > li > div.info > div.star.clearfix > span.rating_nums")
    # Number of raters
    persons = soup.select("#subject_list > ul > li > div.info > div.star.clearfix > span.pl")
    # Book titles
    booknames = soup.select("#subject_list > ul > li > div.info > h2 > a")
    # Descriptions
    descs = soup.select("#subject_list > ul > li > div.info > p")

    # Pull the fields out of the selected elements
    book_reslist = []
    for detail, score, personCnt, bookname, desc in zip(details, scores, persons, booknames, descs):
        try:
            subtitle = ""
            title_strs = [s.replace('\n', '').strip() for s in bookname.strings]
            title_strs = [s for s in title_strs if s]
            # Some books have a subtitle
            if not title_strs:
                continue
            elif len(title_strs) >= 2:
                bookname, subtitle = title_strs[:2]
            else:
                bookname = title_strs[0]
            # Number of raters
            hotcnt = hotratings(personCnt)
            desc = desc.get_text()
            stars = float('%.1f' % float(score.get_text() if score.get_text() else "-1"))

            author, translator, press, publishAt, price = [""]*5
            detail_texts = detail.get_text().replace('\n', '').split("/")
            detail_texts = [s.strip() for s in detail_texts]
            # Some books have no translator
            if len(detail_texts) == 4:
                author, press, publishAt, price = detail_texts[:4]
            elif len(detail_texts) >= 5:
                author, translator, press, publishAt, price = detail_texts[:5]
            else:
                continue

            # Convert the publication date to a date object
            if re.match(r'^[\d]{4}-[\d]{1,2}', publishAt):
                dts = publishAt.split('-')
                publishAt = datetime.date(int(dts[0]), int(dts[1]), 1)
            else:
                publishAt = datetime.date(1000, 1, 1)

            book_reslist.append([author, translator, bookname, subtitle, press,
                                 publishAt, price, stars, hotcnt, desc])
        except Exception as e:
            logging.error(e)

    logging.info("insert count: %d" % len(book_reslist))
    if len(book_reslist) > 0:
        save_to_db(tag, book_reslist)
        book_reslist = []
    return len(details)

def main():
    with open("book_tags.txt") as fd:
        tags = fd.readlines()
        for tag in tags:
            tag = tag.strip()
            logging.info("current tag url: %s" % tag)
            for idx in range(0, 1000000, 20):
                try:
                    url = "%s?start=%d&type=T" % (tag.strip(), idx)
                    cnt = do_parse(tag.split('/')[-1], url)
                    if cnt < 10:
                        break
                    # Sleep a few seconds to keep the request rate down
                    time.sleep(random.randint(10, 15))
                except Exception as e:
                    logging.warning("outer_err: %s" % e)
            time.sleep(300)

if __name__ == "__main__":
    main()
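Once some rows have been inserted, the stored data can be read back with a plain query. The following is a minimal sketch, reusing the DestDB connection details from the script above and an arbitrary threshold of 100 ratings (both are assumptions, not part of the original article):

#!/usr/bin/python
# coding: utf-8
import pymysql

# query the ten highest-rated books with at least 100 ratings
conn = pymysql.connect(host="192.168.1.10", user="test", passwd="123456",
                       db="spider", charset="utf8")
cursor = conn.cursor()
cursor.execute(
    "SELECT bookname, author, stars, hotcnt FROM book_info "
    "WHERE hotcnt >= %s ORDER BY stars DESC LIMIT 10", (100,))
for bookname, author, stars, hotcnt in cursor.fetchall():
    print("%s / %s: %.1f (%d ratings)" % (bookname, author, stars, hotcnt))
cursor.close()
conn.close()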

Summary

The code above runs under Python 3.
BeautifulSoup needs to be installed first: pip install beautifulsoup4 (the script also uses requests, pymysql and the lxml parser).
Keep the request rate under control while crawling; a throttling sketch follows this list.
Some fields need exception handling, such as the translator information and the rating count.
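As one way to harden the rate control and error handling, here is a minimal sketch of a fetch helper with a User-Agent header, retries and back-off. It is not part of the original script; the header value and retry counts are illustrative assumptions.

#!/usr/bin/python
# coding: utf-8
import random
import time
import requests

def polite_get(url, retries=3):
    # identify the client and retry transient failures with growing pauses
    headers = {"User-Agent": "Mozilla/5.0 (compatible; book-spider)"}
    for attempt in range(retries):
        try:
            resp = requests.get(url, headers=headers, timeout=10)
            resp.raise_for_status()
            return resp
        except requests.RequestException:
            if attempt == retries - 1:
                raise
            # back off a little longer after each failed attempt
            time.sleep(random.randint(10, 15) * (attempt + 1))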

That concludes this article; I hope it is helpful for your learning.

