
A Web Crawler Implemented in Python

2019-11-25 16:04:11
Source: reposted
Contributed by: a site user

This article walks through an example web crawler implemented in Python, shared here for reference. The details are as follows:

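The crawler mainly relies on the urllib2 and BeautifulSoup modules, with re for pattern matching and MySQLdb for storing the results. The listing is written for Python 2.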
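For readers on Python 3, where urllib2 no longer exists, the request-building step maps onto urllib.request. The following is a minimal sketch under that assumption; the helper name build_request and the shortened User-Agent default are illustrative and not part of the original code. It only constructs the request and does not touch the network.

```python
from urllib.request import Request


def build_request(url, user_agent='Mozilla/5.0'):
    """Build a GET request that carries a browser-like User-Agent header.

    `build_request` and the default `user_agent` are hypothetical names
    chosen for this sketch.
    """
    return Request(url=url, headers={'User-Agent': user_agent})


req = build_request('http://www.whowhatwear.com/section/fashion-trends/page/1')
# Fetching the page would then be: urllib.request.urlopen(req).read()
```

Keeping request construction separate from the actual fetch makes the header logic easy to check without network access.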

    # encoding=utf-8
    import re
    import urllib2
    import datetime
    import MySQLdb
    from bs4 import BeautifulSoup
    import sys

    # Python 2 only: make utf-8 the default codec for implicit str/unicode conversion.
    reload(sys)
    sys.setdefaultencoding("utf-8")


    class Splider(object):
        def __init__(self):
            print u'Starting to crawl...'

        # Fetch the HTML source of a page.
        def getsource(self, url):
            headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2652.0 Safari/537.36'}
            req = urllib2.Request(url=url, headers=headers)
            socket = urllib2.urlopen(req)
            content = socket.read()
            socket.close()
            return content

        # Build the links for every page from the one in `url` up to total_page.
        def changepage(self, url, total_page):
            now_page = int(re.search(r'page/(\d+)', url, re.S).group(1))
            page_group = []
            for i in range(now_page, total_page + 1):
                # The fourth positional argument of re.sub is count, so no flag is passed here.
                link = re.sub(r'page/(\d+)', 'page/%d' % i, url)
                page_group.append(link)
            return page_group

        # Fetch the text and image URLs of a child (article) page.
        def getchildrencon(self, child_url):
            conobj = {}
            content = self.getsource(child_url)
            soup = BeautifulSoup(content, 'html.parser', from_encoding='utf-8')
            content = soup.find('div', {'class': 'c-article_content'})
            img = re.findall('src="(.*?)"', str(content), re.S)
            conobj['con'] = content.get_text()
            conobj['img'] = ';'.join(img)
            return conobj

        # Extract every article entry from a listing page.
        def getcontent(self, html_doc):
            soup = BeautifulSoup(html_doc, 'html.parser', from_encoding='utf-8')
            tag = soup.find_all('div', {'class': 'promo-feed-headline'})
            info = {}
            i = 0
            for link in tag:
                info[i] = {}
                title_desc = link.find('h3')
                info[i]['title'] = title_desc.get_text()
                post_date = link.find('div', {'class': 'post-date'})
                pos_d = post_date['data-date'][0:10]
                info[i]['content_time'] = pos_d
                info[i]['source'] = 'whowhatwear'
                source_link = link.find('a', href=re.compile(r"section=fashion-trends"))
                source_url = 'http://www.whowhatwear.com' + source_link['href']
                info[i]['source_url'] = source_url
                # Follow the link and parse the article page itself.
                in_content = self.getsource(source_url)
                in_soup = BeautifulSoup(in_content, 'html.parser', from_encoding='utf-8')
                soup_content = in_soup.find('section', {'class': 'widgets-list-content'})
                info[i]['content'] = soup_content.get_text().strip('\n')
                text_con = in_soup.find('section', {'class': 'text'})
                # Fall back to an empty summary when the section is missing.
                summary = text_con.get_text().strip('\n') if text_con is not None else ''
                info[i]['summary'] = summary[0:200] + '...'
                img_list = re.findall('src="(.*?)"', str(soup_content), re.S)
                info[i]['imgs'] = ';'.join(img_list)
                info[i]['create_time'] = datetime.datetime.now().strftime("%Y-%m-%d %H:%M:%S")
                i += 1
            return info

        # Store the collected entries in MySQL.
        def saveinfo(self, content_info):
            conn = MySQLdb.Connect(host='127.0.0.1', user='root', passwd='123456',
                                   port=3306, db='test', charset='utf8')
            cursor = conn.cursor()
            # Placeholders let the driver escape every value, not only the text fields.
            sql = ("insert into t_fashion_spider2(`title`,`summary`,`content`,`content_time`,"
                   "`imgs`,`source`,`source_url`,`create_time`) "
                   "values (%s,%s,%s,%s,%s,%s,%s,%s)")
            for each in content_info:
                for k, v in each.items():
                    cursor.execute(sql, (v['title'], v['summary'], v['content'],
                                         v['content_time'], v['imgs'], v['source'],
                                         v['source_url'], v['create_time']))
            conn.commit()
            cursor.close()
            conn.close()


    if __name__ == '__main__':
        classinfo = []
        p_num = 5
        url = 'http://www.whowhatwear.com/section/fashion-trends/page/1'
        jikesplider = Splider()
        all_links = jikesplider.changepage(url, p_num)
        for link in all_links:
            print u'Processing page: ' + link
            html = jikesplider.getsource(link)
            info = jikesplider.getcontent(html)
            classinfo.append(info)
        # Save once, after all pages are collected, so no page is inserted twice.
        jikesplider.saveinfo(classinfo)
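The page-link generation done by changepage can be tried on its own, without any network access. The following is a minimal sketch, assuming the page number appears in the URL as page/<n>; the function name page_links is hypothetical. Note that the pattern needs a backslash (\d+) to match digits, and that the fourth positional argument of re.sub is count, so a flag such as re.S should not be passed in that position.

```python
import re


def page_links(url, total_page):
    """Return the URLs from the page number found in `url` up to `total_page`.

    `page_links` is an illustrative name, not part of the original listing.
    """
    match = re.search(r'page/(\d+)', url)
    if match is None:
        raise ValueError('no page/<number> segment in URL: %s' % url)
    first_page = int(match.group(1))
    # Swap the page number in the URL for each page in the requested range.
    return [re.sub(r'page/\d+', 'page/%d' % page, url)
            for page in range(first_page, total_page + 1)]


links = page_links('http://www.whowhatwear.com/section/fashion-trends/page/1', 3)
```

Raising an error when the URL has no page segment avoids the AttributeError that calling .group(1) on a failed match would otherwise produce.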


I hope this article is helpful for your Python programming.
