

A Summary of Several Ways to Parse HTML in Python 3

2020-02-16 01:11:03
Source: reposted
Contributed by: a site user

Parsing HTML is an important data-processing step that follows crawling. This article records several ways to do it.

First, a basic helper function that fetches the HTML, hands it to a parser, and prints the parsed result.

    # The parser function is passed in as an argument so it is easy to swap below.
    # bs4_paraser and lxml_parser are defined in the sections that follow and
    # must be defined before this code runs.
    import gzip
    import urllib.request


    def get_html(url, paraser=bs4_paraser):
        headers = {
            'Accept': '*/*',
            'Accept-Encoding': 'gzip, deflate, sdch',
            'Accept-Language': 'zh-CN,zh;q=0.8',
            'Host': 'www.360kan.com',
            'Proxy-Connection': 'keep-alive',
            'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/52.0.2743.116 Safari/537.36'
        }
        request = urllib.request.Request(url, headers=headers)
        response = urllib.request.urlopen(request)
        if response.status != 200:
            return None
        data = response.read()
        # Decompress only when the server actually sent a gzip body.
        if response.headers.get('Content-Encoding') == 'gzip':
            data = gzip.decompress(data)
        # For offline testing, read a saved copy instead:
        # open('E:/h5/haPkY0osd0r5UB.html').read()
        return paraser(data)


    value = get_html('http://www.360kan.com/m/haPkY0osd0r5UB.html', paraser=lxml_parser)
    for row in value:
        print(row)
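Because the request headers advertise gzip support, the body may or may not arrive compressed. The following is a minimal sketch of handling both cases with the standard library only. `decode_body` and its `content_encoding` parameter are hypothetical names that do not appear in the original code, and the sample bytes are made up for illustration.

```python
import gzip


def decode_body(raw, content_encoding=None):
    """Return the body bytes, decompressing them if they are gzip-encoded.

    `raw` is the bytes read from the response; `content_encoding` is the
    value of the Content-Encoding response header (hypothetical parameter).
    """
    if content_encoding and content_encoding.lower() == 'gzip':
        return gzip.decompress(raw)
    return raw


# Illustrative data only: simulate a compressed and an uncompressed response.
sample = b'<div class="item">review</div>'
compressed = gzip.compress(sample)

print(decode_body(compressed, 'gzip') == sample)  # True
print(decode_body(sample) == sample)              # True
```

Checking the header first avoids the error that `gzip` raises when it is handed a body that was never compressed.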

1. Parsing with lxml

The lxml XML toolkit is a Pythonic binding for the C libraries libxml2 and libxslt. It is unique in that it combines the speed and XML feature completeness of these libraries with the simplicity of a native Python API, mostly compatible but superior to the well-known ElementTree API. The latest release works with all CPython versions from 2.6 to 3.5. See the introduction for more information about background and goals of the lxml project. Some common questions are answered in the FAQ. [Official site](http://lxml.de/)
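To get a feel for the API before applying it to a real page, here is a minimal sketch that runs an XPath query over an inline fragment. The HTML fragment and its URLs are made up for illustration; only `etree.HTML` and `xpath` come from lxml itself.

```python
from lxml import etree

# Made-up HTML fragment, used only for illustration.
html = '''
<div class="yingping-list-wrap">
  <div class="item"><a href="/review/1">First review</a></div>
  <div class="item"><a href="/review/2">Second review</a></div>
</div>
'''

doc = etree.HTML(html)
# Select every "item" div, then read the link text and href relative to it.
items = doc.xpath('//div[@class="item"]')
links = [(item.xpath('./a/text()')[0], item.xpath('./a/@href')[0]) for item in items]
print(links)  # [('First review', '/review/1'), ('Second review', '/review/2')]
```

Note that `xpath` always returns a list, which is why each result is indexed with `[0]`.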

    import re

    from lxml import etree


    def lxml_parser(page):
        data = []
        doc = etree.HTML(page)
        all_div = doc.xpath('//div[@class="yingping-list-wrap"]')
        for row in all_div:
            # Get each review, i.e. each review "item".
            # BeautifulSoup equivalent: find_all('div', attrs={'class': 'item'})
            all_div_item = row.xpath('.//div[@class="item"]')
            for r in all_div_item:
                value = {}
                # Title block of the review
                title = r.xpath('.//div[@class="g-clear title-wrap"][1]')[0]
                value['title'] = title.xpath('./a/text()')[0]
                value['title_href'] = title.xpath('./a/@href')[0]
                # The score is encoded as a number inside the style attribute.
                score_text = title.xpath('./div/span/span/@style')[0]
                score_text = re.search(r'\d+', score_text).group()
                value['score'] = int(score_text) / 20
                # Time
                value['time'] = title.xpath('./div/span[@class="time"]/text()')[0]
                # Number of people who liked the review
                value['people'] = int(
                    re.search(r'\d+', title.xpath('./div[@class="num"]/span/text()')[0]).group())
                data.append(value)
        return data
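The score and like-count steps both pull the first run of digits out of a string with a regular expression; the pattern needs a backslash (`\d+`), not a forward slash. Here is a small sketch of that step in isolation. `first_int` is a hypothetical helper, and the sample strings are assumptions about what the page contains.

```python
import re


def first_int(text):
    """Return the first run of digits in `text` as an int, or None if absent."""
    match = re.search(r'\d+', text)
    return int(match.group()) if match else None


# Assumed sample values; the real page content may differ.
width = first_int('width:80%')
print(width)        # 80
print(width / 20)   # 4.0 (true division in Python 3)
print(width // 20)  # 4 (floor division, which is what `/` did for ints in Python 2)
print(first_int('no digits here'))  # None
```

Returning `None` instead of calling `.group()` directly avoids an `AttributeError` when an item has no digits in the expected place.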

2. Using BeautifulSoup. It is widely documented elsewhere, so I will not go into detail here.

    import re

    from bs4 import BeautifulSoup


    def bs4_paraser(html):
        all_value = []
        soup = BeautifulSoup(html, 'html.parser')
        # Get the review section
        all_div = soup.find_all('div', attrs={'class': 'yingping-list-wrap'}, limit=1)
        for row in all_div:
            # Get each review, i.e. each review "item"
            all_div_item = row.find_all('div', attrs={'class': 'item'})
            for r in all_div_item:
                value = {}
                # Title block of the review
                title = r.find_all('div', attrs={'class': 'g-clear title-wrap'}, limit=1)
                if title:
                    value['title'] = title[0].a.string
                    value['title_href'] = title[0].a['href']
                    score_text = title[0].div.span.span['style']
                    score_text = re.search(r'\d+', score_text).group()
                    value['score'] = int(score_text) / 20
                    # Time
                    value['time'] = title[0].div.find_all('span', attrs={'class': 'time'})[0].string
                    # Number of people who liked the review
                    value['people'] = int(
                        re.search(r'\d+', title[0].find_all('div', attrs={'class': 'num'})[0].span.string).group())
                all_value.append(value)
        return all_value
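The same lookups can be tried on an inline fragment without fetching anything. The sketch below is a reduced version of the approach above; the HTML fragment, URL, and date are made up for illustration and are not taken from the real page.

```python
from bs4 import BeautifulSoup

# Made-up HTML fragment, used only for illustration.
html = '''
<div class="yingping-list-wrap">
  <div class="item">
    <div class="g-clear title-wrap">
      <a href="/review/1">First review</a>
      <div><span class="time">2016-08-01</span></div>
    </div>
  </div>
</div>
'''

soup = BeautifulSoup(html, 'html.parser')
wrap = soup.find('div', attrs={'class': 'yingping-list-wrap'})
rows = []
for item in wrap.find_all('div', attrs={'class': 'item'}):
    # A multi-word class value matches the exact string of the class attribute.
    title = item.find('div', attrs={'class': 'g-clear title-wrap'})
    rows.append({
        'title': title.a.string,
        'title_href': title.a['href'],
        'time': title.find('span', attrs={'class': 'time'}).string,
    })
print(rows)
```

`find` returns the first match or `None`, which makes it a convenient alternative to `find_all(..., limit=1)` followed by indexing.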