国产探花免费观看_亚洲丰满少妇自慰呻吟_97日韩有码在线_资源在线日韩欧美_一区二区精品毛片,辰东完美世界有声小说,欢乐颂第一季,yy玄幻小说排行榜完本

首頁 > 編程 > Python > 正文

python實現博客文章爬蟲示例

2020-02-23 05:13:07
字體:
來源:轉載
供稿:網友

代碼如下:
#!/usr/bin/python
#-*-coding:utf-8-*-
# JCrawler
# Author: Jam <810441377@qq.com>

import time
import urllib2
from bs4 import BeautifulSoup

# 目標站點
TargetHost = "http://adirectory.blog.com"
# User Agent
UserAgent  = 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/33.0.1750.117 Safari/537.36'
# 鏈接采集規則
# 目錄鏈接采集規則
CategoryFind    = [{'findMode':'find','findTag':'div','rule':{'id':'cat-nav'}},
                   {'findMode':'findAll','findTag':'a','rule':{}}]
# 文章鏈接采集規則
ArticleListFind = [{'findMode':'find','findTag':'div','rule':{'id':'content'}},
                   {'findMode':'findAll','findTag':'h2','rule':{'class':'title'}},
                   {'findMode':'findAll','findTag':'a','rule':{}}]
# 分頁URL規則
PageUrl  = 'page/#page/'
PageStart = 1
PageStep  = 1
PageStopHtml = '404: Page Not Found'

def GetHtmlText(url):
    request  = urllib2.Request(url)
    request.add_header('Accept', "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp")
    request.add_header('Accept-Encoding', "*")
    request.add_header('User-Agent', UserAgent)
    return urllib2.urlopen(request).read()

def ArrToStr(varArr):
    returnStr = ""
    for s in varArr:
        returnStr += str(s)
    return returnStr


def GetHtmlFind(htmltext, findRule):
    findReturn = BeautifulSoup(htmltext)
    returnText = ""
    for f in findRule:
        if returnText != "":
            findReturn = BeautifulSoup(returnText)
        if f['findMode'] == 'find':
            findReturn = findReturn.find(f['findTag'], f['rule'])
        if f['findMode'] == 'findAll':
            findReturn = findReturn.findAll(f['findTag'], f['rule'])
        returnText = ArrToStr(findReturn)
    return findReturn

def GetCategory():
    categorys = [];
    htmltext = GetHtmlText(TargetHost)
    findReturn = GetHtmlFind(htmltext, CategoryFind)

    for tag in findReturn:
        print "[G]->Category:" + tag.string + "|Url:" + tag['href']

發表評論 共有條評論
用戶名: 密碼:
驗證碼: 匿名發表
主站蜘蛛池模板: 两当县| 秦安县| 新宾| 青田县| 历史| 清流县| 泸定县| 十堰市| 曲周县| 西峡县| 华宁县| 通州区| 嫩江县| 绥化市| 延庆县| 浏阳市| 石狮市| 水城县| 南康市| 张家港市| 杭州市| 黄冈市| 鄂伦春自治旗| 郴州市| 新乡市| 荆州市| 齐河县| 府谷县| 航空| 清原| 仁怀市| 祁东县| 合山市| 皋兰县| 临高县| 河间市| 昌邑市| 城固县| 水富县| 濉溪县| 五华县|