Python 極客學院 (Jikexueyuan) Crawler V1

2019-11-14 17:01:31

Source: reposted

Contributor: a site user

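A targeted crawler for 極客學院 (Jikexueyuan) videos. Originally only annual-fee VIP members could download them; analysis showed that a free-trial VIP account is enough to crawl every video.
Basic technologies involved: Python, XPath, regular expressions, COM.
Python calls the Xunlei (Thunder) COM component to create folders automatically and to add download tasks in batches. Xunlei and its component must be installed successfully first.
Approach: crawl all tags from the /path/ page -> walk the search pages that list every course under each tag -> read the lesson list from each course page -> extract the video address with a regular expression.
極客學院 keeps changing its site, so you may need to adapt the code yourself.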
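The four stages of the approach can be sketched as nested loops. This is only an outline under stated assumptions: `crawl` and its four callable parameters are hypothetical names introduced here, and the stub data in the usage lines is made up so that the sketch runs without any network access.

```python
def crawl(list_tags, list_course_pages, list_lesson_pages, find_video_url):
    """Walk tags -> course pages -> lesson pages -> video URLs.

    Every argument is a callable supplied by the caller (hypothetical
    interface), so each stage can be swapped or stubbed independently.
    """
    for tag in list_tags():
        for course_url in list_course_pages(tag):
            for lesson_url in list_lesson_pages(course_url):
                video_url = find_video_url(lesson_url)
                if video_url:  # skip lessons without a video address
                    yield tag, course_url, video_url


# Stub stages with made-up data, standing in for the real HTTP requests.
demo = list(crawl(
    lambda: ["python"],
    lambda tag: ["course/1.html"],
    lambda course_url: ["lesson/1_1.html", "lesson/1_2.html"],
    lambda lesson_url: None if lesson_url.endswith("_2.html") else "video/1_1.mp4",
))
print(demo)
```

Passing the stages in as callables keeps the traversal order visible in one place, which helps when the site layout changes and only one stage has to be rewritten.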

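import os
import re
import sys
import time

import requests
from lxml import etree

# Python 2 only: force UTF-8 as the default encoding. Python 3 does not need this.
if sys.version_info[0] == 2:
    reload(sys)
    sys.setdefaultencoding("utf-8")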

# base_url = "http://www.jikexueyuan.com/search/s/q_"

# base_path = "f:/jike/"

# headers: you have to capture the Cookie yourself; without it only the free courses can be fetched

headers = {
    "Host": "www.jikexueyuan.com",
    "User-Agent": "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:39.0) Gecko/20100101 Firefox/39.0",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "zh-CN,zh;q=0.8,en-US;q=0.5,en;q=0.3",
    "Accept-Encoding": "gzip, deflate",
    "Cookie": "ga=GA1.2.1700377703.1438173034; Hmlvtf3c68d41bda15331608595c98e9c3915=1438173034; MECHATLVTime=1438179151498; MECHATCKID=cookieVal=006600143817303272961295; statssid=1438985023415; statuuid=1438173038588973692017; connect.sid=s%3AWt8IWWxkVZ6zlhop7HpbG-vtXqtwIAs.QC1tYy4qV1bHOMDN0UTUfScLKFncl4NY5zAk1SS17Kw; QINGCLOUDELB=37e16e60f0cd051b754b0acf9bdfd4b5d562b81daa2a899c46d3a1e304c7eb2b|VbjfT|VbjfT; Hmlpvtf3c68d41bda15331608595c98e9c3915=1438179151; statisNew=0; statfromWebUrl=; gat=1; uname=jike76; uid=2992598; code=SMapFI; authcode=d572TzIvHFXNIVNXcNf4vI5lv1tQlyEknAG4m0mDQmvMRPa4VhDOtJXOSfO%2BeVFVPzra8M1sEkEzxqLX9qRgS6nWhd5VMobbDpeqvJ726i54TqMoDo81P4OlhQ",
    "Connection": "keep-alive",
}
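Because the Cookie has to be captured by each user, one option is to keep it out of the script and supply it at run time. The sketch below is not part of the original code: `build_headers` and the environment variable name `JIKE_COOKIE` are hypothetical, and only a subset of the header fields is shown.

```python
import os


def build_headers(cookie=None):
    """Return request headers; the Cookie value is supplied at run time.

    JIKE_COOKIE is a hypothetical environment variable name used only
    for this illustration.
    """
    if cookie is None:
        cookie = os.environ.get("JIKE_COOKIE", "")
    result = {
        "Host": "www.jikexueyuan.com",
        "User-Agent": "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:39.0) Gecko/20100101 Firefox/39.0",
        "Connection": "keep-alive",
    }
    if cookie:
        # Without a Cookie only the free courses would be reachable.
        result["Cookie"] = cookie
    return result


print(sorted(build_headers("uname=demo")))
```

Calling `build_headers()` with no argument reads the variable from the environment, so the session value never has to be pasted into the source file.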

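class jike_auto_down:
    base_path = ""
    base_url = ""
    course_tag = ""
    course_id = ""

    def __init__(self, base_path, base_url):
        if base_path and base_url:
            self.base_path = base_path
            self.base_url = base_url
            # Constructing the object starts the crawl immediately
            self.get_tags()
        else:
            pass  # the body of this branch is cut off in the source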

`get_tags: fetch all tags`

    def get_tags(self):
        url = "http://www.jikexueyuan.com/path/"
        tag_html = requests.get(url).text
        tag_etree = etree.HTML(tag_html)
        # Keep only the last path segment of every tag link
        tag_lists = [str(tag).rstrip("/").rsplit("/", 1)[-1] for tag in
                     tag_etree.xpath('/html/body/div[1]/div[4]/div/div[3]/div/a/@href') if tag]
        for tag in tag_lists:
            print(tag)
            self.course_tag = tag
            self.get_total_page(tag)
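The list comprehension in `get_tags` reduces each link to its last path segment. A minimal sketch of that string step on its own follows; the helper names and the sample links are assumptions made for illustration, not values taken from the site.

```python
def tag_from_href(href):
    """Return the last path segment of a link, ignoring a trailing slash."""
    return href.rstrip("/").rsplit("/", 1)[-1]


def tags_from_hrefs(hrefs):
    """Apply tag_from_href to every non-empty link."""
    return [tag_from_href(href) for href in hrefs if href]


# Sample links (assumed shape of the hrefs found on the /path/ page).
sample = ["http://www.jikexueyuan.com/path/python/", "", "/path/web"]
print(tags_from_hrefs(sample))
```

Using `rsplit("/", 1)[-1]` instead of `rindex` means a link without any slash is returned unchanged rather than raising `ValueError`.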

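`get_total_page: fetch every result page of a tag. The pagination is generated by JavaScript and is hard to scrape directly, so the pages are simply brute-forced.`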

    def get_total_page(self, tag):
        if not tag:
            return
        # Probe pages 1..49 one by one
        for page in range(1, 50):
            page_url = self.base_url + tag + "?pageNum=%d" % page
            page_html = requests.get(page_url, headers=headers).text
            # "userMenu" only appears when logged in; if it is missing, the cookie is invalid
            if re.search(r"userMenu", page_html) is None:
                print("please check the cookies")
                return
            # "no-search" marks an empty result page, so the previous page was the last one
            if re.search(r"no-search", page_html):
                print("tag %s: the last page is %d" % (tag, page - 1))
                break
            self.get_course_pages(page_url)
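The brute-force paging can be tried without touching the network by passing the page fetcher in as a parameter. This is a sketch under assumptions: `probe_pages`, its `fetch` parameter, and the fake pages below are hypothetical; only the `userMenu` and `no-search` markers and the `?pageNum=` query come from the code above.

```python
def probe_pages(fetch, base_url, tag, max_pages=50):
    """Return the result-page URLs of a tag, stopping at the first empty page.

    fetch is any callable that maps a URL to the page's HTML text.
    """
    found = []
    for page in range(1, max_pages):
        page_url = "%s%s?pageNum=%d" % (base_url, tag, page)
        html = fetch(page_url)
        if "userMenu" not in html:
            raise RuntimeError("not logged in: please check the cookies")
        if "no-search" in html:
            break  # ran past the last page
        found.append(page_url)
    return found


def fake_fetch(url):
    """Pretend the tag has two result pages (made-up HTML)."""
    page = int(url.rsplit("=", 1)[-1])
    body = '<div class="no-search"></div>' if page > 2 else "<ul><li>course</li></ul>"
    return '<div class="userMenu"></div>' + body


print(probe_pages(fake_fetch, "http://example.com/q_", "python"))
```

Raising an exception for a missing login marker, rather than printing and returning, lets the caller tell "no results" apart from "bad cookie".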

`get_course_pages: fetch the detail page of each course`

    def get_course_pages(self, tag_url):
        if tag_url:
            print("the tag_url:%s " % tag_url)
            course_page_lists = self.get_xpath_lists(tag_url, headers,
                                                     '//*[@id="changeid"]/ul/li/div/div[2]/h5/a/@href')
            if course_page_lists:
                for course_page_url in course_page_lists:
                    self.get_down_urls(course_page_url)

`get_down_urls: extract the video download address with a regular expression`

    def get_down_urls(self, course_page_url):
        if not course_page_url:
            return
        # The course id is the file name of the course page without its extension
        self.course_id = course_page_url[course_page_url.rindex("/") + 1:course_page_url.rindex(".")]
        print("             course_id:%s %s" % (self.course_id, course_page_url))
        course_down_lists = self.get_xpath_lists(course_page_url, headers,
                                                 '//*[@class="video-list"]/div[2]/ul/li/div/h2/a/@href')
        if not course_down_lists:
            return
        for course_down_url in course_down_lists:
            course_down_html = requests.get(course_down_url, headers=headers).text
            # The video address is the src attribute of the <source> tag
            course_down = re.findall(r'source src="(.*?)"', course_down_html, re.S)
            if course_down:
                print("                     %s" % course_down[0])
                if self.addTasktoXunlei(course_down[0]):
                    print("                     added successfully!")
                    time.sleep(5)
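The two string operations in `get_down_urls` can be checked in isolation. In this sketch the helper names, the sample course URL, and the sample HTML snippet are assumptions; the regular expression and the slicing logic are the ones used above, and the video address is the example quoted in the `addTasktoXunlei` comment.

```python
import re

SOURCE_RE = re.compile(r'source src="(.*?)"', re.S)


def course_id_from_url(course_page_url):
    """Return the page's file name without its extension."""
    start = course_page_url.rindex("/") + 1
    end = course_page_url.rindex(".")
    return course_page_url[start:end]


def first_video_url(html):
    """Return the first <source src="..."> address, or None if there is none."""
    matches = SOURCE_RE.findall(html)
    return matches[0] if matches else None


# Assumed course URL and lesson-page fragment, for illustration only.
page = ('<video><source src="http://cv3.jikexueyuan.com/201508011650/'
        'a396d5f2b9a19e8438da3ea888e4cc73/python/course_776/01/video/'
        'c776b_01_h264_sd_960_540.mp4" type="video/mp4"></video>')
print(course_id_from_url("http://www.jikexueyuan.com/course/776.html"))
print(first_video_url(page))
```

The non-greedy `(.*?)` stops at the first closing quote, so a following attribute such as `type` is not swallowed into the match.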

`get_file_lists: create the download folders`

    def get_file_lists(self, course_tag, course_id):
        # Target folder: <base_path>/<course_tag>/<course_id>
        course_path = os.path.join(self.base_path, course_tag, str(course_id))
        if not os.path.exists(course_path):
            try:
                # makedirs also creates base_path and the tag folder when they are missing
                os.makedirs(course_path)
            except OSError as e:
                print("failed to create the directory, the error is %s" % e)
                return
        return course_path
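The folder layout is one directory per tag with one subdirectory per course. A minimal sketch of that layout in Python 3 follows, run against a temporary directory so it can be tried safely; `ensure_course_dir` is a hypothetical helper name, not a function from the original script.

```python
import os
import tempfile


def ensure_course_dir(base_path, course_tag, course_id):
    """Create <base_path>/<course_tag>/<course_id> if needed and return it."""
    course_path = os.path.join(base_path, course_tag, str(course_id))
    # exist_ok=True makes repeated calls harmless (Python 3.2+).
    os.makedirs(course_path, exist_ok=True)
    return course_path


base = tempfile.mkdtemp()  # stands in for a real folder such as f:/jike/
first = ensure_course_dir(base, "python", 776)
second = ensure_course_dir(base, "python", 776)  # second call is a no-op
print(first == second, os.path.isdir(first))
```

Building the path with `os.path.join` avoids hand-written separators, which is where the original concatenation was easiest to get wrong.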

`get_xpath_lists: a dedicated XPath helper, so the parsing does not have to be written out every time`

    def get_xpath_lists(self, url, headers, xpath):
        try:
            html = requests.get(url, headers=headers).text
            tree = etree.HTML(html)
            lists = [str(plist) for plist in tree.xpath(xpath) if plist]
        except Exception as e:
            print("failed to get the xpath list, the error is: %s" % e)
            return
        return lists

`addTasktoXunlei: add a Xunlei task. Xunlei must be installed and set to not prompt by default; otherwise every task has to be confirmed by hand.`

    def addTasktoXunlei(self, down_url):
        flag = False
        # Imported here because pywin32 is only available on Windows
        from win32com.client import Dispatch
        o = Dispatch("ThunderAgent.Agent.1")
        # Example address:
        # http://cv3.jikexueyuan.com/201508011650/a396d5f2b9a19e8438da3ea888e4cc73/python/course_776/01/video/c776b_01_h264_sd_960_540.mp4
        if down_url:
            course_infos = str(down_url).replace(" ", "").replace("http://", "").split("/")
            course_path = self.get_file_lists(self.course_tag, self.course_id)
            try:
                # The last path segment of the address is used as the file name
                o.AddTask(down_url, course_infos[-1], course_path, "", "http://cv3.jikexueyuan.com", 1, 0, 5)
                o.CommitTasks()
                flag = True
            except Exception as e:
                print(e)
                print("                     AddTask failed!")
        return flag


if __name__ == "__main__":
    # The constructor already starts the crawl through get_tags(); the class defines no run() method
    myjike = jike_auto_down("f://jike//", "http://www.jikexueyuan.com/search/s/q_")
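The arguments handed to `AddTask` can be assembled and inspected without Xunlei or Windows. In the sketch below `xunlei_task_args` is a hypothetical helper; it only mirrors the positional values used above and, unlike the original, strips stray spaces from the address before passing it on. The source does not explain the trailing values `1, 0, 5`, so they are copied unchanged.

```python
def xunlei_task_args(down_url, course_path):
    """Build the positional arguments that the script passes to AddTask."""
    cleaned = str(down_url).replace(" ", "")
    file_name = cleaned.rsplit("/", 1)[-1]  # last path segment
    referer = "http://cv3.jikexueyuan.com"
    # The trailing values 1, 0, 5 are copied from the script as-is.
    return (cleaned, file_name, course_path, "", referer, 1, 0, 5)


args = xunlei_task_args(
    "http://cv3.jikexueyuan.com/201508011650/a396d5f2b9a19e8438da3ea888e4cc73/"
    "python/course_776/01/video/c776b_01_h264_sd_960_540.mp4",
    "f:/jike/python/776",
)
print(args[1])
```

On a Windows machine with Xunlei installed, the tuple could then be unpacked into the COM call as `o.AddTask(*args)`.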







