An Introduction to Abot, an Open-Source .NET Web Crawler

2019-11-14 16:38:42
Contributed by: a site user

There are many open-source crawlers for .NET, and Abot is one of them. Abot is an open-source .NET crawler that is fast and easy to use and extend. The project is hosted at https://code.google.com/p/abot/

To analyze the crawled HTML, Abot uses CsQuery. CsQuery is essentially a .NET implementation of jQuery, so you can process an HTML page with jQuery-like methods. The CsQuery project is at https://github.com/afeiship/CsQuery

I. Configuring the Abot crawler

1. Setting properties in code

First create a config object, then set its properties:

CrawlConfiguration crawlConfig = new CrawlConfiguration();
crawlConfig.CrawlTimeoutSeconds = 100;
crawlConfig.MaxConcurrentThreads = 10;
crawlConfig.MaxPagesToCrawl = 1000;
crawlConfig.UserAgentString = "abot v1.0 http://code.google.com/p/abot";
// Custom key/value pairs that travel with the configuration
crawlConfig.ConfigurationExtensions.Add("SomeCustomConfigValue1", "1111");
crawlConfig.ConfigurationExtensions.Add("SomeCustomConfigValue2", "2222");

2. Using App.config

Load the configuration from the config file. You can still override individual properties afterwards:

// Read the settings from App.config
CrawlConfiguration crawlConfig = AbotConfigurationSectionHandler.LoadFromXml().Convert();
// Override selected values in code
crawlConfig.CrawlTimeoutSeconds = 100;
crawlConfig.MaxConcurrentThreads = 10;

3. Applying the configuration to the crawler

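// Option 1: create the crawler without an explicit configuration object
PoliteWebCrawler crawler = new PoliteWebCrawler();

// Option 2 (use instead of option 1): pass in the crawlConfig built above
PoliteWebCrawler crawler = new PoliteWebCrawler(crawlConfig, null, null, null, null, null, null, null);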

II. Using the crawler: registering events

The crawler exposes four main events: page crawl starting, page crawl completed, page crawl disallowed, and page links crawl disallowed.

Sample code:

crawler.PageCrawlStartingAsync += crawler_ProcessPageCrawlStarting;          // a single page is about to be crawled
crawler.PageCrawlCompletedAsync += crawler_ProcessPageCrawlCompleted;        // a single page has been crawled
crawler.PageCrawlDisallowedAsync += crawler_PageCrawlDisallowed;             // a page was not allowed to be crawled
crawler.PageLinksCrawlDisallowedAsync += crawler_PageLinksCrawlDisallowed;   // the links on a page were not allowed to be crawled

void crawler_ProcessPageCrawlStarting(object sender, PageCrawlStartingArgs e)
{
    PageToCrawl pageToCrawl = e.PageToCrawl;
    Console.WriteLine("About to crawl link {0} which was found on page {1}", pageToCrawl.Uri.AbsoluteUri, pageToCrawl.ParentUri.AbsoluteUri);
}

void crawler_ProcessPageCrawlCompleted(object sender, PageCrawlCompletedArgs e)
{
    CrawledPage crawledPage = e.CrawledPage;

    // Treat a transport error or a non-200 status as a failed crawl
    if (crawledPage.WebException != null || crawledPage.HttpWebResponse.StatusCode != HttpStatusCode.OK)
        Console.WriteLine("Crawl of page failed {0}", crawledPage.Uri.AbsoluteUri);
    else
        Console.WriteLine("Crawl of page succeeded {0}", crawledPage.Uri.AbsoluteUri);

    if (string.IsNullOrEmpty(crawledPage.Content.Text))
        Console.WriteLine("Page had no content {0}", crawledPage.Uri.AbsoluteUri);
}

void crawler_PageLinksCrawlDisallowed(object sender, PageLinksCrawlDisallowedArgs e)
{
    CrawledPage crawledPage = e.CrawledPage;
    Console.WriteLine("Did not crawl the links on page {0} due to {1}", crawledPage.Uri.AbsoluteUri, e.DisallowedReason);
}

void crawler_PageCrawlDisallowed(object sender, PageCrawlDisallowedArgs e)
{
    PageToCrawl pageToCrawl = e.PageToCrawl;
    Console.WriteLine("Did not crawl page {0} due to {1}", pageToCrawl.Uri.AbsoluteUri, e.DisallowedReason);
}

III. Attaching custom objects to the crawler

Abot probably borrowed the idea from ViewBag in ASP.NET MVC: it provides a crawler-level CrawlBag and a page-level PageBag.

PoliteWebCrawler crawler = new PoliteWebCrawler();
crawler.CrawlBag.MyFoo1 = new Foo();   // crawler-level CrawlBag
crawler.CrawlBag.MyFoo2 = new Foo();
crawler.PageCrawlStartingAsync += crawler_ProcessPageCrawlStarting;
...

void crawler_ProcessPageCrawlStarting(object sender, PageCrawlStartingArgs e)
{
    // Retrieve the objects stored in the CrawlBag
    CrawlContext context = e.CrawlContext;
    context.CrawlBag.MyFoo1.Bar();     // use the CrawlBag
    context.CrawlBag.MyFoo2.Bar();

    // Use the page-level PageBag
    e.PageToCrawl.PageBag.Bar = new Bar();
}
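
The bag syntax above relies on members being attached at run time rather than declared in advance. As a rough illustration only, the sketch below reproduces the same pattern outside Abot with the standard ExpandoObject class. It assumes, without confirming it from the Abot source, that Abot's bags behave in a similar way. Foo, BagSketch, CreateBag and HasMember are hypothetical names invented for this example.

```csharp
using System;
using System.Collections.Generic;
using System.Dynamic;

// Hypothetical stand-in for the objects stored in the bag; not an Abot type.
public class Foo
{
    public string Bar()
    {
        return "Foo.Bar called";
    }
}

public static class BagSketch
{
    // Creates a dynamic bag and attaches two members at run time,
    // in the same style as crawler.CrawlBag.MyFoo1 = new Foo();
    public static dynamic CreateBag()
    {
        dynamic bag = new ExpandoObject();
        bag.MyFoo1 = new Foo();
        bag.MyFoo2 = new Foo();
        return bag;
    }

    // Returns true when a member with the given name has been attached.
    // ExpandoObject exposes its members through IDictionary<string, object>.
    public static bool HasMember(object bag, string name)
    {
        return ((IDictionary<string, object>)bag).ContainsKey(name);
    }

    public static void Main()
    {
        dynamic bag = CreateBag();

        // Member access is resolved at run time, so no cast is needed here.
        Console.WriteLine(bag.MyFoo1.Bar());

        // Reading a member that was never attached throws at run time,
        // so checking first is a reasonable habit in event handlers.
        Console.WriteLine(HasMember(bag, "MyFoo3"));
    }
}
```

Because nothing is checked at compile time, a misspelled member name in a handler only surfaces when that handler runs. A check such as HasMember is one way to guard against that.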

IV. Starting the crawl

Starting the crawler is simple: call the Crawl method with the start page.
CrawlResult result = crawler.Crawl(new Uri("http://localhost:1111/"));

if (result.ErrorOccurred)
    Console.WriteLine("Crawl of {0} completed with error: {1}", result.RootUri.AbsoluteUri, result.ErrorException.Message);
else
    Console.WriteLine("Crawl of {0} completed without error.", result.RootUri.AbsoluteUri);

V. A brief look at CsQuery

In the PageCrawlCompletedAsync event, e.CrawledPage.CsQueryDocument is a CsQuery object.

Here is what makes CsQuery convenient for parsing HTML:

cqDocument.Select(".bigtitle > h1")
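The selector syntax follows jQuery's. This one selects h1 elements that are direct children of an element with the class bigtitle. If you are fluent in jQuery, CsQuery is quick and easy to pick up.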
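
To try the selector without running a crawl, the following minimal sketch parses an HTML string directly with CsQuery. It assumes the CsQuery NuGet package is referenced. CsQuerySketch, ExtractTitle and the sample HTML are hypothetical and exist only for this example.

```csharp
using System;
using CsQuery;

public static class CsQuerySketch
{
    // Returns the text of the h1 elements that are direct children
    // of an element with the class "bigtitle".
    public static string ExtractTitle(string html)
    {
        CQ dom = CQ.Create(html);                 // parse the HTML string
        return dom.Select(".bigtitle > h1").Text();
    }

    public static void Main()
    {
        // Sample markup: only the first h1 sits directly under .bigtitle.
        string html = "<div class=\"bigtitle\"><h1>Hello Abot</h1></div><h1>Other</h1>";
        Console.WriteLine(ExtractTitle(html));
    }
}
```

Inside a PageCrawlCompletedAsync handler, the same Select call can be made on e.CrawledPage.CsQueryDocument instead of a freshly parsed string.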
