Fetching Page Content from Other Websites Remotely with .NET

2024-07-10 13:11:27

Contributed by: a site user

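Fetching web page content remotely, with some processing and a bit of flexibility, can be built up into a complete content-collection system for a website. Such a system is often nicknamed a "news thief". Generally speaking, content collection breaks down into the following rough steps:

1. Fetch the full HTML source text of the remote page.

2. Filter the source and pick out the useful content. (Regular expressions are the usual tool for cutting out the data you want.)

3. Split the cleaned data into title, body, and any other attributes according to your own database schema, and save it to your local database.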
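
The three steps above can be sketched end to end as follows. This is only an illustration under stated assumptions: the names `Article`, `CollectorSketch`, and `Collect` are hypothetical, the `<title>`/`<body>` pattern is a placeholder rather than a pattern for any real site, the fetch step is passed in as a delegate so the sketch runs without network access, and an in-memory list stands in for the local database.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical record for one collected item (not from the original article).
public sealed class Article
{
    public string Title { get; set; }
    public string Body { get; set; }
}

public static class CollectorSketch
{
    // fetchHtml stands in for step 1; store stands in for the local database of step 3.
    public static Article Collect(string url, Func<string, string> fetchHtml, IList<Article> store)
    {
        // Step 1: fetch the raw HTML source.
        string html = fetchHtml(url);
        if (html == null) return null;

        // Step 2: cut out the useful parts with a regular expression.
        // Placeholder pattern; a real page needs a pattern written for its own markup.
        Match m = Regex.Match(
            html,
            "<title>(.*?)</title>.*?<body>(.*?)</body>",
            RegexOptions.Singleline | RegexOptions.IgnoreCase);
        if (!m.Success) return null;

        // Step 3: map the captures onto your own fields and save them.
        var article = new Article
        {
            Title = m.Groups[1].Value.Trim(),
            Body = m.Groups[2].Value.Trim()
        };
        store.Add(article);
        return article;
    }
}
```

Passing the fetch step as a delegate keeps the three stages independent, so each one can be replaced by the concrete code later in the article.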

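That is the whole collection process, and the principle behind it is not difficult either. Let's look at the basic code that implements it.

First, we write a method that fetches the remote HTML source.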

using System;
using System.IO;
using System.Net;
using System.Text;

public string GetHttpData(string url)
{
    string result = null;
    WebRequest request = WebRequest.Create(url);
    request.Timeout = 50000; // milliseconds
    try
    {
        // The using blocks close the response and the reader even if reading fails.
        using (WebResponse response = request.GetResponse())
        using (StreamReader reader = new StreamReader(
            response.GetResponseStream(), Encoding.GetEncoding("gb2312")))
        {
            result = reader.ReadToEnd();
        }
    }
    catch (WebException e)
    {
        // Response is assumed to be the HttpResponse of the ASP.NET page hosting this method.
        Response.Write(e.Message);
    }
    catch (Exception e)
    {
        Response.Write(e.ToString());
    }
    // null means the page could not be fetched.
    return result;
}

The code above is a method for fetching remote HTML source. It takes a single parameter, the full URL of the target page, and returns the HTML source as a string.
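
On newer runtimes the same step is often written with `HttpClient` instead of `WebRequest`. The following is a minimal sketch, not the author's method. It assumes .NET Core 3.0 or later (where `CodePagesEncodingProvider` ships with the framework), and the names `PageFetcher`, `Decode`, and `FetchAsync` are hypothetical. The 50-second timeout and the `gb2312` default mirror the values used above.

```csharp
using System;
using System.Net.Http;
using System.Text;
using System.Threading.Tasks;

public static class PageFetcher
{
    // One shared client; same 50-second timeout as the WebRequest version.
    private static readonly HttpClient Client =
        new HttpClient { Timeout = TimeSpan.FromSeconds(50) };

    static PageFetcher()
    {
        // On .NET Core / .NET 5+, legacy code pages such as gb2312
        // are only available after this provider is registered.
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);
    }

    // Decodes raw response bytes with the named character set.
    public static string Decode(byte[] bytes, string charset)
    {
        return Encoding.GetEncoding(charset).GetString(bytes);
    }

    // Returns the page source, or null if the request fails or times out.
    public static async Task<string> FetchAsync(string url, string charset = "gb2312")
    {
        try
        {
            byte[] bytes = await Client.GetByteArrayAsync(url);
            return Decode(bytes, charset);
        }
        catch (HttpRequestException)
        {
            return null; // network error or non-success status code
        }
        catch (TaskCanceledException)
        {
            return null; // the request timed out
        }
    }
}
```

A call would look like `string html = await PageFetcher.FetchAsync("http://example.com/");`. Downloading bytes and decoding them explicitly keeps control of the character set, which matters for pages served in `gb2312` rather than UTF-8.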

Next comes the second step: picking out the data we actually need. Here I assume a particular page layout for the analysis.

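using System.Text.RegularExpressions;

public string[] GetData(string html)
{
    string[] rs = new string[2];
    string s = html;
    // Strip runs of three or more whitespace characters, then any remaining
    // line breaks, so the pattern can match across what were separate lines.
    s = Regex.Replace(s, "\\s{3,}", "");
    s = s.Replace("\r", "");
    s = s.Replace("\n", "");
    string pat = "<td align=\"center\" class=\"24p\"><b>(.*)</b></td></tr><tr>.*(<table width=\"95%\" border=\"0\" cellspacing=\"0\" cellpadding=\"10\">.*</table>)<table width=\"98%\" border=\"0\" cellspacing=\"0\" cellpadding=\"0\">(.*)<td align=center class=l6h>";
    Regex re = new Regex(pat);
    Match ma = re.Match(s);
    if (ma.Success)
    {
        rs[0] = ma.Groups[1].ToString(); // text of the bold heading cell
        rs[1] = ma.Groups[2].ToString(); // the 95%-wide content table
        // pgstr is not declared in this excerpt; it is assumed to be a
        // string field of the enclosing class.
        pgstr = ma.Groups[3].ToString();
    }
    // Both entries stay null if the page does not match the pattern.
    return rs;
}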
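
To make the normalize-then-match idea easier to try out, here is a small self-contained sketch on made-up markup. The class `ExtractSketch`, its methods, and the `title`/`body` class names in the pattern are all hypothetical; only the whitespace handling follows the method above. It uses non-greedy quantifiers (`.*?`) and named groups, which is a substitution for the greedy `(.*)` groups above, chosen because greedy groups can swallow more than intended when a page contains several similar blocks.

```csharp
using System;
using System.Text.RegularExpressions;

public static class ExtractSketch
{
    // Same cleanup as above: remove long whitespace runs, then stray line breaks.
    public static string Normalize(string html)
    {
        string s = Regex.Replace(html, "\\s{3,}", "");
        return s.Replace("\r", "").Replace("\n", "");
    }

    // Returns { title, body }, or null when the markup does not match.
    public static string[] Extract(string html)
    {
        string s = Normalize(html);
        // Hypothetical markup; a real pattern must be written for the target page.
        Match m = Regex.Match(
            s,
            "<td class=\"title\"><b>(?<title>.*?)</b></td>.*?<div class=\"body\">(?<body>.*?)</div>");
        if (!m.Success) return null;
        return new[] { m.Groups["title"].Value, m.Groups["body"].Value };
    }
}
```

Returning null on a failed match lets the caller tell "page layout changed" apart from "page had empty fields", which the array of nulls in the method above cannot do.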