利用Python搭建的簡(jiǎn)易排序搜索引擎

2019-11-08 19:46:06

字體：大中小

供稿：網(wǎng)友

本文源代碼轉(zhuǎn)自搜索引擎原理，博主進(jìn)行整理調(diào)BUG并進(jìn)行注釋?zhuān)瑢?duì)于Python初學(xué)者來(lái)說(shuō)是了解爬蟲(chóng)、網(wǎng)頁(yè)排序算法非常好的素材。

首先來(lái)介紹一下PageRank網(wǎng)頁(yè)排序算法（注：轉(zhuǎn)自PageRank算法簡(jiǎn)介及Map-Reduce實(shí)現(xiàn)，詳情點(diǎn)擊鏈接）：

PageRank對(duì)網(wǎng)頁(yè)排名的算法，曾是Google發(fā)家致富的法寶。以前雖然有實(shí)驗(yàn)過(guò)，但理解還是不透徹，這幾天又看了一下，這里總結(jié)一下PageRank算法的基本原理。

一、什么是pagerank

PageRank的Page可是認(rèn)為是網(wǎng)頁(yè)，表示網(wǎng)頁(yè)排名，也可以認(rèn)為是Larry Page(google 產(chǎn)品經(jīng)理)，因?yàn)樗沁@個(gè)算法的發(fā)明者之一，還是google CEO（^_^）。PageRank算法計(jì)算每一個(gè)網(wǎng)頁(yè)的PageRank值，然后根據(jù)這個(gè)值的大小對(duì)網(wǎng)頁(yè)的重要性進(jìn)行排序。它的思想是模擬一個(gè)悠閑的上網(wǎng)者，上網(wǎng)者首先隨機(jī)選擇一個(gè)網(wǎng)頁(yè)打開(kāi)，然后在這個(gè)網(wǎng)頁(yè)上呆了幾分鐘后，跳轉(zhuǎn)到該網(wǎng)頁(yè)所指向的鏈接，這樣無(wú)所事事、漫無(wú)目的地在網(wǎng)頁(yè)上跳來(lái)跳去，PageRank就是估計(jì)這個(gè)悠閑的上網(wǎng)者分布在各個(gè)網(wǎng)頁(yè)上的概率。

二、最簡(jiǎn)單pagerank模型

互聯(lián)網(wǎng)中的網(wǎng)頁(yè)可以看出是一個(gè)有向圖，其中網(wǎng)頁(yè)是結(jié)點(diǎn)，如果網(wǎng)頁(yè)A有鏈接到網(wǎng)頁(yè)B，則存在一條有向邊A->B，下面是一個(gè)簡(jiǎn)單的示例：

這個(gè)例子中只有四個(gè)網(wǎng)頁(yè)，如果當(dāng)前在A網(wǎng)頁(yè)，那么悠閑的上網(wǎng)者將會(huì)各以1/3的概率跳轉(zhuǎn)到B、C、D，這里的3表示A有3條出鏈，如果一個(gè)網(wǎng)頁(yè)有k條出鏈，那么跳轉(zhuǎn)任意一個(gè)出鏈上的概率是1/k，同理D到B、C的概率各為1/2，而B(niǎo)到C的概率為0。一般用轉(zhuǎn)移矩陣表示上網(wǎng)者的跳轉(zhuǎn)概率，如果用n表示網(wǎng)頁(yè)的數(shù)目，則轉(zhuǎn)移矩陣M是一個(gè)n*n的方陣；如果網(wǎng)頁(yè)j有k個(gè)出鏈，那么對(duì)每一個(gè)出鏈指向的網(wǎng)頁(yè)i，有M[i][j]=1/k，而其他網(wǎng)頁(yè)的M[i][j]=0；上面示例圖對(duì)應(yīng)的轉(zhuǎn)移矩陣如下：

初試時(shí)，假設(shè)上網(wǎng)者在每一個(gè)網(wǎng)頁(yè)的概率都是相等的，即1/n，于是初試的概率分布就是一個(gè)所有值都為1/n的n維列向量V0，用V0去右乘轉(zhuǎn)移矩陣M，就得到了第一步之后上網(wǎng)者的概率分布向量MV0,（nXn）*(nX1)依然得到一個(gè)nX1的矩陣。下面是V1的計(jì)算過(guò)程：

注意矩陣M中M[i][j]不為0表示用一個(gè)鏈接從j指向i，M的第一行乘以V0，表示累加所有網(wǎng)頁(yè)到網(wǎng)頁(yè)A的概率即得到9/24。得到了V1后，再用V1去右乘M得到V2，一直下去，最終V會(huì)收斂，即Vn=MV(n-1)，上面的圖示例，不斷的迭代，最終V=[3/9,2/9,2/9,2/9]’：

現(xiàn)在應(yīng)該能夠基本理解PageRank算法了，那么附上代碼：

#the search engine is divided into 3 modules:web crawl,build and use of index,page rank#----------------------------web_crawl--------------------------------def get_page(url):    try:        import urllib #引入urllib庫(kù)        return urllib.urlopen(url).read() #打開(kāi)網(wǎng)站并讀取    except:        return ""  #拋出異常def get_next_target(page): #獲取下一個(gè)鏈接網(wǎng)站    start_link = page.find('"') #str.find(str, beg=0 end=len(string)) 如果找到此方法返回的索引，否則返回-1;參數(shù)：此選項(xiàng)指定要搜索的字符串/這是開(kāi)始索引，默認(rèn)情況下為 0/這是結(jié)束索引，默認(rèn)情況下它等于字符串的長(zhǎng)度    if start_link == -1:        return None, 0    start_quote = page.find('"', start_link)    end_quote = page.find('"', start_quote + 1)    url = page[start_quote + 1:end_quote]    return url, end_quotedef get_all_links(page): #獲取網(wǎng)頁(yè)上面所有的url    links = [] #存取url的數(shù)組    while True:        url, endpos = get_next_target(page)        if url:            links.append(url)            page = page[endpos:]        else:            break    return linksdef union(a, b): #并查集    for e in b:        if e not in a:            a.append(e)def crawl_web(seed): # returns index, graph of inlinks    tocrawl = [seed]    crawled = []    graph = {}  # , [list of pages it links to]    index = {}    while tocrawl:        page = tocrawl.pop()        if page not in crawled:            content = get_page(page)            add_page_to_index(index, page, content)            outlinks = get_all_links(content)            graph[page] = outlinks            union(tocrawl, outlinks)            crawled.append(page)    return index, graph#--------------------------------build_index-----------------------------------def add_page_to_index(index, url, content):    Words = content.split() #str.split(str="", num=string.count(str)) 返回分割后的字符串列表    for word in words:        PRint word        add_to_index(index, word, url) def add_to_index(index, keyword, url):    if keyword in index:        index[keyword].append(url) #存儲(chǔ)url，類(lèi)似map    else:        index[keyword] = [url] def lookup(index, keyword): #查找索引    if keyword in index:        return index[keyword]    else:        return None    #---------------------------------page_rank---------------------------------def compute_ranks(graph):    d = 0.8 # damping factor    numloops = 10    ranks = {}    npages = len(graph)    for page in graph:         ranks[page] = 1.0 / npages    for i in range(0, numloops):        newranks = {}        for page in graph:            newrank = (1 - d) / npages            for node in graph:                if page in graph[node]:                    newrank = newrank + d * (ranks[node] / len(graph[node]))            newranks[page] = newrank        ranks = newranks    return ranksdef quick_sort(url_lst,ranks): #對(duì)網(wǎng)頁(yè)權(quán)值進(jìn)行快排, 不穩(wěn)定    url_sorted_worse=[]    url_sorted_better=[]    if len(url_lst)<=1:        return url_lst    pivot=url_lst[0]    for url in url_lst[1:]:        if ranks[url]<=ranks[pivot]:            url_sorted_worse.append(url)        else:            url_sorted_better.append(url)    return quick_sort(url_sorted_better,ranks)+[pivot]+quick_sort(url_sorted_worse,ranks)def ordered_search(index, ranks, keyword): #搜索結(jié)果排序    if keyword in index:        all_urls=index[keyword]    else:        return None    return quick_sort(all_urls,ranks)#---------------------------------test-----------------------------------index, graph = crawl_web('http://udacity.com/cs101x/urank/index.html')ranks = compute_ranks(graph)print ordered_search(index, ranks, 'Hummus')#>>> ['http://udacity.com/cs101x/urank/kathleen.html',#    'http://udacity.com/cs101x/urank/nickel.html',#    'http://udacity.com/cs101x/urank/arsenic.html',#    'http://udacity.com/cs101x/urank/hummus.html',#    'http://udacity.com/cs101x/urank/index.html']#print ordered_search(index, ranks, 'the')#>>> ['http://udacity.com/cs101x/urank/nickel.html',#    'http://udacity.com/cs101x/urank/arsenic.html',#    'http://udacity.com/cs101x/urank/hummus.html',#    'http://udacity.com/cs101x/urank/index.html']#print ordered_search(index, ranks, 'babaganoush')#>>> None
代碼經(jīng)調(diào)試可用，但性能有限，爬取的網(wǎng)頁(yè)會(huì)經(jīng)常出現(xiàn)亂碼，源網(wǎng)頁(yè)的編碼方式不得而知。亂碼的出現(xiàn)直接導(dǎo)致了最后輸出的網(wǎng)頁(yè)排名有缺漏，目前這個(gè)問(wèn)題還沒(méi)有解決，爭(zhēng)取在之后的調(diào)試中解決這個(gè)問(wèn)題。哈哈哈，祝大家新年快樂(lè)，好好學(xué)習(xí)，天天向上~<(￣ˇ￣)/<(￣ˇ￣)/

上一篇：python使用win32*模塊模擬人工操作——城通網(wǎng)盤(pán)下載器（一）

下一篇：python安裝requests庫(kù)