Implementing the LDA model in Python

2019-11-14 17:05:03


Contributed by: a reader

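LDA (Latent Dirichlet Allocation) is a generative topic model for documents. I have been reading about it recently and decided to implement it in Python. For the underlying mathematics, plenty of material can be found through a web search; one fairly detailed document I consulted earlier is "lda算法漫游指南" (a guided tour of the LDA algorithm).
This post covers only the Python implementation of the algorithm's sampling method.
The complete implementation is open source as the python-LDA project.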

Allocating and initializing the LDA model variables

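## Pseudocode #
Input: the document collection (already tokenized), K (the number of topics)
Output: an LDA model in which every word has been randomly assigned a topic once
begin
    allocate the count statistics:
        p      probability vector                          dimension: K
        nw     word counts per topic                       dimension: M*K, where M is the number of distinct words in the collection
        nwsum  total number of words in each topic         dimension: K
        nd     per-document word counts for each topic     dimension: V*K, where V is the number of documents
        ndsum  total number of words in each document      dimension: V
        Z      topic assigned to each word                 dimension: V * (number of words in each document)
        theta  document -> topic probability distribution  dimension: V*K
        phi    topic -> word probability distribution      dimension: K*M
    # initialize with a random topic assignment
    for x in documents:
        set ndsum[x] to the number of words in document x
        for y in words of document x:
            assign the word a random topic
            count of this word in this topic + 1
            count of this topic in this document + 1
            total word count of this topic + 1
end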

## Implementation snippet; see the GitHub project for more detail #
import random
import numpy as np

# K, alpha, beta, iter_times, top_words_num and the file names used below are
# not defined in this snippet; they are assumed to be module-level settings
# defined elsewhere in the full project.

class LDAModel(object):

    def __init__(self, dpre):
        self.dpre = dpre  # preprocessed corpus
        #
        # Model parameters:
        # number of topics K, number of iterations iter_times, number of
        # characteristic words per topic top_words_num, hyperparameters
        # alpha and beta
        #
        self.K = K
        self.beta = beta
        self.alpha = alpha
        self.iter_times = iter_times
        self.top_words_num = top_words_num
        #
        # File variables:
        # tokenized training file trainfile
        # word -> id map file wordidmapfile
        # document-topic distribution file thetafile
        # word-topic distribution file phifile
        # top-N words per topic file topNfile
        # final topic assignment file tassginfile
        # file recording the chosen training parameters paramfile
        #
        self.wordidmapfile = wordidmapfile
        self.trainfile = trainfile
        self.thetafile = thetafile
        self.phifile = phifile
        self.topNfile = topNfile
        self.tassginfile = tassginfile
        self.paramfile = paramfile
        # p: probability vector (float), scratch space used while sampling
        # nw: count of each word in each topic
        # nwsum: total number of words in each topic
        # nd: number of words of each topic in each doc
        # ndsum: total number of words in each doc
        self.p = np.zeros(self.K)
        self.nw = np.zeros((self.dpre.words_count, self.K), dtype="int")
        self.nwsum = np.zeros(self.K, dtype="int")
        self.nd = np.zeros((self.dpre.docs_count, self.K), dtype="int")
        self.ndsum = np.zeros(self.dpre.docs_count, dtype="int")
        # Z: topic of every word in every doc. Docs differ in length, so a
        # list of lists is used; recent NumPy rejects ragged np.array input.
        self.Z = [[0 for y in range(self.dpre.docs[x].length)]
                  for x in range(self.dpre.docs_count)]

        # Start from a random topic assignment
        for x in range(len(self.Z)):
            self.ndsum[x] = self.dpre.docs[x].length
            for y in range(self.dpre.docs[x].length):
                topic = random.randint(0, self.K - 1)  # both bounds inclusive
                self.Z[x][y] = topic
                self.nw[self.dpre.docs[x].words[y]][topic] += 1
                self.nd[x][topic] += 1
                self.nwsum[topic] += 1

        self.theta = np.zeros((self.dpre.docs_count, self.K))
        self.phi = np.zeros((self.K, self.dpre.words_count))
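The initialization step can also be tried in isolation. Below is a minimal sketch on a toy corpus; the function name `init_counts`, its parameters, and the representation of each document as a plain list of word ids are assumptions made for illustration, not part of the python-LDA project.

```python
import random
import numpy as np

def init_counts(docs, vocab_size, num_topics, seed=0):
    """Randomly assign a topic to every word and build the count tables.

    docs is a list of documents, each a list of integer word ids
    (a hypothetical stand-in for the preprocessed corpus).
    """
    rng = random.Random(seed)
    nw = np.zeros((vocab_size, num_topics), dtype=int)   # word-topic counts
    nwsum = np.zeros(num_topics, dtype=int)              # words per topic
    nd = np.zeros((len(docs), num_topics), dtype=int)    # doc-topic counts
    ndsum = np.zeros(len(docs), dtype=int)               # words per doc
    Z = []                                               # topic of each word
    for d, words in enumerate(docs):
        ndsum[d] = len(words)
        z_doc = []
        for w in words:
            topic = rng.randint(0, num_topics - 1)  # both bounds inclusive
            z_doc.append(topic)
            nw[w][topic] += 1
            nd[d][topic] += 1
            nwsum[topic] += 1
        Z.append(z_doc)
    return nw, nwsum, nd, ndsum, Z

# Toy corpus: 3 documents over a vocabulary of 4 word ids
docs = [[0, 1, 2, 1], [2, 3], [0, 3, 3]]
nw, nwsum, nd, ndsum, Z = init_counts(docs, vocab_size=4, num_topics=2)
```

After initialization the tables are consistent with one another: each row of `nd` sums to the matching entry of `ndsum`, and `nw` and `nwsum` both sum to the total number of words in the corpus.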

The sampling process

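## Pseudocode #
Input: the initialized lda_model, the number of iterations iter_times, hyperparameters alpha and beta, the number of topics K
Output: theta (topic distribution of each document), phi (word distribution of each topic), tassgin (topic assigned to each word in each document), twords (top-N most frequent words of each topic)
begin
    for i in iterations:
        for m in documents:
            for v in word positions of document m:
                let w be the id of the word at position v
                topic = Z[m][v]
                decrement nw[w][topic], nwsum[topic] and nd[m][topic] by 1
                compute the probabilities p[]   # p[k] is the probability that this word belongs to topic k
                for k in 1 .. K-1:
                    p[k] += p[k-1]
                draw a new random assignment from p and record it as new_topic
                increment nw[w][new_topic], nwsum[new_topic] and nd[m][new_topic] by 1
    # after the iterations finish
    output the model
end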
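The "compute the probabilities p[]" step is the standard collapsed Gibbs sampling conditional for LDA. It is written out here in the notation of the count tables: $n_{w,k}$ is nw[w][k], $n_{k}$ is nwsum[k], $n_{m,k}$ is nd[m][k], $n_{m}$ is ndsum[m], and $M$ is the number of distinct words. The superscript $\neg i$ means that the word currently being resampled has already been removed from the counts.

```latex
p(z_i = k \mid \mathbf{z}_{\neg i}, \mathbf{w}) \;\propto\;
\frac{n_{w,k}^{\neg i} + \beta}{n_{k}^{\neg i} + M\beta}
\cdot
\frac{n_{m,k}^{\neg i} + \alpha}{n_{m}^{\neg i} + K\alpha}
```

The right-hand side is unnormalized, which is why the pseudocode accumulates p[] into running sums and then draws a uniform number between 0 and the last running sum instead of normalizing first.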

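# Code snippet
    def sampling(self, i, j):
        # Resample the topic of the j-th word of document i
        topic = self.Z[i][j]
        word = self.dpre.docs[i].words[j]
        # Remove the current assignment from the counts
        self.nw[word][topic] -= 1
        self.nd[i][topic] -= 1
        self.nwsum[topic] -= 1
        self.ndsum[i] -= 1

        Vbeta = self.dpre.words_count * self.beta
        Kalpha = self.K * self.alpha
        # Unnormalized probability of each topic for this word
        self.p = ((self.nw[word] + self.beta) / (self.nwsum + Vbeta) *
                  (self.nd[i] + self.alpha) / (self.ndsum[i] + Kalpha))
        # Turn p into running sums, then pick the first topic whose
        # running sum exceeds a uniform draw
        for k in range(1, self.K):
            self.p[k] += self.p[k-1]
        u = random.uniform(0, self.p[self.K-1])
        for topic in range(self.K):
            if self.p[topic] > u:
                break

        # Add the new assignment back into the counts
        self.nw[word][topic] += 1
        self.nwsum[topic] += 1
        self.nd[i][topic] += 1
        self.ndsum[i] += 1
        return topic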
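To show how the sampling step fits into the full loop, here is a self-contained sketch that runs the sweeps on a toy corpus and then estimates theta and phi from the final counts. The function `gibbs_lda`, its parameters, and the list-of-word-ids corpus format are hypothetical. The theta and phi formulas are the usual smoothed count estimates, and it is an assumption that the full project computes them the same way. `np.cumsum` and `np.searchsorted` replace the explicit loops of the snippet above.

```python
import numpy as np

def gibbs_lda(docs, vocab_size, num_topics, alpha, beta, iters, seed=0):
    """Collapsed Gibbs sampling for LDA on a list of word-id lists (sketch)."""
    rng = np.random.default_rng(seed)
    nw = np.zeros((vocab_size, num_topics))      # word-topic counts
    nwsum = np.zeros(num_topics)                 # words per topic
    nd = np.zeros((len(docs), num_topics))       # doc-topic counts
    ndsum = np.array([len(d) for d in docs], dtype=float)

    # Random initial assignment
    Z = [[int(rng.integers(num_topics)) for _ in d] for d in docs]
    for d, words in enumerate(docs):
        for j, w in enumerate(words):
            t = Z[d][j]
            nw[w, t] += 1
            nd[d, t] += 1
            nwsum[t] += 1

    for _ in range(iters):
        for d, words in enumerate(docs):
            for j, w in enumerate(words):
                # Remove the current assignment of this word
                t = Z[d][j]
                nw[w, t] -= 1
                nd[d, t] -= 1
                nwsum[t] -= 1
                # Unnormalized conditional over topics
                p = ((nw[w] + beta) / (nwsum + vocab_size * beta)
                     * (nd[d] + alpha) / (ndsum[d] - 1 + num_topics * alpha))
                # Running sums, then the first index whose sum exceeds u
                cdf = np.cumsum(p)
                u = rng.uniform(0.0, cdf[-1])
                t = min(int(np.searchsorted(cdf, u, side="right")),
                        num_topics - 1)
                # Record the new assignment
                Z[d][j] = t
                nw[w, t] += 1
                nd[d, t] += 1
                nwsum[t] += 1

    # Smoothed estimates of the two distributions
    theta = (nd + alpha) / (ndsum[:, None] + num_topics * alpha)
    phi = (nw.T + beta) / (nwsum[:, None] + vocab_size * beta)
    return theta, phi, Z

docs = [[0, 1, 2, 1], [2, 3], [0, 3, 3]]
theta, phi, Z = gibbs_lda(docs, vocab_size=4, num_topics=2,
                          alpha=0.5, beta=0.1, iters=20)
```

Each row of `theta` (one per document) and each row of `phi` (one per topic) sums to 1. The document-side denominator is the same for every topic, so it does not change which topic is drawn; it is kept only to mirror the snippet above.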