Data Mining: The Apriori Algorithm Explained, with a Python Implementation

2019-11-25 18:06:37
Source: reprinted
Contributor: community submission

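Association rule mining is one of the most active research areas in data mining. It is used to discover relationships between events, and was originally developed to find relationships between different products in supermarket transaction databases (the classic example is beer and diapers).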

Basic concepts

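1. Definition of support: support(X-->Y) = |X∩Y| / N = (number of records in which all items of X and Y appear together) / (total number of records). Here |X∩Y| denotes the number of records that contain both X and Y, not the size of a set intersection. For example: support({beer}-->{diapers}) = (number of records containing both beer and diapers) / (number of records) = 3/5 = 60%.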
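A minimal Python 3 sketch of the support calculation follows. The five transactions and the `support` helper are assumptions made up for illustration; they are chosen only so that the result matches the 3/5 figure above.

```python
# Hypothetical transactions: beer and diapers co-occur in 3 of the 5 records.
transactions = [
    {"milk", "bread", "diapers"},
    {"milk", "bread", "diapers", "beer"},
    {"milk", "bread", "diapers", "beer"},
    {"diapers", "beer", "cola"},
    {"milk", "bread"},
]


def support(antecedent, consequent, records):
    """Fraction of records that contain every item of both sets."""
    items = set(antecedent) | set(consequent)
    hits = sum(1 for record in records if items <= record)
    return hits / len(records)


beer_diapers_support = support({"beer"}, {"diapers"}, transactions)
print(beer_diapers_support)  # 0.6
```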

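2. Definition of confidence: confidence(X-->Y) = |X∩Y| / |X| = (number of records in which all items of X and Y appear together) / (number of records containing X). For example: confidence({beer}-->{diapers}) = (number of records containing both beer and diapers) / (number of records containing beer) = 3/3 = 100%; confidence({diapers}-->{beer}) = (number of records containing both beer and diapers) / (number of records containing diapers) = 3/4 = 75%.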
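The asymmetry of confidence can be checked with a short Python 3 sketch. The transactions and the `count` and `confidence` helpers are hypothetical, chosen so that beer occurs 3 times, diapers 4 times, and both together 3 times.

```python
# Hypothetical transactions: beer x3, diapers x4, both together x3.
transactions = [
    {"milk", "bread", "diapers"},
    {"milk", "bread", "diapers", "beer"},
    {"milk", "bread", "diapers", "beer"},
    {"diapers", "beer", "cola"},
    {"milk", "bread"},
]


def count(items, records):
    """Number of records that contain every item in `items`."""
    items = set(items)
    return sum(1 for record in records if items <= record)


def confidence(antecedent, consequent, records):
    """count(X and Y together) / count(X); assumes X occurs at least once."""
    both = count(set(antecedent) | set(consequent), records)
    return both / count(antecedent, records)


beer_to_diapers = confidence({"beer"}, {"diapers"}, transactions)
diapers_to_beer = confidence({"diapers"}, {"beer"}, transactions)
print(beer_to_diapers, diapers_to_beer)  # 1.0 0.75
```

The two directions share the same numerator and differ only in the denominator, which is why a rule and its reverse can have very different confidence.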

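A rule that satisfies both the minimum support threshold (min_sup) and the minimum confidence threshold (min_conf) is called a strong rule. An itemset that satisfies the minimum support is called a frequent itemset.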

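"How are association rules mined from a large database?" Mining association rules is a two-step process:

1. Find all frequent itemsets: by definition, each of these itemsets occurs at least as often as the predefined minimum support count.
2. Generate strong association rules from the frequent itemsets: by definition, these rules must satisfy both the minimum support and the minimum confidence.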
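The two steps can be sketched in Python 3 as follows. This is a deliberately naive, brute-force illustration that enumerates every possible itemset, not the Apriori algorithm itself; the transactions, the function names, and the thresholds (a support count of 3 and a confidence of 0.8) are all assumptions.

```python
from itertools import combinations

# Hypothetical transactions, for illustration only.
transactions = [
    {"milk", "bread", "diapers"},
    {"milk", "bread", "diapers", "beer"},
    {"milk", "bread", "diapers", "beer"},
    {"diapers", "beer", "cola"},
    {"milk", "bread"},
]


def count(items, records):
    """Number of records that contain every item in `items`."""
    return sum(1 for record in records if items <= record)


def frequent_itemsets(records, min_count):
    """Step 1 (brute force): every itemset whose count reaches min_count."""
    all_items = sorted(set().union(*records))
    found = {}
    for size in range(1, len(all_items) + 1):
        for combo in combinations(all_items, size):
            itemset = frozenset(combo)
            n = count(itemset, records)
            if n >= min_count:
                found[itemset] = n
    return found


def strong_rules(found, min_conf):
    """Step 2: split each frequent itemset into antecedent -> consequent."""
    rules = []
    for itemset, n in found.items():
        for size in range(1, len(itemset)):
            for combo in combinations(sorted(itemset), size):
                antecedent = frozenset(combo)
                # A subset occurs at least as often as its superset,
                # so the antecedent is always present in `found`.
                conf = n / found[antecedent]
                if conf >= min_conf:
                    rules.append((antecedent, itemset - antecedent, conf))
    return rules


found = frequent_itemsets(transactions, min_count=3)
rules = strong_rules(found, min_conf=0.8)
for antecedent, consequent, conf in rules:
    print(sorted(antecedent), "->", sorted(consequent), conf)
```

Step 1 here examines every subset of the item universe, which grows exponentially with the number of items; that cost is what Apriori is designed to avoid.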

The Apriori laws

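To reduce the time spent generating frequent itemsets, we should eliminate as early as possible any sets that cannot be frequent itemsets. The two Apriori laws do exactly that.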

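Apriori law 1: if a set is a frequent itemset, then all of its subsets are frequent itemsets. Example: suppose the set {A,B} is a frequent itemset, that is, the number of records in which A and B appear together is greater than or equal to the minimum support min_support. Then its subsets {A} and {B} must each appear at least min_support times, so they are frequent itemsets too.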

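Apriori law 2: if a set is not a frequent itemset, then none of its supersets is a frequent itemset. Example: suppose the set {A} is not a frequent itemset, that is, A appears fewer than min_support times. Then any superset of it, such as {A,B}, must also appear fewer than min_support times, so no superset can be a frequent itemset.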
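Both laws follow from one fact: a superset can never occur in more records than any of its subsets. The Python 3 sketch below checks this exhaustively on a small made-up data set; the transactions and the `count` helper are assumptions for illustration.

```python
from itertools import combinations

# Hypothetical transactions, for illustration only.
transactions = [
    {"milk", "bread", "diapers"},
    {"milk", "bread", "diapers", "beer"},
    {"milk", "bread", "diapers", "beer"},
    {"diapers", "beer", "cola"},
    {"milk", "bread"},
]


def count(items, records):
    """Number of records that contain every item in `items`."""
    return sum(1 for record in records if items <= record)


all_items = sorted(set().union(*transactions))
violations = 0
for size in range(2, len(all_items) + 1):
    for combo in combinations(all_items, size):
        superset_count = count(frozenset(combo), transactions)
        # Compare against every subset obtained by dropping one item.
        for sub in combinations(combo, size - 1):
            if superset_count > count(frozenset(sub), transactions):
                violations += 1

print(violations)  # 0: no superset is more frequent than one of its subsets
```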

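The figure above illustrates the Apriori process. Note that when the level-3 candidate itemsets are generated from the level-2 frequent itemsets, {milk, bread, beer} is absent, because {bread, beer} is not a level-2 frequent itemset; this is the Apriori property at work. Once the level-3 frequent itemsets have been generated, there are no higher-level candidates, so the algorithm terminates, and {milk, bread, diapers} is the largest frequent itemset.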
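The level-by-level process can be sketched in Python 3 as below. This is an independent illustration, not the implementation listed later in this article. The transactions are made up (they are not the data behind the figure) and are chosen so that, with an assumed minimum support count of 3, the run also ends with {milk, bread, diapers}.

```python
from itertools import combinations

# Hypothetical transactions, for illustration only.
transactions = [
    {"milk", "bread", "diapers"},
    {"milk", "bread", "diapers", "beer"},
    {"milk", "bread", "diapers", "beer"},
    {"diapers", "beer", "cola"},
    {"milk", "bread"},
]


def apriori(records, min_count):
    """Return a list of levels; level k maps each frequent k-itemset to its count."""
    items = sorted(set().union(*records))
    candidates = [frozenset([item]) for item in items]
    levels = []
    k = 1
    while candidates:
        counts = {c: sum(1 for r in records if c <= r) for c in candidates}
        frequent = {c: n for c, n in counts.items() if n >= min_count}
        if not frequent:
            break
        levels.append(frequent)
        # Join step: unions of two frequent k-itemsets that have k + 1 items.
        joined = {a | b for a in frequent for b in frequent if len(a | b) == k + 1}
        # Prune step (law 2): drop candidates that have an infrequent k-subset.
        candidates = [c for c in joined
                      if all(frozenset(s) in frequent for s in combinations(c, k))]
        k += 1
    return levels


levels = apriori(transactions, min_count=3)
for size, frequent in enumerate(levels, start=1):
    print(size, sorted(sorted(itemset) for itemset in frequent))
```

In this run {bread, beer} and {milk, beer} each occur only twice, so every level-3 candidate containing beer is pruned before it is ever counted against the data.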

Python implementation:

Source: taizilongxu/datamining (branch: master), datamining/apriori/apriori.py
#-*- encoding: UTF-8 -*-
#---------------------------------import------------------------------------
#---------------------------------------------------------------------------
class Apriori(object):

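    def __init__(self, filename, min_support, item_start, item_end):
        self.filename = filename
        self.min_support = min_support # minimum support, in percent
        self.min_confidence = 50 # minimum confidence, in percent (hard-coded)
        self.line_num = 0 # number of data rows (header row excluded)
        self.item_start = item_start # first item column to use (1-based)
        self.item_end = item_end # last item column to use (inclusive)

        # Start from the 1-itemsets: one column index per selected column
        self.location = [[i] for i in range(self.item_end - self.item_start + 1)]
        self.support = self.sut(self.location)
        self.num = list(sorted(set([j for i in self.location for j in i])))# item indices still in play

        self.pre_support = [] # support, location and num of the previous level
        self.pre_location = []
        self.pre_num = []

        self.item_name = [] # item names, read from the header row
        self.find_item_name()
        self.loop()
        self.confidence_sup()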

    def deal_line(self, line):
        "Extract the required item columns from a line."
        return [i.strip() for i in line.split(' ') if i][self.item_start - 1:self.item_end]

    def find_item_name(self):
        "Read item_name from the first (header) line."
        with open(self.filename, 'r') as F:
            self.item_name = self.deal_line(F.readline())

    def sut(self, location):
        """
        Input: a list of column-index sets, e.g. [[1,2,3],[2,3,4],[1,3,5]...]
        Output: the support count of each set, e.g. [123,435,234...]
        """
        with open(self.filename, 'r') as F:
            support = [0] * len(location)
            for index,line in enumerate(F.readlines()):
                if index == 0: continue # skip the header row
                # Extract the item columns of this record
                item_line = self.deal_line(line)
                for index_num,i in enumerate(location):
                    # Count the record only if every item in the set is marked 'T'
                    if all(item_line[j] == 'T' for j in i):
                        support[index_num] += 1
            self.line_num = index # total number of rows, excluding the item_name header row
        return support

    def select(self, c):
        "Return the candidate locations (itemsets) of size c."
        stack = []
        for i in self.location:
            for j in self.num:
                if j in i:
                    if len(i) == c:
                        stack.append(i)
                else:
                    stack.append([j] + i)
        # Deduplicate the nested list
        import itertools
        s = sorted([sorted(i) for i in stack])
        location = list(s for s,_ in itertools.groupby(s))
        return location

    def del_location(self, support, location):
        "Remove the candidates that do not meet the conditions."
        # Drop candidates below the minimum support (min_support is a percentage)
        for index,i in enumerate(support):
            if i < self.line_num * self.min_support / 100.0:
                support[index] = 0
        # Second Apriori law: drop candidates that have an infrequent subset
        for index,j in enumerate(location):
            sub_location = [j[:index_loc] + j[index_loc+1:]for index_loc in range(len(j))]
            flag = 0
            for k in sub_location:
                if k not in self.location:
                    flag = 1
                    break
            if flag:
                support[index] = 0
        # Remove the positions that were zeroed out
        location = [i for i,j in zip(location,support) if j != 0]
        support = [i for i in support if i != 0]
        return support, location

    def loop(self):
        "Iterate over the frequent itemsets level by level."
        s = 2
        while True:
            print('-'*80)
            print('The %d loop' % (s - 1))
            print('location %s' % (self.location,))
            print('support %s' % (self.support,))
            print('num %s' % (self.num,))
            print('-'*80)

            # Generate the next level of candidates
            location = self.select(s)
            support = self.sut(location)
            support, location = self.del_location(support, location)
            num = list(sorted(set([j for i in location for j in i])))
            s += 1
            if  location and support and num:
                self.pre_num = self.num
                self.pre_location = self.location
                self.pre_support = self.support

                self.num = num
                self.location = location
                self.support = support
            else:
                break

    def confidence_sup(self):
        "Compute the confidence of each rule and print the strong rules."
        if sum(self.pre_support) == 0:
            print('min_support error') # the very first iteration already failed
        else:
            for index_location,each_location in enumerate(self.location):
                del_num = [each_location[:index] + each_location[index+1:] for index in range(len(each_location))] # subsets one level down
                del_num = [i for i in del_num if i in self.pre_location] # drop subsets that were not frequent at the previous level
                del_support = [self.pre_support[self.pre_location.index(i)] for i in del_num if i in self.pre_location] # look up their support counts from the previous level
                for index,i in enumerate(del_num): # support and confidence of each association rule
                    support =  float(self.support[index_location])/self.line_num * 100 # support, in percent
                    s = [j for index_item,j in enumerate(self.item_name) if index_item in i]
                    if del_support[index]:
                        confidence = float(self.support[index_location])/del_support[index] * 100
                        if confidence > self.min_confidence:
                            # The labels read min_support/min_confidence, but the values are the rule's own support and confidence
                            print(' '.join([','.join(s), '->>', self.item_name[each_location[index]], ' min_support: ', str(support) + '%', ' min_confidence:', str(confidence) + '%']))

def main():
    c = Apriori('basket.txt', 14, 3, 13)
    d = Apriori('simple.txt', 50, 2, 6)

if __name__ == '__main__':
    main()
############################################################################

The Apriori algorithm

Apriori(filename, min_support, item_start, item_end)

Parameters

filename: file name (with path)
min_support: minimum support (in percent)
item_start: position of the first item column
item_end: position of the last item column
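The input file format is not documented here. Judging from deal_line and sut in the code above, the first line holds space-separated column names and every later line marks an item as present with 'T'. The Python 3 sketch below writes a small file in that inferred layout and slices its columns the same way; the file name, the id column, the use of 'F' for absent items, and the `read_columns` helper are all assumptions.

```python
import os
import tempfile

# Hypothetical file contents: a header row, then one row per transaction.
# Column 1 is a record id; columns 2-5 hold 'T' (present) or 'F' (absent).
SAMPLE = """id milk bread diapers beer
1 T T T F
2 T T T T
3 T T T T
4 F F T T
5 T T F F
"""


def read_columns(path, item_start, item_end):
    """Mimic deal_line: split on spaces, keep columns item_start..item_end (1-based)."""
    with open(path) as f:
        rows = [[cell.strip() for cell in line.split(' ') if cell][item_start - 1:item_end]
                for line in f]
    return rows[0], rows[1:]


path = os.path.join(tempfile.mkdtemp(), "simple_demo.txt")
with open(path, "w") as f:
    f.write(SAMPLE)

names, records = read_columns(path, 2, 5)
beer_count = sum(1 for r in records if r[names.index("beer")] == "T")
print(names, beer_count)
```

With this layout, item_start=2 and item_end=5 skip the id column and select the four item columns, which is how a call such as Apriori('simple.txt', 50, 2, 6) would pick columns 2 through 6.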

Usage example:

import apriori
c = apriori.Apriori('basket.txt', 11, 3, 13)

Output:

--------------------------------------------------------------------------------
The 1 loop
location [[0], [1], [2], [3], [4], [5], [6], [7], [8], [9], [10]]
support [299, 183, 177, 303, 204, 302, 293, 287, 184, 292, 276]
num [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
--------------------------------------------------------------------------------
--------------------------------------------------------------------------------
The 2 loop
location [[0, 9], [3, 5], [3, 6], [5, 6], [7, 10]]
support [145, 173, 167, 170, 144]
num [0, 3, 5, 6, 7, 9, 10]
--------------------------------------------------------------------------------
--------------------------------------------------------------------------------
The 3 loop
location [[3, 5, 6]]
support [146]
num [3, 5, 6]
--------------------------------------------------------------------------------
frozenmeal,beer ->> cannedveg  min_support:  14.6%  min_confidence: 0.858823529412
cannedveg,beer ->> frozenmeal  min_support:  14.6%  min_confidence: 0.874251497006
cannedveg,frozenmeal ->> beer  min_support:  14.6%  min_confidence: 0.843930635838
--------------------------------------------------------------------------------
