Machine Learning in Action with Python: Decision Trees

2020-01-04 16:26:34

Contributed by: a site user

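How a decision tree works: find the most decisive feature in the data set and split the data set on it, repeating the process until all the data under a branch belongs to the same class or every feature available for splitting has been used. At that point the algorithm stops.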

At every split there are many candidate features, so how do we choose which one to split on? To answer that we need the concepts of information gain and information entropy.

1. Information Gain

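The principle behind splitting a data set is to make disordered data more ordered. The change in information before and after a split is called the information gain. Once we know how to compute it, we can compute the gain obtained by splitting on each feature, and the feature with the highest gain is the best choice. First, the definition of information: the information of a symbol xi is l(xi) = -log2 p(xi), where p(xi) is the probability of choosing that class. The entropy of the information source is then H = -∑ p(xi)·log2 p(xi), summed over all classes. Based on this formula, the following code computes the Shannon entropy: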

    from math import log

    def calcShannonEnt(dataSet):
        # Count how many rows carry each class label (the last column)
        NumEntries = len(dataSet)
        labelsCount = {}
        for i in dataSet:
            currentlabel = i[-1]
            if currentlabel not in labelsCount.keys():
                labelsCount[currentlabel] = 0
            labelsCount[currentlabel] += 1
        # H = -sum(p * log2(p)) over all class labels
        ShannonEnt = 0.0
        for key in labelsCount:
            prob = labelsCount[key] / NumEntries
            ShannonEnt -= prob * log(prob, 2)
        return ShannonEnt

The function above relies on the log function, which must be imported first with from math import log. We can test it with a simple example:

    def createdataSet():
        # dataSet = [['1','1','yes'],['1','0','no'],['0','1','no'],['0','0','no']]
        dataSet = [[1, 1, 'yes'], [1, 0, 'no'], [0, 1, 'no'], [0, 0, 'no']]
        labels = ['no surfacing', 'flippers']
        return dataSet, labels

[Image: python, decision tree]

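The entropy here is 0.811. When we add more classes to the data, the entropy increases. The modified data set below has three classes, 'yes', 'no' and 'maybe'. In other words, the more mixed the data, the larger the entropy.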

[Image: python, decision tree]
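To make the effect of an extra class concrete, here is a minimal sketch that computes the entropy of the class column alone. The helper name shannon_entropy and the choice of which row is relabeled 'maybe' are assumptions for illustration; the article does not say which row was changed.

```python
from collections import Counter
from math import log2


def shannon_entropy(class_labels):
    """Return the Shannon entropy (in bits) of a list of class labels."""
    total = len(class_labels)
    counts = Counter(class_labels)
    return -sum((n / total) * log2(n / total) for n in counts.values())


# Class column of the sample data set: one 'yes' and three 'no'
two_classes = ['yes', 'no', 'no', 'no']
# Hypothetical relabeling: one of the 'no' rows becomes 'maybe'
three_classes = ['yes', 'maybe', 'no', 'no']

print(round(shannon_entropy(two_classes), 3))    # 0.811
print(round(shannon_entropy(three_classes), 3))  # 1.5
```

With two classes the entropy is 0.811 bits; with three classes under this relabeling it rises to 1.5 bits.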

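Besides computing the entropy, a classification algorithm also needs to split the data set. In the decision tree algorithm we compute the entropy of the subsets produced by splitting on each feature, and then decide which feature gives the best split.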

    def splitDataSet(dataSet, axis, value):
        retDataSet = []
        for featVec in dataSet:
            if featVec[axis] == value:
                # Keep the row, but drop the column that was used for the split
                reducedfeatVec = featVec[:axis]
                reducedfeatVec.extend(featVec[axis+1:])
                retDataSet.append(reducedfeatVec)
        return retDataSet

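axis is the index of the feature to split on, and value is the feature value a row must have to be returned. Note the difference between the extend method and the append method here. An example illustrates the difference: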

[Image: python, decision tree]
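The difference can be reproduced with two short lists. This is a small sketch with made-up values, not the article's original screenshot:

```python
a = [1, 2, 3]
b = [4, 5, 6]

appended = a[:]       # copy, so that a itself is left unchanged
appended.append(b)    # b is added as ONE element (a nested list)

extended = a[:]
extended.extend(b)    # each element of b is added individually

print(appended)   # [1, 2, 3, [4, 5, 6]]
print(extended)   # [1, 2, 3, 4, 5, 6]
```

splitDataSet uses extend to stitch the two halves of a row back into one flat row, and append to add that whole row to the result as a single element.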

Now let us test the function that splits the data set:

[Image: python, decision tree]

With axis=0 and value=1, the rows of myDat are selected according to whether feature 0 equals 1.
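The same split can be sketched in a self-contained form. The helper split_rows is a hypothetical, list-comprehension variant of splitDataSet, and myDat is assumed to be the sample data set returned by createdataSet:

```python
def split_rows(rows, axis, value):
    """Keep rows whose column `axis` equals `value`, dropping that column."""
    return [row[:axis] + row[axis + 1:] for row in rows if row[axis] == value]


myDat = [[1, 1, 'yes'], [1, 0, 'no'], [0, 1, 'no'], [0, 0, 'no']]

print(split_rows(myDat, 0, 1))   # [[1, 'yes'], [0, 'no']]
print(split_rows(myDat, 0, 0))   # [[1, 'no'], [0, 'no']]
```

Each returned row is one column shorter than the input, because the feature used for the split carries no further information within that subset.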

Next we iterate over every feature of the data set, compute the Shannon entropy of each resulting split, and find the best feature to split on:

    def choosebestfeatureToSplit(dataSet):
        # The last column is the class label, so it is not a candidate feature
        Numfeatures = len(dataSet[0]) - 1
        BaseShannonEnt = calcShannonEnt(dataSet)
        bestInfoGain = 0.0
        bestfeature = -1
        for i in range(Numfeatures):
            featlist = [example[i] for example in dataSet]
            featSet = set(featlist)
            # Weighted entropy of the subsets produced by splitting on feature i
            newEntropy = 0.0
            for value in featSet:
                subDataSet = splitDataSet(dataSet, i, value)
                prob = len(subDataSet) / len(dataSet)
                newEntropy += prob * calcShannonEnt(subDataSet)
            infoGain = BaseShannonEnt - newEntropy
            if infoGain > bestInfoGain:
                bestInfoGain = infoGain
                bestfeature = i
        return bestfeature

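Information gain is the reduction in entropy, that is, the reduction in the disorder of the data. Finally, the information gains of all features are compared and the index of the best feature to split on is returned. The test result of the function is: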

[Image: python, decision tree]
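As a worked check of the definition, the sketch below computes the information gain of each feature on the sample data set. The helper names entropy and information_gain are assumptions for illustration and are not part of the article's code:

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def information_gain(rows, axis):
    """Entropy of the class column minus the weighted entropy after splitting on `axis`."""
    base = entropy([row[-1] for row in rows])
    remainder = 0.0
    for value in {row[axis] for row in rows}:
        subset = [row[-1] for row in rows if row[axis] == value]
        remainder += len(subset) / len(rows) * entropy(subset)
    return base - remainder


rows = [[1, 1, 'yes'], [1, 0, 'no'], [0, 1, 'no'], [0, 0, 'no']]
for axis, name in enumerate(['no surfacing', 'flippers']):
    print(name, round(information_gain(rows, axis), 3))
```

On this data set both features give the same gain of about 0.311 (0.811 before the split, 0.5 after). A selection loop that only accepts a strictly larger gain therefore keeps the first feature, index 0.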

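Next we build the decision tree recursively. Before building, we count the number of columns to check whether the algorithm has already used all the attributes. If it has and the class labels in a branch are still mixed, the class is decided by majority vote. The voting function uses the same approach as classify0 in Chapter 2.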

    import operator

    def majorityCnt(classlist):
        # Return the class label that occurs most often
        ClassCount = {}
        for vote in classlist:
            if vote not in ClassCount.keys():
                ClassCount[vote] = 0
            ClassCount[vote] += 1
        sortedClassCount = sorted(ClassCount.items(), key=operator.itemgetter(1), reverse=True)
        return sortedClassCount[0][0]

    def createTrees(dataSet, labels):
        classList = [example[-1] for example in dataSet]
        # Stop when every row in this branch has the same class
        if classList.count(classList[0]) == len(classList):
            return classList[0]
        # Stop when only the class column is left: fall back to a majority vote
        if len(dataSet[0]) == 1:
            return majorityCnt(classList)
        bestfeature = choosebestfeatureToSplit(dataSet)
        bestfeatureLabel = labels[bestfeature]
        myTree = {bestfeatureLabel: {}}
        # Build a new list instead of deleting in place, so the caller's labels are unchanged
        remainingLabels = labels[:bestfeature] + labels[bestfeature+1:]
        featValue = [example[bestfeature] for example in dataSet]
        uniqueValue = set(featValue)
        for value in uniqueValue:
            subLabels = remainingLabels[:]
            myTree[bestfeatureLabel][value] = createTrees(splitDataSet(dataSet, bestfeature, value), subLabels)
        return myTree

The resulting decision tree is as follows:

[Image: python, decision tree]
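The tree is a nested dictionary: each key of the outer dictionary is a feature name, and each value below it is either a class label (a leaf) or another dictionary (a further decision). As a sketch of how such a structure could be used, the hypothetical classify function below walks a tree until it reaches a leaf. The function is not part of the article's code, and the tree literal is the structure that the splitting rules described above should yield for the four-row sample data set:

```python
def classify(tree, feature_names, sample):
    """Walk a nested-dict decision tree until a leaf (class label) is reached."""
    feature = next(iter(tree))              # the single key is the feature tested here
    branches = tree[feature]
    subtree = branches[sample[feature_names.index(feature)]]
    if isinstance(subtree, dict):
        return classify(subtree, feature_names, sample)
    return subtree


# Assumed tree for the sample data set
tree = {'no surfacing': {0: 'no', 1: {'flippers': {0: 'no', 1: 'yes'}}}}
names = ['no surfacing', 'flippers']

print(classify(tree, names, [1, 1]))   # yes
print(classify(tree, names, [1, 0]))   # no
print(classify(tree, names, [0, 1]))   # no
```

A sample whose feature value never appeared in the training data raises a KeyError in this sketch; handling that case is left out for brevity.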

The result above is not very intuitive to read, so next we draw the tree using matplotlib annotations. matplotlib provides an annotation tool, annotations, that can add text notes to a plot. Let us first try out this tool.

    import matplotlib.pyplot as plt

    decisionNode = dict(boxstyle='sawtooth', fc='0.8')
    leafNode = dict(boxstyle='sawtooth', fc='0.8')
    arrow_args = dict(arrowstyle='<-')

    def plotNode(nodeTxt, centerPt, parentPt, nodeType):
        # Draw nodeTxt in a box at centerPt, connected by an arrow to parentPt
        createPlot.ax1.annotate(nodeTxt, xy=parentPt, xycoords='axes fraction',
                                xytext=centerPt, textcoords='axes fraction',
                                va='center', ha='center', bbox=nodeType,
                                arrowprops=arrow_args)

    def createPlot():
        fig = plt.figure(1, facecolor='white')
        fig.clf()
        # Store the axes on the function object so plotNode can reach it
        createPlot.ax1 = plt.subplot(111, frameon=False)
        plotNode('test1', (0.5, 0.1), (0.1, 0.5), decisionNode)
        plotNode('test2', (0.8, 0.1), (0.3, 0.8), leafNode)
        plt.show()

[Image: python, decision tree]

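Having tested this small example, we can start building the annotated tree. Although we have x and y coordinates, deciding where to place the tree nodes is awkward, so we need to know how many leaf nodes there are and how many levels deep the tree is. The two functions below return the number of leaf nodes and the depth of the tree. They share the same structure: starting from the first key, each function iterates over all child nodes and uses type() to check whether a child is a dictionary. If it is, the child is a decision node, and getNumleafs() is called recursively so that the whole tree is traversed and the number of leaves is returned. The second function, getTreeDepth(), counts the decision nodes encountered along the way. Its termination condition is a leaf node: once a leaf is reached, the function returns from the recursive call and the depth counter is incremented by one.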

    def getNumleafs(myTree):
        numLeafs = 0
        key_sorted = sorted(myTree.keys())
        firstStr = key_sorted[0]
        secondDict = myTree[firstStr]
        for key in secondDict.keys():
            # A dict child is a decision node; anything else is a leaf
            if type(secondDict[key]).__name__ == 'dict':
                numLeafs += getNumleafs(secondDict[key])
            else:
                numLeafs += 1
        return numLeafs

    def getTreeDepth(myTree):
        maxdepth = 0
        key_sorted = sorted(myTree.keys())
        firstStr = key_sorted[0]
        secondDict = myTree[firstStr]
        for key in secondDict.keys():
            if type(secondDict[key]).__name__ == 'dict':
                thedepth = 1 + getTreeDepth(secondDict[key])
            else:
                thedepth = 1
            # Keep the depth of the deepest branch
            if thedepth > maxdepth:
                maxdepth = thedepth
        return maxdepth

The test results are as follows:

[Image: python, decision tree]
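The two traversals can also be checked by hand on a small tree. The sketch below is a compact variant that uses isinstance instead of type(); the function names count_leaves and tree_depth and the tree literal (assumed to be the tree for the sample data set) are illustrative:

```python
def count_leaves(tree):
    """Count leaf nodes: every non-dict value below a feature key is a leaf."""
    branches = next(iter(tree.values()))
    return sum(count_leaves(child) if isinstance(child, dict) else 1
               for child in branches.values())


def tree_depth(tree):
    """Count the decision nodes on the longest path from the root to a leaf."""
    branches = next(iter(tree.values()))
    return 1 + max((tree_depth(child) for child in branches.values()
                    if isinstance(child, dict)), default=0)


tree = {'no surfacing': {0: 'no', 1: {'flippers': {0: 'no', 1: 'yes'}}}}
print(count_leaves(tree))   # 3
print(tree_depth(tree))     # 2
```

The root contributes one leaf directly ('no') and two more through the 'flippers' subtree, and the longest path passes through two decision nodes.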

We first show the final decision tree plot to verify that the results above are correct:

[Image: python, decision tree]

The tree is indeed two levels deep and has 3 leaf nodes. Next come the key functions for drawing the decision tree; they produce the tree shown in the figure above.

    def plotMidText(cntrPt, parentPt, txtString):
        # Write the branch value at the midpoint between child and parent
        xMid = (parentPt[0] - cntrPt[0]) / 2.0 + cntrPt[0]
        yMid = (parentPt[1] - cntrPt[1]) / 2.0 + cntrPt[1]
        createPlot.ax1.text(xMid, yMid, txtString)

    def plotTree(myTree, parentPt, nodeTxt):
        numLeafs = getNumleafs(myTree)
        key_sorted = sorted(myTree.keys())
        firstStr = key_sorted[0]
        # Centre this decision node above the leaves it spans
        cntrPt = (plotTree.xOff + (1.0 + float(numLeafs)) / 2.0 / plotTree.totalW, plotTree.yOff)
        plotMidText(cntrPt, parentPt, nodeTxt)
        plotNode(firstStr, cntrPt, parentPt, decisionNode)
        secondDict = myTree[firstStr]
        plotTree.yOff -= 1.0 / plotTree.totalD   # move down one level
        for key in secondDict.keys():
            if type(secondDict[key]).__name__ == 'dict':
                plotTree(secondDict[key], cntrPt, str(key))
            else:
                plotTree.xOff += 1.0 / plotTree.totalW
                plotNode(secondDict[key], (plotTree.xOff, plotTree.yOff), cntrPt, leafNode)
                plotMidText((plotTree.xOff, plotTree.yOff), cntrPt, str(key))
        plotTree.yOff += 1.0 / plotTree.totalD   # move back up before returning

    def createPlot(inTree):
        fig = plt.figure(1, facecolor='white')
        fig.clf()
        axprops = dict(xticks=[], yticks=[])
        createPlot.ax1 = plt.subplot(111, frameon=False, **axprops)
        # Total width and depth of the tree determine the spacing of the nodes
        plotTree.totalW = float(getNumleafs(inTree))
        plotTree.totalD = float(getTreeDepth(inTree))
        plotTree.xOff = -0.5 / plotTree.totalW
        plotTree.yOff = 1.0
        plotTree(inTree, (0.5, 1.0), '')
        plt.show()
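The placement rule in plotTree can be followed without opening a plot window. The sketch below is a hypothetical, matplotlib-free helper named layout. It applies the same spacing rule (leaves are spread evenly along x, and each decision node is centred above the leaves it spans) and returns the coordinates instead of drawing them. The tree literal is assumed to be the tree for the sample data set.

```python
def layout(tree):
    """Return (label, x, y) for every node, using the same spacing rule as plotTree."""
    def leaves(t):
        return sum(leaves(c) if isinstance(c, dict) else 1
                   for c in next(iter(t.values())).values())

    def depth(t):
        return 1 + max((depth(c) for c in next(iter(t.values())).values()
                        if isinstance(c, dict)), default=0)

    total_w, total_d = float(leaves(tree)), float(depth(tree))
    state = {'x': -0.5 / total_w}   # x position of the most recently placed leaf
    nodes = []

    def place(t, y):
        label = next(iter(t))
        # Centre a decision node above the leaves it spans
        x = state['x'] + (1.0 + leaves(t)) / 2.0 / total_w
        nodes.append((label, x, y))
        for child in t[label].values():
            if isinstance(child, dict):
                place(child, y - 1.0 / total_d)
            else:
                state['x'] += 1.0 / total_w
                nodes.append((child, state['x'], y - 1.0 / total_d))

    place(tree, 1.0)
    return nodes


tree = {'no surfacing': {0: 'no', 1: {'flippers': {0: 'no', 1: 'yes'}}}}
for label, x, y in layout(tree):
    print(label, round(x, 3), round(y, 3))
```

For this tree the root lands at x = 0.5, y = 1.0, the three leaves at x = 1/6, 1/2 and 5/6, and the 'flippers' node at x = 2/3, midway between its two leaves.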

That is all for this article. I hope it helps with your study.
