Nearest Neighbor and k-Nearest Neighbors
In the picture below there are three kinds of beans, plus three beans of unknown kind. How do we decide which kind each unknown bean belongs to?

Here is one intuitive idea: whichever kind of bean an unknown bean lies closest to, assume the unknown bean is of that kind. This leads to the definition of the nearest neighbor algorithm: to classify an unknown sample, use all training samples as reference points, compute the distance from the unknown sample to every training sample, and take the class of the single nearest neighbor as the sole basis for deciding the unknown sample's class. The nearest neighbor rule has an obvious flaw, however, as the following example shows: given a point of unknown shape (the green dot in the figure), how do we decide what shape it is?
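The nearest-neighbor rule described above can be sketched in a few lines of Python. The sample points and bean names here are made up for illustration; they are not from the article's figure:

```python
import numpy as np

# hypothetical training points (2-D features) and their classes
train = np.array([[0.0, 0.0], [0.2, 0.1], [3.0, 3.1], [2.9, 3.3]])
labels = ['soybean', 'soybean', 'mung bean', 'mung bean']

def nearest_neighbor(x, train, labels):
    # Euclidean distance from x to every training sample
    dists = np.sqrt(((train - x) ** 2).sum(axis=1))
    # the single closest sample decides the class
    return labels[np.argmin(dists)]

print(nearest_neighbor(np.array([0.1, 0.2]), train, labels))  # -> soybean
```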

Clearly, the nearest neighbor algorithm's weakness is that it is overly sensitive to noisy data. To address this, we can bring multiple nearby samples into the computation, enlarging the set of samples that participate in the decision so that no single data point can determine the result by itself. This gives us the K-nearest neighbors algorithm, an extension of the nearest neighbor algorithm. The basic idea: select a fixed number K of samples within some range of the unknown sample; whichever class the majority of those K samples belong to, the unknown sample is assigned to that class.

How to choose the best value of K depends on the data. In general, a larger K reduces the influence of noise on classification, but it also blurs the boundaries between classes. In the figure, the test sample (the green circle) could be assigned either to the red-triangle class or to the blue-square class. If k is 3, its three neighbors fall inside the solid inner circle, and by majority vote it belongs to the red triangles. But if k is 5, its five nearest samples fall inside the dashed circle, and the vote now assigns it to the blue squares. In practice, K is first set to a fairly small value and then tuned step by step with cross-validation until the best K for the data is found.
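The way the vote can flip between k = 3 and k = 5 can be reproduced with a small sketch. The coordinates below are invented to mimic the figure's layout (two triangles close to the query, three squares farther out); they are not the article's actual data:

```python
import numpy as np
from collections import Counter

# made-up 2-D points: two red triangles near the query, three blue squares farther away
points = np.array([[1.0, 1.2], [1.3, 0.9],                 # triangles
                   [1.1, 2.0], [2.2, 1.0], [2.0, 2.0]])    # squares
labels = ['triangle', 'triangle', 'square', 'square', 'square']
query = np.array([1.1, 1.1])  # the green test point

def knn_vote(query, points, labels, k):
    # distances to all points, indices of the k nearest, then a majority vote
    dists = np.sqrt(((points - query) ** 2).sum(axis=1))
    nearest = np.argsort(dists)[:k]
    votes = Counter(labels[i] for i in nearest)
    return votes.most_common(1)[0][0]

print(knn_vote(query, points, labels, k=3))  # 2 triangles vs 1 square -> triangle
print(knn_vote(query, points, labels, k=5))  # 2 triangles vs 3 squares -> square
```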
Implementing the KNN algorithm
Basic steps of the algorithm:
1) Compute the distance between the point to classify and every point of known class
2) Sort the distances in ascending order
3) Take the k points closest to the point to classify
4) Count how often each class appears among those k points
5) Return the most frequent class among the k points as the predicted class for the point
Below is a simple Python implementation that follows these steps, predicting the class of an unknown point (the grey point in the figure) from 4 already-classified sample points:

import numpy as np

# create a dataset which contains 4 samples with 2 classes
def createDataSet():
    # create a matrix: each row is one sample
    group = np.array([[1.0, 0.9], [1.0, 1.0], [0.1, 0.2], [0.0, 0.1]])
    labels = ['A', 'A', 'B', 'B']  # four samples and two classes
    return group, labels

# classify using kNN (k Nearest Neighbors)
# Input:  newInput: 1 x N vector to classify
#         dataSet:  M x N (M samples, N features)
#         labels:   1 x M list of class labels
#         k:        number of neighbors to use for comparison
# Output: the most popular class label among the k neighbors
def kNNClassify(newInput, dataSet, labels, k):
    numSamples = dataSet.shape[0]  # shape[0] is the number of rows

    ## step 1: calculate the Euclidean distance
    # np.tile(A, reps): construct an array by repeating A reps times;
    # here it copies newInput into numSamples rows to match dataSet
    diff = np.tile(newInput, (numSamples, 1)) - dataSet  # element-wise subtraction
    squaredDiff = diff ** 2
    squaredDist = np.sum(squaredDiff, axis=1)  # sum over each row
    distance = squaredDist ** 0.5

    ## step 2: sort the distances
    # argsort() returns the indices that would sort the array in ascending order
    sortedDistIndices = np.argsort(distance)

    classCount = {}  # dictionary mapping label -> vote count
    for i in range(k):
        ## step 3: take the k samples with the smallest distances
        voteLabel = labels[sortedDistIndices[i]]

        ## step 4: count how often each label occurs
        # get() returns 0 when voteLabel is not yet in classCount
        classCount[voteLabel] = classCount.get(voteLabel, 0) + 1

    ## step 5: return the label with the most votes
    maxCount = 0
    maxIndex = None
    for key, value in classCount.items():
        if value > maxCount:
            maxCount = value
            maxIndex = key
    return maxIndex

if __name__ == "__main__":
    dataSet, labels = createDataSet()
    k = 3

    testX = np.array([1.2, 1.0])
    outputLabel = kNNClassify(testX, dataSet, labels, k)
    print("Your input is:", testX, "and classified to class:", outputLabel)

    testX = np.array([0.1, 0.3])
    outputLabel = kNNClassify(testX, dataSet, labels, k)
    print("Your input is:", testX, "and classified to class:", outputLabel)
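The article also mentions choosing K by starting small and adjusting it with cross-validation. A minimal leave-one-out sketch of that idea follows; the toy dataset and the scanned range of K values are hypothetical, not taken from the article:

```python
import numpy as np
from collections import Counter

def knn_predict(x, X, y, k):
    # majority vote among the k training points nearest to x
    dists = np.sqrt(((X - x) ** 2).sum(axis=1))
    nearest = np.argsort(dists)[:k]
    return Counter(y[i] for i in nearest).most_common(1)[0][0]

# hypothetical labelled points: two loose clusters
X = np.array([[0.0, 0.1], [0.1, 0.2], [0.2, 0.0],
              [0.9, 1.0], [1.0, 0.9], [1.1, 1.1]])
y = ['B', 'B', 'B', 'A', 'A', 'A']

def loocv_accuracy(X, y, k):
    # hold each sample out in turn, predict it from the rest, and average the hits
    hits = 0
    for i in range(len(X)):
        mask = np.arange(len(X)) != i
        y_rest = [y[j] for j in range(len(y)) if j != i]
        hits += knn_predict(X[i], X[mask], y_rest, k) == y[i]
    return hits / len(X)

# scan K from small to larger, keeping the first value with the best accuracy
best_k = max(range(1, 5), key=lambda k: loocv_accuracy(X, y, k))
print("best K:", best_k)
```

For real datasets the same loop would typically use held-out folds rather than leave-one-out, but the principle of picking K by validated accuracy is the same.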