Project | kNN | Classifying the IRIS Dataset

Introduction | The kNN Algorithm

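Statistical learning refers to a collection of mathematical and computational tools for understanding data. In what is usually called supervised learning, the goal is to estimate or predict an output from one or more inputs. The inputs go by many names, such as predictors, independent variables, features, or simply variables. The output is usually called the response variable or the dependent variable. If the response is quantitative (a number that measures weight or height, say), we call the problem a regression problem. If the response is qualitative (yes or no, blue or green), we call it a classification problem. This case study deals with one specific classification method. The goal is to build a classifier that, when presented with a new observation whose class is unknown, tries to assign it to a category, or class, based on observations whose true class it does know. This method is known as the k-nearest neighbors classifier, or kNN for short. Given a positive integer k, say 5, and a new data point, it first identifies the k points in the data that are nearest to the new point, and then classifies the new point as belonging to the most common class among those k neighbors.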

Goal: build our own k-nearest neighbors classifier and use it to classify the IRIS dataset from scikit-learn.

Distance between two points

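We will write a function that finds the distance between two given two-dimensional points in the xy-plane. We import numpy so that the coordinates can be stored in numpy arrays. Being able to compute the distance between two points is what later lets us find the nearest neighbors of an input point.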

import numpy as np
  
def distance(p1, p2):
    return np.sqrt(np.sum(np.power(p2-p1, 2))) #distance between two points
p1 = np.array([1, 1])   #coordinate x = 1, y = 1
p2 = np.array([4, 4])   #coordinate x = 4, y = 4
print(distance(p1, p2))   #sqrt(18), about 4.243
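As a quick cross-check, the hand-written formula can be compared with NumPy's built-in vector norm. This is only a sketch: the names d_manual and d_norm are introduced here for illustration and do not appear in the original code.

```python
import numpy as np

def distance(p1, p2):
    """Euclidean distance, written out as in the text above."""
    return np.sqrt(np.sum(np.power(p2 - p1, 2)))

p1 = np.array([1, 1])
p2 = np.array([4, 4])

d_manual = distance(p1, p2)
d_norm = np.linalg.norm(p2 - p1)   # NumPy's 2-norm of the difference vector

print(d_manual, d_norm)            # both are sqrt(18), about 4.243
```

Either form could be used in the rest of the project; the explicit version makes the formula visible, while np.linalg.norm is shorter.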

Majority vote counter

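We will use a numpy array to create a 3 x 3 grid of points, which serves as a set of scattered points in the plane. We will also write a function named majority_vote() that finds the most frequent value in a sequence of votes, for example (1, 2, 1, 1, 2, 3, 2, 2, 3, 1, 1, 2, 3, 3, 2, 3). This is simply the mode of the data, so it can also be computed with the scipy statistics module. A second function, majority_vote_short(), does the same job as majority_vote() but uses mode() from scipy.stats; the one difference is that majority_vote() picks one of the tied winners at random, which mode() does not. Both functions will be used later to predict the class of a point.
Our goal is to build a kNN classifier, so we need an algorithm that finds the nearest neighbors of a point within a given set of points. Suppose we want to insert a new point into the xy-plane, which already contains a set of existing points. We have to assign the new point to one of the classes of the existing points and insert it accordingly. To do this, we will write a function find_nearest_neighbours() that finds the nearest neighbors of a given point. It takes three arguments: (i) the point we want to insert, (ii) the set of existing points, and (iii) k, the number of neighbors to return. Finally, we will visualize the situation by plotting the points in the xy-plane with matplotlib.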

import numpy as np
import random
import scipy.stats as ss
import matplotlib.pyplot as plt
  
points = np.array([[1, 1], [1, 2], [1, 3], [2, 1], [2, 2], [2, 3], [3, 1], [3, 2], [3, 3]])   #points = existing points
p = np.array([2.5, 2])   #p = point we wish to insert
  
def majority_vote(votes):
    vote_counts = {}
    for vote in votes:
        if vote in vote_counts:
            vote_counts[vote] += 1
        else:
            vote_counts[vote]= 1
    winners = []
    max_count = max(vote_counts.values())
    for vote, count in vote_counts.items():
        if count == max_count:
            winners.append(vote)
    return random.choice(winners) #returns winner randomly if there are more than 1 winner
  
# Sample usage:
# >>> votes = [1, 2, 3, 2, 2, 3, 1, 1, 2, 3, 1, 1, 1, 2, 3, 3, 3, 2, 2, 2, 3, 2, 3, 1, 1, 2]
# >>> winner = majority_vote(votes)
  
def majority_vote_short(votes):
    mode, count = ss.mstats.mode(votes)
    return mode
  
def find_nearest_neighbours(p, points, k = 5):  #algorithm to find the nearest neighbours
    distances = np.zeros(points.shape[0])
    for i in range(len(distances)):
        distances[i]= distance(p, points[i])
    ind = np.argsort(distances)      #returns index, according to sorted values in array
    return ind[:k]
  
ind = find_nearest_neighbours(p, points, 2)
print(points[ind])   #the two nearest neighbours for this sample case
  
plt.plot(points[:, 0], points[:, 1], "ro")
plt.plot(p[0], p[1], "bo")
plt.axis([0.5, 3.5, 0.5, 3.5])
plt.show()
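The two helpers above can also be written more compactly. The following is a minimal sketch, not the approach used in the original code: majority_vote_counter() and find_nearest_neighbours_vec() are hypothetical names introduced here. The first relies on collections.Counter, so ties go to the value seen first rather than to a random winner. The second computes all distances at once through NumPy broadcasting instead of a Python loop.

```python
from collections import Counter
import numpy as np

def majority_vote_counter(votes):
    """Hypothetical variant: most common vote; ties go to the value seen first."""
    return Counter(votes).most_common(1)[0][0]

def find_nearest_neighbours_vec(p, points, k=5):
    """Hypothetical vectorized variant: indices of the k points closest to p."""
    distances = np.sqrt(np.sum((points - p) ** 2, axis=1))  # one distance per row
    return np.argsort(distances)[:k]

points = np.array([[1, 1], [1, 2], [1, 3], [2, 1], [2, 2], [2, 3], [3, 1], [3, 2], [3, 3]])
p = np.array([2.5, 2])

ind = find_nearest_neighbours_vec(p, points, k=2)
print(points[ind])    # the two points at distance 0.5, (2, 2) and (3, 2), in either order
print(majority_vote_counter([1, 2, 2, 3, 2, 1]))   # 2
```

Deterministic tie-breaking makes results repeatable between runs, at the cost of always favoring whichever class appears first among the neighbors.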

kNN prediction on synthetic data

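Once the nearest neighbors have been found, we have to predict the class of the input point. We will write a function named knn_predict() that predicts the class of the point we want to insert. We can also write a function named generate_synth_data() to generate synthetic points in the xy-plane.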

import numpy as np
import random
import scipy.stats as ss
import matplotlib.pyplot as plt
  
''' add the functions and libraries from previous programmes '''
  
def knn_predict(p, points, outcomes, k = 5):
    ind = find_nearest_neighbours(p, points, k)
    return majority_vote(outcomes[ind])
  
outcomes = np.array([0, 0, 0, 0, 1, 1, 1, 1, 1])
knn_predict(np.array([2.5, 2.7]), points, outcomes, k = 2)
  
def generate_synth_data(n = 50):
    points = np.concatenate((ss.norm(0, 1).rvs((n, 2)), ss.norm(1, 1).rvs((n, 2))), axis = 0)
    outcomes = np.concatenate((np.repeat(0, n), np.repeat(1, n)))
    return (points, outcomes)
  
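n = 20
(points, outcomes) = generate_synth_data(n)   #generate the data before plotting it
plt.figure()
plt.plot(points[:n, 0], points[:n, 1], "ro")   #class 0
plt.plot(points[n:, 0], points[n:, 1], "bo")   #class 1
plt.show()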
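For repeatable experiments it can help to fix the random seed when generating the synthetic data. The sketch below assumes NumPy's default_rng generator in place of scipy.stats, and a compact knn_predict_sketch() that is a hypothetical stand-in for knn_predict(). It breaks ties toward the smallest label instead of at random.

```python
import numpy as np

def knn_predict_sketch(p, points, outcomes, k=5):
    """Hypothetical compact kNN: majority class among the k nearest points."""
    distances = np.linalg.norm(points - p, axis=1)
    ind = np.argsort(distances)[:k]
    # np.bincount counts each non-negative integer label; argmax picks the
    # most frequent one and, on a tie, the smallest label.
    return int(np.argmax(np.bincount(outcomes[ind])))

rng = np.random.default_rng(0)   # fixed seed (an assumption, for repeatability)
n = 20
points = np.concatenate((rng.normal(0, 1, (n, 2)), rng.normal(1, 1, (n, 2))), axis=0)
outcomes = np.concatenate((np.repeat(0, n), np.repeat(1, n)))

print(points.shape, outcomes.shape)   # (40, 2) (40,)
print(knn_predict_sketch(np.array([1.0, 1.0]), points, outcomes, k=5))
```

The two classes overlap heavily here (their means are only one standard deviation apart), so predictions near the boundary depend on the particular sample drawn.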

kNN prediction grid

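We will write a function named make_prediction_grid() that builds a grid over the plane and predicts a class for every point on the grid. A second function, plot_prediction_grid(), plots the output of make_prediction_grid() with matplotlib.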

import numpy as np
import random
import scipy.stats as ss
import matplotlib.pyplot as plt
  
def make_prediction_grid(predictors, outcomes, limits, h, k):
    (x_min, x_max, y_min, y_max) = limits
    xs = np.arange(x_min, x_max, h)
    ys = np.arange(y_min, y_max, h)
    xx, yy = np.meshgrid(xs, ys)
  
    prediction_grid = np.zeros(xx.shape, dtype = int)
    for i, x in enumerate(xs):
        for j, y in enumerate(ys):
            p = np.array([x, y])
            prediction_grid[j, i] = knn_predict(p, predictors, outcomes, k)
    return (xx, yy, prediction_grid)
  
def plot_prediction_grid (xx, yy, prediction_grid, filename):
    """ Plot KNN predictions for every point on the grid.
    Note: reads the global variables predictors and outcomes."""
    from matplotlib.colors import ListedColormap
    background_colormap = ListedColormap (["hotpink", "lightskyblue", "yellowgreen"])
    observation_colormap = ListedColormap (["red", "blue", "green"])
    plt.figure(figsize =(10, 10))
    plt.pcolormesh(xx, yy, prediction_grid, cmap = background_colormap, alpha = 0.5)
    plt.scatter(predictors[:, 0], predictors [:, 1], c = outcomes, cmap = observation_colormap, s = 50)
    plt.xlabel('Variable 1'); plt.ylabel('Variable 2')
    plt.xticks(()); plt.yticks(())
    plt.xlim (np.min(xx), np.max(xx))
    plt.ylim (np.min(yy), np.max(yy))
    plt.savefig(filename)
  
(predictors, outcomes) = generate_synth_data()
# >>>predictors.shape
# >>>outcomes.shape
k = 5; filename ="knn_synth_5.pdf"; limits =(-3, 4, -3, 4); h = 0.1
(xx, yy, prediction_grid) = make_prediction_grid(predictors, outcomes, limits, h, k)
plot_prediction_grid(xx, yy, prediction_grid, filename)
plt.show()

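Output: the plot shows a grid with two classes, drawn in pink and green. The classifier predicts the class of each grid point from its position and its neighbors. Green points should fall on the green tiles of the grid and red points on the pink tiles. Zoom in to check visually how well the classifier works.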
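The index order prediction_grid[j, i] in make_prediction_grid() follows from how np.meshgrid lays out its output: rows correspond to y values and columns to x values. A small sketch with made-up grid sizes illustrates this; the expression 10 * y + x is only a stand-in for the call to knn_predict(), chosen so that each cell is identifiable.

```python
import numpy as np

xs = np.arange(0, 3, 1)      # 3 x values: 0, 1, 2
ys = np.arange(0, 2, 1)      # 2 y values: 0, 1
xx, yy = np.meshgrid(xs, ys)

print(xx.shape)              # (2, 3): rows follow ys, columns follow xs

grid = np.zeros(xx.shape, dtype=int)
for i, x in enumerate(xs):
    for j, y in enumerate(ys):
        grid[j, i] = 10 * y + x   # stand-in for knn_predict(...)
print(grid)
```

Writing grid[i, j] instead would fail with an index error whenever the grid is not square, and would silently transpose the picture when it is.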

Classifying the IRIS dataset

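We will test our classifier on the scikit-learn dataset called "IRIS". To load it, we import datasets from sklearn and call datasets.load_iris(). The IRIS dataset records the sepal length, sepal width, petal length, and petal width of iris flowers from three classes: Iris-Setosa, Iris-Versicolour, and Iris-Virginica. The code below uses only the first two of these features. Using this data, we will classify the observations with our classifier and visualize the result. The scikit-learn (sklearn) library already ships with a prebuilt kNN classifier. We will compare the two classifiers, scikit-learn's and the one we built, and compare their prediction accuracy.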

from sklearn import datasets
import numpy as np
import random
import matplotlib.pyplot as plt
   
iris = datasets.load_iris()
    # >>>iris["data"]
predictors = iris.data[:, 0:2]
outcomes = iris.target
  
plt.plot(predictors[outcomes == 0][:, 0], predictors[outcomes == 0][:, 1], "ro")
plt.plot(predictors[outcomes == 1][:, 0], predictors[outcomes == 1][:, 1], "go")
plt.plot(predictors[outcomes == 2][:, 0], predictors[outcomes == 2][:, 1], "bo")
  
k = 5; filename ="iris_grid.pdf"; limits =(4, 8, 1.5, 4.5); h = 0.1
(xx, yy, prediction_grid) = make_prediction_grid(predictors, outcomes, limits, h, k)
plot_prediction_grid(xx, yy, prediction_grid, filename)
plt.show()
  
from sklearn.neighbors import KNeighborsClassifier #predictions from scikit-learn
knn = KNeighborsClassifier(n_neighbors = 5)
knn.fit(predictors, outcomes)
sk_predictions = knn.predict(predictors)
  
my_predictions = np.array([knn_predict(p, predictors, outcomes, 5) for p in predictors])
  
# >>>sk_predictions == my_predictions
# >>>np.mean(sk_predictions == my_predictions)
print(" accuracy (%) of the scikit-learn classifier : ")
print(100 * np.mean(sk_predictions == outcomes))
print(" accuracy (%) of our own model : ")
print(100 * np.mean(my_predictions == outcomes))
# our homemade predictor is actually somewhat better

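Output: judging by the output, our classifier appears to perform slightly better than the sklearn classifier. Note that both accuracies are computed on the same observations the classifiers were built from, and that majority_vote() breaks ties at random, so the figure for our model can vary slightly from run to run.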
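Accuracy measured on the training observations tends to be optimistic. One common alternative, sketched below, holds out part of the data for testing. The 30% test fraction, the stratified split, and random_state=0 are illustrative assumptions, not choices made in the original project.

```python
import numpy as np
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

iris = datasets.load_iris()
predictors = iris.data[:, 0:2]      # same two features as above
outcomes = iris.target

# Hold out 30% of the observations; test_size and random_state are illustrative.
X_train, X_test, y_train, y_test = train_test_split(
    predictors, outcomes, test_size=0.3, random_state=0, stratify=outcomes)

knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)

train_acc = knn.score(X_train, y_train)   # accuracy on data the model has seen
test_acc = knn.score(X_test, y_test)      # accuracy on held-out data
print(100 * train_acc, 100 * test_acc)
```

The same split could be passed to knn_predict() by using X_train and y_train as the known points and looping over X_test, which would put both classifiers on an equal held-out footing.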

Reference:

edX – HarvardX – Using Python for Research