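Weighted kNN is a modified version of the k-nearest neighbours algorithm. One of the many issues that affect the performance of kNN is the choice of the hyperparameter k. If k is too small, the algorithm becomes more sensitive to outliers. If k is too large, the neighbourhood may include too many points from other classes.
Another issue is how the class labels of the neighbours are combined. The simplest approach is a majority vote, but this can be a problem when the nearest neighbours vary widely in distance and the closest ones indicate the class of the object more reliably.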
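Intuition:
Consider the following training set.
The red labels indicate class 0 points and the green labels indicate class 1 points.
Treat the white point as the query point (the point whose class label has to be predicted).
If this data set is given to a kNN-based classifier, the classifier declares that the query point belongs to class 0. In the plot, however, the point is clearly closer to the class 1 points than to the class 0 points. Weighted kNN is used to overcome this shortcoming. In weighted kNN, the k nearest points are given weights using a function called the kernel function. The intuition is to give more weight to nearby points and less weight to points that are farther away. Any function whose value decreases as the distance increases can be used as the kernel function of a weighted kNN classifier. A simple choice is the inverse distance function.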
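As a rough illustration of the kernel idea, the sketch below defines three weighting functions that all decrease as the distance grows. Only the inverse distance function is named in the text; the inverse squared distance and Gaussian variants, the function names, and the `eps` and `bandwidth` parameters are assumptions added for comparison.

```python
import math

def inverse_distance(d, eps=1e-12):
    """Weight 1/d. eps is an assumed safeguard against division by zero."""
    return 1.0 / (d + eps)

def inverse_squared_distance(d, eps=1e-12):
    """Weight 1/d^2, which penalises distant points more strongly."""
    return 1.0 / (d * d + eps)

def gaussian(d, bandwidth=1.0):
    """Gaussian kernel. bandwidth is a hypothetical tuning parameter."""
    return math.exp(-(d * d) / (2.0 * bandwidth ** 2))

# Print the weight each kernel assigns at a few distances.
for d in (0.5, 1.0, 2.0, 4.0):
    print(d, inverse_distance(d), inverse_squared_distance(d), gaussian(d))
```

Any of these could be plugged into the voting step. They differ only in how quickly the influence of a neighbour fades with distance.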
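Algorithm:
- Let L = {(x_i, y_i), i = 1, ..., n} be a training set of observations x_i with given class y_i, and let x be a new observation (the query point) whose class label y has to be predicted.
- Compute d(x_i, x) for i = 1, ..., n, the distance between the query point and every point in the training set.
- Select D' ⊆ L, the set of the k training data points nearest to the query point.
- Predict the class of the query point using distance-weighted voting, where v ranges over the class labels. Use the following formula:
  $$y = \arg\max_{v} \sum_{(x_i, y_i) \in D'} w_i \, I(v = y_i)$$
  Here I(·) is the indicator function and w_i is the kernel weight of the i-th neighbour, for example w_i = 1 / d(x_i, x) for the inverse distance function.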
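The steps above can be sketched in a few lines of Python. This is a minimal sketch and not the article's implementation: it assumes inverse distance weights, points of any dimension stored as tuples, and arbitrary hashable class labels. The function name `weighted_knn_predict`, the toy data, and the rule that an exact match decides the class are all assumptions. It uses `math.dist`, which requires Python 3.8 or later.

```python
import math
from collections import defaultdict

def weighted_knn_predict(train, query, k):
    """train is a list of (point, label) pairs; point and query are tuples."""
    # Step 2: distance from the query to every training point.
    dists = [(math.dist(x, query), y) for x, y in train]
    # Step 3: keep the k nearest neighbours.
    nearest = sorted(dists, key=lambda t: t[0])[:k]
    # Step 4: distance-weighted vote with w_i = 1 / d(x_i, x).
    votes = defaultdict(float)
    for d, y in nearest:
        if d == 0:
            return y  # assumed convention: an exact match decides the class
        votes[y] += 1.0 / d
    return max(votes, key=votes.get)

# Hypothetical toy data: one close "a" point and two farther "b" points.
train = [((0.0, 0.0), "a"), ((0.0, 3.0), "b"), ((3.0, 0.0), "b")]
print(weighted_knn_predict(train, (0.0, 1.0), k=3))  # prints "a"
```

With k = 3 a plain majority vote would pick "b" (two votes against one), while the weighted vote picks "a" because the single "a" point is much closer to the query.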
Implementation:
Treat 0 as the label for class 0 and 1 as the label for class 1. Below is an implementation of the weighted kNN algorithm.
C++
// C++ program to implement the
// weighted K nearest neighbour algorithm.
#include <algorithm>
#include <cmath>
#include <cstdio>
using namespace std;

struct Point
{
    int val;         // Class of point
    double x, y;     // Co-ordinates of point
    double distance; // Distance from test point
};

// Used to sort an array of points by increasing
// order of distance
bool comparison(const Point &a, const Point &b)
{
    return (a.distance < b.distance);
}

// This function finds classification of point p using
// weighted k nearest neighbour algorithm. It assumes only
// two groups and returns 0 if p belongs to class 0, else
// 1 (belongs to class 1).
int weightedkNN(Point arr[], int n, int k, Point p)
{
    // Fill Euclidean distances of all points from p
    for (int i = 0; i < n; i++)
        arr[i].distance =
            sqrt((arr[i].x - p.x) * (arr[i].x - p.x) +
                 (arr[i].y - p.y) * (arr[i].y - p.y));

    // Sort the Points by distance from p
    sort(arr, arr + n, comparison);

    // Never look at more neighbours than there are points
    k = min(k, n);

    // Now consider the first k elements and only
    // two groups
    double freq1 = 0; // weighted sum of group 0
    double freq2 = 0; // weighted sum of group 1
    for (int i = 0; i < k; i++)
    {
        // A neighbour that coincides with p decides the class
        // outright; this also avoids dividing by zero.
        if (arr[i].distance == 0)
            return arr[i].val;
        if (arr[i].val == 0)
            freq1 += 1 / arr[i].distance;
        else if (arr[i].val == 1)
            freq2 += 1 / arr[i].distance;
    }
    return (freq1 > freq2 ? 0 : 1);
}
// Driver code
int main()
{
    const int n = 13; // Number of data points
    Point arr[n];

    // Class 0 points
    arr[0].x = 0;    arr[0].y = 4;    arr[0].val = 0;
    arr[1].x = 1;    arr[1].y = 4.9;  arr[1].val = 0;
    arr[2].x = 1.6;  arr[2].y = 5.4;  arr[2].val = 0;
    arr[3].x = 2.2;  arr[3].y = 6;    arr[3].val = 0;
    arr[4].x = 2.8;  arr[4].y = 7;    arr[4].val = 0;
    arr[5].x = 3.2;  arr[5].y = 8;    arr[5].val = 0;
    arr[6].x = 3.4;  arr[6].y = 9;    arr[6].val = 0;

    // Class 1 points
    arr[7].x = 1.8;  arr[7].y = 1;    arr[7].val = 1;
    arr[8].x = 2.2;  arr[8].y = 3;    arr[8].val = 1;
    arr[9].x = 3;    arr[9].y = 4;    arr[9].val = 1;
    arr[10].x = 4;   arr[10].y = 4.5; arr[10].val = 1;
    arr[11].x = 5;   arr[11].y = 5;   arr[11].val = 1;
    arr[12].x = 6;   arr[12].y = 5.5; arr[12].val = 1;

    /* Testing Point */
    Point p;
    p.x = 2;
    p.y = 4;

    // Number of nearest neighbours to consider
    int k = 5;
    printf("The value classified to query point"
           " is: %d.\n", weightedkNN(arr, n, k, p));
    return 0;
}
Python3
# Python3 program to implement the
# weighted K nearest neighbour algorithm.
import math

def weightedkNN(points, p, k=3):
    '''
    This function finds classification of p using
    weighted k nearest neighbour algorithm. It assumes only
    two classes and returns 0 if p belongs to class 0, else
    1 (belongs to class 1).

    Parameters -
        points : Dictionary of training points having two keys - 0 and 1.
                 Each key has a list of training data points belonging
                 to that class
        p : A tuple, test data point of form (x, y)
        k : number of nearest neighbours to consider, default is 3
    '''
    distance = []
    for group in points:
        for feature in points[group]:
            # calculate the euclidean distance of p from training points
            euclidean_distance = math.sqrt((feature[0] - p[0])**2 +
                                           (feature[1] - p[1])**2)
            # Add a tuple of form (distance, group) in the distance list
            distance.append((euclidean_distance, group))

    # sort the distance list in ascending order
    # and select first k distances
    distance = sorted(distance)[:k]

    freq1 = 0  # weighted sum of group 0
    freq2 = 0  # weighted sum of group 1
    for d in distance:
        # A neighbour that coincides with p decides the class
        # outright; this also avoids a ZeroDivisionError.
        if d[0] == 0:
            return d[1]
        if d[1] == 0:
            freq1 += (1 / d[0])
        elif d[1] == 1:
            freq2 += (1 / d[0])

    return 0 if freq1 > freq2 else 1
# Driver function
def main():
    # Dictionary of training points having two keys - 0 and 1
    # key 0 has the points belonging to class 0
    # key 1 has the points belonging to class 1
    points = {0: [(0, 4), (1, 4.9), (1.6, 5.4), (2.2, 6), (2.8, 7), (3.2, 8), (3.4, 9)],
              1: [(1.8, 1), (2.2, 3), (3, 4), (4, 4.5), (5, 5), (6, 5.5)]}

    # query point p(x, y)
    p = (2, 4)

    # Number of neighbours
    k = 5

    print("The value classified to query point is: {}".format(weightedkNN(points, p, k)))

if __name__ == '__main__':
    main()
Output:
The value classified to query point is: 1
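To see why the weighting matters on this data set, the following sketch runs a plain majority vote and an inverse distance weighted vote over the same five nearest neighbours of the query point (2, 4). The training points, query point, and k are taken from the listing above; the variable names and the use of `collections.Counter` are choices made for this illustration.

```python
import math
from collections import Counter

points = {0: [(0, 4), (1, 4.9), (1.6, 5.4), (2.2, 6), (2.8, 7), (3.2, 8), (3.4, 9)],
          1: [(1.8, 1), (2.2, 3), (3, 4), (4, 4.5), (5, 5), (6, 5.5)]}
p = (2, 4)
k = 5

# (distance, label) pairs for the k training points nearest to p.
neighbours = sorted(
    (math.hypot(x - p[0], y - p[1]), label)
    for label, pts in points.items()
    for (x, y) in pts
)[:k]

# Plain kNN: every neighbour counts as one vote.
majority = Counter(label for _, label in neighbours).most_common(1)[0][0]

# Weighted kNN: every neighbour votes with weight 1 / distance.
weights = {0: 0.0, 1: 0.0}
for d, label in neighbours:
    weights[label] += 1.0 / d
weighted = max(weights, key=weights.get)

print("majority vote:", majority)  # prints "majority vote: 0"
print("weighted vote:", weighted)  # prints "weighted vote: 1"
```

Three of the five neighbours belong to class 0, so the majority vote returns 0. The two class 1 neighbours are the closest ones (distances of about 1.00 and 1.02), so their combined weight of roughly 1.98 exceeds the class 0 total of roughly 1.93, and the weighted vote returns 1.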