Image taken from Blake Wheeler

K-Nearest Neighbors Tutorial

Hunter Owen
5 min read · Oct 22, 2020

In this tutorial you will learn how the K-Nearest Neighbors algorithm works and how to implement it in Python. K-Nearest Neighbors is part of the scikit-learn library, and we will be using other scikit-learn modules as well. If you don't already have it installed, run the command below in your terminal.

pip install -U scikit-learn

This tutorial covers:

  • How K-Nearest Neighbors works
  • Python code showing how to implement it on a data set

K-Nearest Neighbors uses a very simple idea for predictive modeling: the algorithm assumes that items close to each other in the feature space spanned by your data are likely to belong to the same class. In the example below, the class is whether a point is a triangle or a square. If the number of nearest neighbors were set to 3, the algorithm would predict the green ball to be a triangle because there are 2 triangles and only 1 square among its 3 closest points; if the number of nearest neighbors were set to 5, it would predict a square because there are 3 squares and 2 triangles. The algorithm works by storing all of the training data in memory, and when it is given a new point it measures how close that point is to the stored data points to determine which class it most likely belongs to.

Image taken from Wikipedia
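
To make the idea concrete, here is a minimal from-scratch sketch of the voting logic using plain NumPy. The toy points and labels are made up for illustration, and this is not the scikit-learn implementation we use below.

import numpy as np

def knn_predict(X_train, y_train, x_new, k=3):
    """Predict the label of x_new by majority vote among its k nearest neighbors."""
    # Euclidean distance from x_new to every stored training point
    distances = np.linalg.norm(X_train - x_new, axis=1)
    # Indices of the k closest points
    nearest = np.argsort(distances)[:k]
    # Majority vote among their labels
    labels, counts = np.unique(y_train[nearest], return_counts=True)
    return labels[np.argmax(counts)]

# Toy data: 0 = square, 1 = triangle
X_toy = np.array([[1.0, 1.0], [1.2, 0.8], [3.0, 3.0], [3.2, 2.9], [2.9, 3.1]])
y_toy = np.array([1, 1, 0, 0, 0])

print(knn_predict(X_toy, y_toy, np.array([1.1, 1.0]), k=3))  # -> 1 (triangle)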

For this tutorial we will be working with the breast cancer data set and the imports shown below.

from sklearn.datasets import load_breast_cancer
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import plot_confusion_matrix, confusion_matrix, precision_score
from sklearn.model_selection import train_test_split, GridSearchCV

The breast cancer data set is part of the scikit-learn library and is typically used for classification modeling. The target in this data set is the diagnosis of the tumor: malignant tumors are labeled 0 and benign tumors are labeled 1. The features are physical properties of the tumor such as mean area, mean symmetry, and mean texture. To split the data into the target y and the features X, run the code below.

X, y = load_breast_cancer(return_X_y=True, as_frame=True)
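
If you want to confirm the label encoding and get a feel for the data, the dataset bunch also exposes the class names, and X and y can be inspected directly:

# The bunch object holds the label names alongside the data
data = load_breast_cancer()
print(data.target_names)   # ['malignant' 'benign'] -> 0 = malignant, 1 = benign
print(X.shape)             # (569, 30): 569 tumors, 30 numeric features
print(y.value_counts())    # how many tumors of each class we have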

We want to do a train-test split on our data, just like with any other model, so run the following code. You can also take a look at what the features in the DataFrame look like.

X_train, X_test, y_train, y_test = \
train_test_split(X, y, random_state=42)
X_train.head()
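
By default, train_test_split holds out 25% of the rows for testing. If you want to sanity-check the split, something along these lines works:

# Roughly 3/4 of the rows go to training with the default test_size of 0.25
print(X_train.shape, X_test.shape)
print(y_train.value_counts(normalize=True))  # class balance in the training set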

Scaling is highly important when using KNN (K-Nearest Neighbors) because the entire algorithm is based on distances in the feature space. If you don't scale the data, certain features will have a lot more impact than they really should. For more information about scaling, look here.

# We scale both the train and test sets using a standard scaler
ss = StandardScaler()
X_train_ss = pd.DataFrame(ss.fit_transform(X_train),
                          index=X_train.index,
                          columns=X_train.columns)
X_test_ss = pd.DataFrame(ss.transform(X_test),
                         index=X_test.index,
                         columns=X_test.columns)
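
To verify the scaler did what we expect, the transformed training features should now have a mean of roughly 0 and a standard deviation of roughly 1 (the test set will be close but not exact, because it was only transformed, not fit):

# Each scaled training feature should have mean ~0 and standard deviation ~1
print(X_train_ss.mean().round(2).head())
print(X_train_ss.std().round(2).head())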

Next we want to instantiate the model. The default number of neighbors is 5, so we will keep that for this model and see how it performs on the test data.

# Instantiate the model
# The default number of neighbors is 5
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train_ss, y_train)
y_hat = knn.predict(X_test_ss)
print(f'Accuracy score: {knn.score(X_test_ss, y_test)}')
print(f'Precision score: {precision_score(y_test, y_hat)}')

I got an accuracy score of 95.8% and a precision of 96.6%, which is pretty good, but we can do better.

Run the code below to output the confusion matrix and see how many false positives and false negatives there are.

fig, ax = plt.subplots()
fig.suptitle("Tumor Confusion Matrix")
plot_confusion_matrix(knn, X_test_ss, y_test, ax=ax, cmap='Blues_r',
                      display_labels=['Malignant', 'Benign']);
plt.xlabel('Predicted')
plt.ylabel('True')
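
If you prefer raw counts over a plot, the confusion_matrix function we imported earlier returns the same numbers as an array; one way to unpack it (remembering that benign = 1 is the positive class here) is:

# Rows are true labels and columns are predicted labels: [[TN, FP], [FN, TP]]
tn, fp, fn, tp = confusion_matrix(y_test, y_hat).ravel()
print(f'Predicted benign but actually malignant (false positives): {fp}')
print(f'Predicted malignant but actually benign (false negatives): {fn}')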

Ideally we want as few cases as possible where we predict a tumor is benign when it is really malignant. Since benign (1) is the positive class, those mistakes show up as false positives, which is why we will score on precision when we grid search next to optimize our KNN model.

param_grid = {'n_neighbors': list(range(3, 31, 2)),
              'p': [1, 2]}
gs = GridSearchCV(estimator=KNeighborsClassifier(), param_grid=param_grid,
                  cv=5, scoring='precision')

The main parameters that are going to affect the outcome are the number of neighbors, which we let range over the odd numbers from 3 to 29, and p, which controls how distance is measured. The more neighbors we use, the more biased the model becomes; the fewer neighbors, the more variance it has. With p=2 the model uses Euclidean distance, the straight-line distance you get from the Pythagorean theorem; with p=1 it uses Manhattan distance, which sums the absolute differences along each dimension.

From Digital Humanities
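
As a quick sanity check on the two metrics, here is the distance between the points (0, 0) and (3, 4) computed both ways. The scipy functions are used purely for illustration; KNeighborsClassifier handles this internally through the p parameter.

from scipy.spatial.distance import euclidean, cityblock

a, b = (0, 0), (3, 4)
print(euclidean(a, b))   # Euclidean (p=2): sqrt(3**2 + 4**2) = 5.0
print(cityblock(a, b))   # Manhattan (p=1): |3| + |4| = 7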

Next, we fit the grid search on our training set:

gs.fit(X_train_ss, y_train)
gs.best_params_
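
Besides best_params_, the fitted grid search also exposes the cross-validated score of the winning combination and a full results table, which is handy for seeing how sensitive the model is to the number of neighbors:

# Mean cross-validated precision of the best parameter combination
print(gs.best_score_)

# Full grid of results, sorted from best to worst
results = pd.DataFrame(gs.cv_results_)
print(results[['param_n_neighbors', 'param_p', 'mean_test_score']]
      .sort_values('mean_test_score', ascending=False)
      .head())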

The best parameters it gave us were 3 nearest neighbors and Manhattan distance (p=1). Let's test this out in a new KNN model.

knn2 = KNeighborsClassifier(n_neighbors=3, p=1)
knn2.fit(X_train_ss, y_train)
y_hat2 = knn2.predict(X_test_ss)
print(f'Accuracy score: {knn2.score(X_test_ss, y_test)}')
print(f'Precision score: {precision_score(y_test, y_hat2)}')

This time around I got an accuracy of 97.2% and a precision of 97.7%.
Let's see our confusion matrix.

fig, ax = plt.subplots()
fig.suptitle("Tumor Confusion Matrix")
plot_confusion_matrix(knn2, X_test_ss, y_test, ax=ax, cmap='Blues_r',
                      display_labels=['Malignant', 'Benign']);
plt.xlabel('Predicted')
plt.ylabel('True')
plt.savefig('Confusion Matrix2', dpi=300)

As you can see, we reduced false positives and false negatives by 1 compared to our previous model.

Pros

  • Easy to understand
  • Requires little computational power on smaller data sets
  • Quick to train

Cons

  • Not good at determining attribute (feature) importance
  • Can be slow when predicting, since all distances are computed at prediction time
  • Not as interpretable as other models

Summary

KNN is pretty straightforward and easy to grasp compared to other models; however, its simplicity can also be its Achilles heel. It's good to use on small classification data sets, but it should always be compared to other models (decision trees, logistic regression, etc.) before being used as a final model.
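
As a rough sketch of what that comparison might look like, you could cross-validate a few baseline classifiers on the scaled training data and see how KNN stacks up; the model choices and settings here are just for illustration:

from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score

models = {
    'KNN (k=3, p=1)': KNeighborsClassifier(n_neighbors=3, p=1),
    'Logistic Regression': LogisticRegression(max_iter=1000),
    'Decision Tree': DecisionTreeClassifier(random_state=42),
}

# 5-fold cross-validated precision on the scaled training set
for name, model in models.items():
    scores = cross_val_score(model, X_train_ss, y_train, cv=5, scoring='precision')
    print(f'{name}: mean precision = {scores.mean():.3f}')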

Citations:

Evert, S., Jannidis, F., Proisl, T., Vitt, T., Schöch, C., Pielström, S., Reger, I. (2016). Outliers or Key Profiles? Understanding Distance Measures for Authorship Attribution. In Digital Humanities 2016: Conference Abstracts. Jagiellonian University & Pedagogical University, Kraków, pp. 188–191.
