k-Nearest Neighbors

Reading

- Classification Models
- Machine Learning Landscape

With regression models, we were able to measure the effectiveness of models by taking the distance between our predicted value ($\hat{y}$) and the true value ($y$). In classification models, however, the output is simply which class an observation belongs to - there is no distance to measure.

For example, if you have a picture of a cat and a model classifies it as a horse, how do you measure the distance between the prediction of horse and the true value of cat? You can’t. So we need a different way to evaluate classification models.

There are a number of classification models. Most of these will be addressed in MATH 3480. This semester, we will focus on the k-Nearest Neighbors algorithm.

The Concept behind k-Nearest Neighbors

The k-Nearest Neighbors (kNN) algorithm takes a point and gives it a classification based on the characteristics of points near it.

The value of $k$ tells kNN to look at the $k$ nearest points.

Example: cats tend to have sharper claws and shorter ears, while dogs tend to have duller claws and longer ears. Given a new animal’s claw sharpness and ear length, kNN classifies it by the labels of the nearest points in that feature space.
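As a sketch of the idea, here is a minimal example using scikit-learn’s KNeighborsClassifier; the claw-sharpness and ear-length numbers are invented for illustration:

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical training data: [claw sharpness, ear length]
X = np.array([
    [9.0, 2.0], [8.5, 2.5], [9.5, 1.5],   # cats: sharp claws, short ears
    [3.0, 6.0], [2.5, 7.0], [4.0, 6.5],   # dogs: duller claws, long ears
])
y = np.array(["cat", "cat", "cat", "dog", "dog", "dog"])

# Classify a new animal by the labels of its 3 nearest neighbors
model = KNeighborsClassifier(n_neighbors=3)
model.fit(X, y)
print(model.predict([[8.0, 2.2]]))   # -> ['cat']
```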

Potential Problems/Questions

As $k$ changes, the result could change. How do we know what value of $k$ to use? The choice is somewhat arbitrary, but a common rule of thumb is \(k = \sqrt{n}\), where $n$ is the number of training points.
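A small sketch of this rule of thumb in Python; nudging the result up to an odd integer is an extra assumption here (it helps avoid ties), not part of the rule itself:

```python
import math

def choose_k(n):
    """Rule of thumb: k ~ sqrt(n), nudged to the nearest odd integer."""
    k = round(math.sqrt(n))
    return k if k % 2 == 1 else k + 1

print(choose_k(100))  # sqrt(100) = 10 -> 11
```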

What if the $k$ nearest points around my point are split evenly between groups? (One common remedy, used in the example below, is to pick a $k$ that cannot be split evenly, such as a prime number.)

Distance Measures

How do we determine the distance?

Euclidean Distance

The Euclidean distance is the typical distance we have learned since grade school. It can be depicted with a right triangle and solved using the Pythagorean Theorem.

First, let’s say that we have two points $a$ and $b$, and we want the distance $d$ between them. Define $d_x$ to be the distance in the x-dimension, $d_x = a_x - b_x$, and similarly $d_y = a_y - b_y$. Then we can depict the distance between the two points as a right triangle with legs $d_x$ and $d_y$ and with $d$ as the hypotenuse.

In 2 dimensions, the distance between points $a$ and $b$ is $d$ and can be calculated as,

\[d = \sqrt{d_x^2 + d_y^2}\]

In 3 dimensions, the distance is,

\[d = \sqrt{d_x^2 + d_y^2 + d_z^2}\]

What would be the distance in $n$ dimensions? Writing $d_i$ for the difference along the $i$-th dimension,

\[d = \sqrt{\sum_{i=1}^n d_i^2}\]
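A quick sketch of this formula with NumPy; the two points are made up:

```python
import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = np.array([4.0, 6.0, 3.0])

d = a - b                          # (d_1, ..., d_n)
euclidean = np.sqrt(np.sum(d**2))  # square root of the sum of squared differences
print(euclidean)                   # 5.0  (a 3-4-5 triangle in the first two dimensions)
```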

Manhattan Distance

With the Euclidean distance, we squared each term. Why? Because the Pythagorean Theorem tells us that squaring is what makes the result agree with physical, straight-line distance. However, there are other ways to express distance.

Imagine you are in New York City and you want to go from $a$ to $b$. Can you go in a straight line? Not likely. The street grid forces you to travel the full length of $d_x$ and the full length of $d_y$. In this case,

\[d = |d_x| + |d_y|\]

Essentially, we are just adding the lengths along each dimension. Because this example demonstrates the idea so well, this type of distance is often called the Manhattan distance. In $n$ dimensions,

\[d = \sum_{i=1}^n |d_i|\]
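The same made-up points as before, now with the Manhattan distance:

```python
import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = np.array([4.0, 6.0, 3.0])

manhattan = np.sum(np.abs(a - b))  # |d_1| + ... + |d_n|
print(manhattan)                   # 3 + 4 + 0 = 7.0
```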

Norms

We have seen how to measure a distance with a power of 1 and with a power of 2. Can we do it with a power of 3? 4? 20? 100? Yes! This family of distance measures is called a norm; the norm with power $p$ is called the $L_p$ norm.

\[L_p = \left(\sum_{i=1}^n |d_i|^p\right)^{1/p}\]

It is hard to visualize the norms beyond the $L_2$ norm, but we can see what is happening numerically.

Show 17_kNN_NormsDemo.xlsx
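For a quick numerical check outside the spreadsheet, NumPy’s np.linalg.norm computes the $L_p$ norm of a vector directly; the vector of differences below is made up. Notice that as $p$ grows, the value approaches the largest $|d_i|$:

```python
import numpy as np

d = np.array([3.0, -4.0, 0.0])  # the vector of differences d_i

for p in (1, 2, 3, 20, 100):
    print(p, np.linalg.norm(d, ord=p))  # (sum of |d_i|^p)^(1/p)
```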

Infinity Norm

There is another useful norm that is commonly seen, known as the $L_\infty$ norm. It is defined as the limit of the $L_p$ norms as $p \to \infty$:

\[L_\infty = \lim_{p \to \infty}\left(\sum_{i=1}^n |d_i|^p\right)^{1/p}\]

What does this do? As $p$ grows, the largest $|d_i|$ dominates the sum, and in the limit it is the only term that survives:

\[L_\infty = \lim_{p \to \infty}\left(\sum_{i=1}^n |d_i|^p\right)^{1/p} = \max_i |d_i|\]
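We can check this numerically; NumPy accepts ord=np.inf for exactly this norm:

```python
import numpy as np

d = np.array([3.0, -4.0, 0.0])

print(np.linalg.norm(d, ord=np.inf))  # 4.0
print(np.max(np.abs(d)))              # 4.0 -- the same thing
```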

Many of these norms are used throughout Data Science. With the kNN algorithm, the standard (default) option is generally the Euclidean distance.

Hard Classification vs. Soft Classification

Let’s take an example where we have a sample of $n = 100$ training points divided into three categories: A, B, and C. We take the nearest $k$ points, where $k$ is the prime number nearest $\sqrt{n} = \sqrt{100} = 10$, so we choose $k = 11$.

We put a point on the graph and count the nearest $k=11$ points and get the following results:

Class:  A   B   C
Count:  2   8   1

Once we have counted the $k$ nearest neighbors, how do we interpret the counts?

Classification models often have two ways to interpret these counts:

- Hard classification: assign the point to the single most common class among its neighbors (here, B).
- Soft classification: report a probability for each class based on the neighbor counts (here, A: 2/11, B: 8/11, C: 1/11).
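In scikit-learn, for example, these two interpretations correspond to predict (hard) and predict_proba (soft); the training data below is randomly generated purely for illustration:

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))               # 100 hypothetical training points
y = rng.choice(["A", "B", "C"], size=100)   # three classes

model = KNeighborsClassifier(n_neighbors=11).fit(X, y)
point = [[0.0, 0.0]]

print(model.predict(point))        # hard: the single majority class
print(model.predict_proba(point))  # soft: fraction of the 11 neighbors in each class
```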

k-Nearest Neighbors

The kNN model is a classification model used to categorize points based on the $k$ nearest points around them.

However, we can also use the kNN algorithm as a regressor! To predict the value at a point, we find the $k$ nearest points, then average their $y$ values. This average is our prediction $\hat{y}$:

\[\hat{y} = \frac{1}{k}\sum_{j=1}^k y_j\]
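A minimal sketch of kNN regression using scikit-learn’s KNeighborsRegressor; the data is synthetic and the choice $k = 7$ is arbitrary:

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

# Hypothetical 1-D regression data: y is roughly 2x plus noise
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(50, 1))
y = 2 * X.ravel() + rng.normal(scale=0.5, size=50)

model = KNeighborsRegressor(n_neighbors=7).fit(X, y)
print(model.predict([[5.0]]))  # roughly 10: the mean y of the 7 nearest x values
```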