Reading

With regression models, we were able to measure the effectiveness of models by taking the distance between our predicted value ($\hat{y}$) and the true value ($y$). In classification models, however, the output is simply which class an observation belongs to - there is no distance to measure.
For example, if you have a picture of a cat and a model classifies it as a horse, how do you measure the distance between the prediction of horse and the true value of cat? You can't. So we need a different form of evaluation.
There are a number of classification models. Most of these will be addressed in MATH 3480. This semester, we will focus on the k-Nearest Neighbors algorithm.
The k-Nearest Neighbors (kNN) algorithm takes a point and gives it a classification based on the characteristics of points near it.
The value of $k$ tells kNN to look at the $k$ nearest points.
Example: Cats have sharper claws and shorter ears, dogs have less sharp claws and longer ears
- Create graph on the board with claw sharpness on the x-axis and ear length on the y-axis
- Add a number of ‘C’ and ‘D’ points to the board to indicate pre-known values
- Add another unknown point
- Count the number of nearby points and make an estimate of what the unknown point likely is
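The board example above can be sketched in code. The measurements and labels below are hypothetical ('C' for cat, 'D' for dog), and this is just a minimal illustration of the voting idea, not a full implementation:

```python
from collections import Counter
import math

# Hypothetical training data: (claw_sharpness, ear_length, label)
points = [
    (9.0, 2.0, "C"), (8.5, 2.5, "C"), (7.8, 3.0, "C"),
    (3.0, 8.0, "D"), (2.5, 7.5, "D"), (4.0, 9.0, "D"),
]

def knn_classify(x, y, k=3):
    """Label (x, y) by majority vote among the k nearest training points."""
    by_distance = sorted(points, key=lambda p: math.hypot(p[0] - x, p[1] - y))
    votes = Counter(label for _, _, label in by_distance[:k])
    return votes.most_common(1)[0][0]

# An unknown point with sharp claws and short ears
print(knn_classify(8.0, 2.8))  # -> C
```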
As $k$ changes, it could change the result. How do we know what value of $k$ to use? This is kind of arbitrary, but we generally use the following rule: \(k = \sqrt{n}\)
What if the number of points around my point is split evenly between groups? A common remedy is to choose an odd (or prime) $k$ so that exact ties are less likely, or to break a tie in favor of the closest neighbor.
How do we determine the distance?
The Euclidean distance is the typical distance we have learned since grade school. It can be depicted with a right triangle and solved using the Pythagorean Theorem.
First, let's say that we have two points $a$ and $b$. We want the distance $d$ between $a$ and $b$. Let's define $d_x$ to be the distance in the x-dimension, so $d_x = a_x - b_x$, and likewise $d_y = a_y - b_y$ in the y-dimension. Then we can depict the distance between the two points as a right triangle with legs $d_x$ and $d_y$ and with $d$ as the hypotenuse.
In 2 dimensions, the distance between points $a$ and $b$ is $d$ and can be calculated as,
\[d = \sqrt{d_x^2 + d_y^2}\]
In 3 dimensions, the distance is,
\[d = \sqrt{d_x^2 + d_y^2 + d_z^2}\]
What would be the distance in $n$ dimensions?
\[d = \sqrt{\sum_{i=1}^n d_i^2}\]
With the Euclidean distance, we squared each term. Why? Because the Pythagorean Theorem shows that this is what a straight-line physical distance requires. However, there are other ways to express distance.
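The $n$-dimensional formula translates directly into code; here is a minimal sketch using the classic 3-4-5 right triangle as a check:

```python
import math

def euclidean(a, b):
    """Euclidean (L2) distance between two points of any dimension."""
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

print(euclidean((0, 0), (3, 4)))  # classic 3-4-5 triangle -> 5.0
```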
Imagine you are in New York City and you want to go from $a$ to $b$. Can you go in a straight line? Not likely. You likely actually have to travel the length of $d_x$ and $d_y$. In this case,
\[d = d_x + d_y\]
Essentially, we are just adding the lengths of all the dimensions together. Because this example demonstrates the idea so well, we often call this type of distance the Manhattan distance. In $n$ dimensions,
\[d = \sum_{i=1}^n |d_i|\]
We have seen how to measure a distance with a power of 1 and with a power of 2. Can we do it with a power of 3? 4? 20? 100? Yes! This distance measure is often called a norm. When we deal with a norm of power $p$, we specifically call it an $L_p$ norm.
\[L_p = \left(\sum_{i=1}^n |d_i|^p\right)^{1/p}\]
It is hard to visualize the norms beyond the L2-norm, but we can see what is happening numerically.
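As a quick numerical sketch (toy points, assumed values), we can compute the $L_p$ distance between the same two points for several values of $p$ and watch how the result changes:

```python
def lp_distance(a, b, p):
    """L_p distance between points a and b."""
    return sum(abs(ai - bi) ** p for ai, bi in zip(a, b)) ** (1 / p)

a, b = (0, 0), (3, 4)
for p in (1, 2, 3, 20):
    print(p, round(lp_distance(a, b, p), 4))
# p=1 gives the Manhattan distance (7.0), p=2 the Euclidean distance (5.0),
# and larger p shrinks toward the largest coordinate difference.
```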
Show
17_kNN_NormsDemo.xlsx
There is another useful norm that is commonly seen, known as the L$\infty$-Norm.
\[L_\infty = \left(\sum_{i=1}^n |d_i|^\infty\right)^{1/\infty}\]
What does this do? As $p$ grows, the largest $|d_i|$ dominates the sum, so in the limit the L$\infty$-norm simply returns the maximum $|d_i|$.
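Numerically, the largest coordinate difference dominates as $p$ grows, so the L$\infty$ distance is just a maximum. A minimal sketch with assumed toy points:

```python
def linf_distance(a, b):
    """L_infinity distance: the single largest coordinate difference."""
    return max(abs(ai - bi) for ai, bi in zip(a, b))

def lp_distance(a, b, p):
    """L_p distance, for comparison against the limit."""
    return sum(abs(ai - bi) ** p for ai, bi in zip(a, b)) ** (1 / p)

a, b = (0, 0), (3, 4)
print(linf_distance(a, b))               # -> 4
print(round(lp_distance(a, b, 50), 4))   # already very close to 4
```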
Many of these norms are used throughout Data Science. With the kNN algorithm, the standard (default) option is generally the Euclidean distance.
Let’s take an example where we have a sample of 100 training points being divided into three categories A, B, and C. We take the nearest $k$ points where $k$ is the prime number nearest $\sqrt{n} = \sqrt{100} = 10$. So, let’s choose $k=11$.
We put a point on the graph and count the nearest $k=11$ points and get the following results:
| A | B | C |
|---|---|---|
| 2 | 8 | 1 |
Once we have counted the $k$ nearest neighbors, how do we interpret the counts?
Classification models often have two methods of interpretation: a hard prediction, which reports the single most common class among the $k$ neighbors (here, B), and a soft prediction, which reports the proportion of neighbors in each class (here, $2/11$, $8/11$, and $1/11$) as an estimate of the class probabilities.
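Both interpretations can be sketched from the counts in the table above (assuming the usual conventions of majority vote and class proportions):

```python
from collections import Counter

# Neighbor counts from the table above
counts = Counter({"A": 2, "B": 8, "C": 1})
k = sum(counts.values())  # 11

# Hard prediction: the majority class among the k neighbors
print(counts.most_common(1)[0][0])  # -> B

# Soft prediction: the share of neighbors in each class
print({label: round(n / k, 3) for label, n in counts.items()})
```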
The kNN model is a classification model used to categorize points based on the $k$ nearest points around them.
However, we can also use the kNN algorithm as a regressor! To predict the value of a point, we find the $k$ nearest points, then average their $y$ values. This average will be our prediction $\hat{y}$.
\[\hat{y} = \frac{1}{k}\sum_{j=1}^k y_j\]
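The regression version is a short sketch as well; the one-dimensional training pairs below are hypothetical:

```python
# Hypothetical training data: (x, y) pairs
data = [(1.0, 2.1), (2.0, 3.9), (3.0, 6.2), (4.0, 8.1), (5.0, 9.8)]

def knn_regress(x, k=3):
    """Predict y at x by averaging the y-values of the k nearest points."""
    nearest = sorted(data, key=lambda p: abs(p[0] - x))[:k]
    return sum(y for _, y in nearest) / k

print(knn_regress(2.6))  # average of the y-values at x = 2.0, 3.0, 4.0
```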