K-Nearest Neighbours (KNN)
To classify a new point, look at its closest neighbours and copy the majority.
What you will learn
- Explain the KNN idea
- Train KNeighborsClassifier
- See how k changes the result
Judge a point by its neighbours
K-nearest neighbours (KNN) is one of the most intuitive classifiers. To label a new point, it finds the k closest examples in the data it has seen and takes a majority vote among their labels.
It is the old saying “you are the company you keep.” If a new fruit is closest to three apples and one orange, KNN calls it an apple.
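The vote itself is just counting. As a minimal sketch, assuming we already know the labels of the four nearest neighbours from the fruit example (the variable names here are only for illustration):

```python
from collections import Counter

# Labels of the four nearest neighbours: three apples and one orange
neighbour_labels = ['apple', 'apple', 'apple', 'orange']

# Count how many times each label appears, then keep the most common one
votes = Counter(neighbour_labels)
winner, count = votes.most_common(1)[0]

print('Winner:', winner, 'with', count, 'of', len(neighbour_labels), 'votes')
```

The only step missing from this sketch is finding which neighbours are nearest, which is where distance comes in.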
A worked example: fruit by size and weight
Each fruit has two features — width and weight — and a label. We will classify a mystery fruit by its nearest neighbours.
from sklearn.neighbors import KNeighborsClassifier
# Features: [width_cm, weight_g]
X = [[7, 150], [7, 170], [6, 140], # apples
[8, 110], [9, 120], [8, 100]] # oranges
y = ['apple','apple','apple',
'orange','orange','orange']
model = KNeighborsClassifier(n_neighbors=3) # look at 3 neighbours
model.fit(X, y)
mystery = [[7, 160]]
print('Mystery fruit ->', model.predict(mystery)[0])
Output: Mystery fruit -> apple
Note: The mystery fruit (7 cm, 160 g) is closest to the three apples, which are heavier than the oranges. All three of its nearest neighbours are apples, so KNN votes “apple”.
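If you want to see which neighbours actually voted, scikit-learn's fitted classifier has a kneighbors method that returns the distances and the row positions of the nearest training points. The sketch below repeats the fruit data so it runs on its own; the loop that prints each neighbour is just one way to display the result.

```python
from sklearn.neighbors import KNeighborsClassifier

# Features: [width_cm, weight_g]
X = [[7, 150], [7, 170], [6, 140],   # apples
     [8, 110], [9, 120], [8, 100]]   # oranges
y = ['apple', 'apple', 'apple', 'orange', 'orange', 'orange']

model = KNeighborsClassifier(n_neighbors=3)
model.fit(X, y)

# kneighbors returns two arrays: distances, and row indices into X
distances, indices = model.kneighbors([[7, 160]])

# Look up the label of each neighbour by its row index
neighbour_labels = [y[i] for i in indices[0]]
for d, i in zip(distances[0], indices[0]):
    print(X[i], y[i], 'distance', round(float(d), 1))
```

Each printed row is one of the three voters, with its distance from the mystery fruit.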
What “closest” actually means
KNN measures distance between points, just like distance on a map. By default it uses the straight-line (Euclidean) distance, which for two features is the Pythagoras formula. Let us compute how far our mystery fruit [7, 160] is from one apple [7, 150] and one orange [8, 110] so you can see why it picks apple.
mystery = [7, 160]
def distance(a, b):
    # Straight-line distance between two points with two features each
    return ((a[0]-b[0])**2 + (a[1]-b[1])**2) ** 0.5
print('to apple [7,150]: ', round(distance(mystery, [7, 150]), 1))
print('to orange [8,110]:', round(distance(mystery, [8, 110]), 1))
Output:
to apple [7,150]:  10.0
to orange [8,110]: 50.0
Note: The apple is only 10 away but the orange is 50 away, so the apple is a much nearer neighbour. KNN does this distance calculation for every known point, keeps the k smallest, and votes. (Notice that weight dominated here: that is the scaling problem in the warning below.)
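Putting the pieces together (measure every distance, keep the k smallest, vote) gives a tiny KNN written by hand. This is a teaching sketch, not how scikit-learn implements it; the function name knn_predict is made up for this example.

```python
from collections import Counter

def distance(a, b):
    # Straight-line distance between two points with two features each
    return ((a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2) ** 0.5

def knn_predict(X, y, point, k=3):
    # Step 1: pair every known point's distance with its label
    scored = [(distance(point, x), label) for x, label in zip(X, y)]
    # Step 2: sort by distance and keep the k nearest
    nearest = sorted(scored, key=lambda pair: pair[0])[:k]
    # Step 3: majority vote among the k labels
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Features: [width_cm, weight_g]
X = [[7, 150], [7, 170], [6, 140],   # apples
     [8, 110], [9, 120], [8, 100]]   # oranges
y = ['apple', 'apple', 'apple', 'orange', 'orange', 'orange']

print('Mystery fruit ->', knn_predict(X, y, [7, 160], k=3))
```

For the mystery fruit this gives the same answer as KNeighborsClassifier did earlier, because the three smallest distances all belong to apples.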
Choosing k
The k in KNN is how many neighbours vote. It changes the answer. (In the table, an outlier means a stray, unusual data point that sits far from the rest — like one giant apple among normal ones; with k = 1 a single outlier can swing the vote.)
| k value | Behaviour | Risk |
|---|---|---|
| Small (k = 1) | Very sensitive, follows every point | Reacts to noise / outliers |
| Large (k = 15) | Very smooth, averages many points | Can blur real boundaries |
| Medium (k = 3–7) | Usually a good balance | Try a few and test |
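To watch k change the answer, here is a small sketch. It assumes one extra, made-up data point: a stray light apple [8, 112] sitting among the oranges. That point and the query [8, 113] are not part of the lesson data; they are there only to show the effect of an outlier.

```python
from sklearn.neighbors import KNeighborsClassifier

# Features: [width_cm, weight_g]
X = [[7, 150], [7, 170], [6, 140],   # apples
     [8, 110], [9, 120], [8, 100],   # oranges
     [8, 112]]                       # hypothetical stray apple (an outlier)
y = ['apple', 'apple', 'apple',
     'orange', 'orange', 'orange',
     'apple']

query = [[8, 113]]   # a new fruit right next to the outlier

predictions = {}
for k in (1, 3, 5):
    model = KNeighborsClassifier(n_neighbors=k).fit(X, y)
    predictions[k] = model.predict(query)[0]
    print('k =', k, '->', predictions[k])
```

With k = 1 the single nearest point is the stray apple, so it decides the label alone. With k = 3 or k = 5 the surrounding oranges outvote it.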
Watch out: KNN compares distances, so features on bigger scales (like weight 100–170) can drown out smaller ones (like width 6–9). Feature scaling fixes this — a topic in the last unit.
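A quick way to see the problem is to compare distances before and after squeezing each feature into the range 0 to 1. The sketch below uses simple min-max scaling written by hand, which is only one of several scaling methods. The points b and c are made up for the comparison, and the helper name min_max_scale is an assumption of this example.

```python
def distance(a, b):
    # Straight-line distance between two points with two features each
    return ((a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2) ** 0.5

def min_max_scale(point, lows, highs):
    # Map each feature onto 0..1 using that feature's own range
    return [(v - lo) / (hi - lo) for v, lo, hi in zip(point, lows, highs)]

# Smallest and largest [width_cm, weight_g] in the fruit data
lows, highs = [6, 100], [9, 170]

a = [6, 140]   # an apple from the lesson data
b = [9, 140]   # hypothetical: much wider, same weight
c = [6, 150]   # hypothetical: same width, only 10 g heavier

raw_ab = distance(a, b)
raw_ac = distance(a, c)
scaled_ab = distance(min_max_scale(a, lows, highs), min_max_scale(b, lows, highs))
scaled_ac = distance(min_max_scale(a, lows, highs), min_max_scale(c, lows, highs))

print('raw:    a-b', round(raw_ab, 2), ' a-c', round(raw_ac, 2))
print('scaled: a-b', round(scaled_ab, 2), ' a-c', round(scaled_ac, 2))
```

On the raw numbers, b looks closer to a than c does, even though b differs across the whole width range and c differs by a small slice of the weight range. After scaling, the order flips.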
Tip: KNN does almost no work when training — it just stores the data. The effort happens at predict time, when it measures distances. That makes it simple but slow on very large datasets.
Q. How does KNN decide the label of a new point?
✍️ Practice
- Change n_neighbors to 1 and re-predict the mystery fruit. Does the answer change?
- Add a mystery fruit [9, 115] and predict it. Is it an apple or an orange?
🏠 Homework
- Explain in your own words why a very large k could make KNN ignore a small but real group in the data.