Computer Vision: Teaching Machines to See
To a computer, an image is just a grid of numbers. Computer vision is the branch of AI that turns those numbers into “that’s a cat”.
What you will learn
- Explain how an image becomes numbers (pixels)
- Tell apart classification, detection and recognition
- Describe at a high level how a CNN sees
Why “seeing” is hard for a computer
You glance at a photo and instantly know it shows a cat. A computer cannot do that directly — it has no eyes and no idea what a cat is. Computer Vision (CV) is the field of AI that lets software make sense of images and video: classify what is in them, find where objects are, and recognise faces.
An image is just a grid of numbers
The first thing to understand: to a computer, a picture is a grid of pixels, and each pixel is stored as numbers. A grayscale image uses one number per pixel for its brightness (0 = black, 255 = white); a colour image uses three numbers per pixel (how much Red, Green and Blue). That is all an image really is.
# A tiny 3x3 grayscale image as numbers (0 = black, 255 = white)
image = [
[ 0, 0, 0],
[255, 255, 255],
[ 0, 0, 0],
]
# The middle row is bright (255) — that is a white horizontal stripe.
for row in image:
    print(row)
Note: Output:
[0, 0, 0]
[255, 255, 255]
[0, 0, 0]
A human “sees” a white stripe across a black square. The computer only sees these nine numbers, and computer vision is the maths that turns numbers like these into “stripe” or “cat”.
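The colour case can be sketched the same way. This is a minimal illustration, assuming each colour pixel is written as a (red, green, blue) tuple of values from 0 to 255; the helper `to_grayscale` is a hypothetical name, and plain averaging is only one simple way to get a brightness value (real tools usually weight the three channels differently).

```python
# A tiny 2x2 colour image: each pixel is (red, green, blue), each value 0-255.
colour_image = [
    [(255, 0, 0), (0, 255, 0)],      # a red pixel, a green pixel
    [(0, 0, 255), (255, 255, 255)],  # a blue pixel, a white pixel
]

def to_grayscale(rgb_image):
    """Average the three channels of each pixel into one brightness number.

    Assumption: a plain average, chosen here only for simplicity.
    """
    return [[sum(pixel) // 3 for pixel in row] for row in rgb_image]

gray = to_grayscale(colour_image)
for row in gray:
    print(row)
```

A 2x2 colour image therefore holds 2 x 2 x 3 = 12 numbers, while its grayscale version holds only 4.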
The three core vision tasks
| Task | The question it answers | Everyday example |
|---|---|---|
| Image classification | “What is in this picture?” (one label) | Is this photo a cat or a dog? |
| Object detection | “What objects are here, and where?” (labels + boxes) | A self-driving car boxing pedestrians and signs |
| Face recognition | “Whose face is this?” | Phone face-unlock, photo-app tagging |
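In short: classification gives the picture one label, detection draws labelled boxes around several objects, and recognition matches a face to a specific identity. Roughly speaking, each task asks for more than the one before it: a single label, then labels plus locations, then one particular person.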
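One way to make the difference concrete is to look at the shape of the answer each task returns. The sketch below is illustrative only: the `Box` and `Detection` classes, the coordinates, and the name "Alice" are all made up for this example and do not come from any particular library.

```python
from dataclasses import dataclass

@dataclass
class Box:
    """A rectangle in pixel coordinates: top-left corner plus size."""
    x: int
    y: int
    width: int
    height: int

@dataclass
class Detection:
    """One detected object: what it is and where it is."""
    label: str
    box: Box

# Classification: one label for the whole picture.
classification_result = "cat"

# Detection: several labelled boxes, one per object found (made-up values).
detection_result = [
    Detection("pedestrian", Box(x=40, y=60, width=30, height=80)),
    Detection("stop sign", Box(x=200, y=20, width=25, height=25)),
]

# Recognition: a specific identity, not just the category "face".
recognition_result = {"identity": "Alice"}  # hypothetical person

for d in detection_result:
    print(d.label, "at", (d.box.x, d.box.y), "size", (d.box.width, d.box.height))
```

Classification returns a single value, detection returns a list that grows with the number of objects, and recognition returns who the face belongs to.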
How a CNN “sees”
The workhorse of computer vision is the CNN (Convolutional Neural Network) you met earlier. Here is the intuition for how it turns that grid of numbers into a label — built up in layers:
- It slides small filters (tiny grids of weights) across the image. Each filter lights up when it finds a particular tiny pattern, like an edge or a corner.
- The next layer combines those edges into simple shapes — curves, circles, textures.
- Deeper layers combine shapes into parts — an eye, an ear, a wheel.
- The final layer combines the parts into a whole-object decision — “these two eyes + whiskers + pointy ears → cat.”
So a CNN builds understanding from small to big: pixels → edges → shapes → parts → object. Crucially, it learns which filters to use from thousands of labelled images — nobody programs “a cat has whiskers” by hand.
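The first step, sliding a small filter across the image, can be sketched in plain Python. This is a toy illustration under stated assumptions: the filter values are hand-picked to respond to a vertical edge (a real CNN learns its filter values from data), and `slide_filter` is a hypothetical helper name. As in most CNN libraries, the filter is applied as-is, without flipping it.

```python
# 5x5 grayscale image: dark on the left, bright on the right (a vertical edge).
image = [
    [0, 0, 255, 255, 255],
    [0, 0, 255, 255, 255],
    [0, 0, 255, 255, 255],
    [0, 0, 255, 255, 255],
    [0, 0, 255, 255, 255],
]

# Hand-picked 3x3 filter: large response where the right side of the
# window is brighter than the left side, zero where the window is flat.
vertical_edge = [
    [-1, 0, 1],
    [-1, 0, 1],
    [-1, 0, 1],
]

def slide_filter(image, kernel):
    """Slide `kernel` over `image` (no padding) and return the response grid."""
    k = len(kernel)
    out_rows = len(image) - k + 1
    out_cols = len(image[0]) - k + 1
    output = []
    for r in range(out_rows):
        row = []
        for c in range(out_cols):
            # Multiply the window under the filter by the weights and add up.
            total = 0
            for i in range(k):
                for j in range(k):
                    total += image[r + i][c + j] * kernel[i][j]
            row.append(total)
        output.append(row)
    return output

response = slide_filter(image, vertical_edge)
for row in response:
    print(row)
```

Each row of the response is `[765, 765, 0]`: the filter "lights up" in the two window positions that straddle the dark-to-bright edge and stays at zero over the uniformly bright area. Later layers would take grids like this one as their input.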
# What using a trained vision model looks like (concept, not run here)
# A pre-trained model takes the pixel grid and returns label + confidence.
prediction = vision_model.predict(image)
print(prediction)
Note: Output: {'label': 'cat', 'confidence': 0.97}
The model reports 97% confidence that the image is a cat. Notice we did not write any rules about ears or whiskers: the CNN learned the visual pattern of “cat” from many example photos, just as the model in the spam example learned “spam”.
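Where might a confidence number come from? One common approach, assumed here rather than taken from this lesson, is that the model's final layer produces one raw score per label and a softmax function turns those scores into values that add up to 1. The scores and labels below are made up for illustration.

```python
import math

def softmax(scores):
    """Turn raw scores into positive values that sum to 1."""
    highest = max(scores)
    # Subtracting the highest score first keeps exp() from overflowing.
    exps = [math.exp(s - highest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

labels = ["cat", "dog", "rabbit"]
raw_scores = [4.0, 0.5, -1.0]  # made-up scores, one per label

probabilities = softmax(raw_scores)

# The prediction is the label with the largest value.
best = max(range(len(labels)), key=lambda i: probabilities[i])
prediction = {"label": labels[best], "confidence": round(probabilities[best], 2)}
print(prediction)
```

A high confidence means only that "cat" scored far above the other labels the model knows about. It is not a guarantee that the answer is right.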
Watch out: Computer vision can fail in surprising ways: a model that has only seen daytime photos may misread night scenes, and tiny, deliberate changes to pixels (called adversarial examples) can fool it into seeing the wrong thing. In safety-critical uses like self-driving, this matters a lot.
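The adversarial idea can be shown on a toy scale. The sketch below is not a real vision model: it assumes a hypothetical four-pixel "classifier" that multiplies each pixel by a fixed weight and says "cat" when the total is above zero. Nudging every pixel a small amount in the direction that lowers the score is enough to flip the answer.

```python
# Hypothetical toy classifier: one made-up weight per pixel.
weights = [0.5, -0.25, 0.75, -0.5]

def score(pixels):
    """Weighted sum of the pixel values."""
    return sum(w * p for w, p in zip(weights, pixels))

def classify(pixels):
    return "cat" if score(pixels) > 0 else "not cat"

original = [100, 120, 90, 130]  # a made-up four-pixel "image"

# Shift each pixel by 12 (out of 255) against the sign of its weight,
# so that every small change pushes the score downward.
epsilon = 12
adversarial = [p - epsilon if w > 0 else p + epsilon
               for p, w in zip(original, weights)]

print(original, score(original), classify(original))
print(adversarial, score(adversarial), classify(adversarial))
```

The score moves from 22.5 to -1.5, so the label changes from "cat" to "not cat" even though no pixel moved by more than 12 brightness levels. Real images have far more pixels, which gives many more small changes to add up.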
Tip: Connect it all: CV is just machine learning applied to the pixel grid. The image-as-numbers idea is the bridge — once a picture is numbers, the same “learn the pattern from examples” engine you already know does the rest.
Q. What does a computer actually receive when you give it a photo?
✍️ Practice
- Write a 3x3 grid of numbers that would look like a vertical white stripe down the middle.
- Label each as classification, detection, or recognition: “is this email’s photo a meme?”, “unlock my phone with my face”, “box every car in a street photo”.
🏠 Homework
- Find three apps on your phone that use computer vision and write one sentence each on what they do with images.