Machine Learning for AI · Extra · 40 min read

Computer Vision: Teaching Machines to See

To a computer an image is just a grid of numbers — computer vision is the AI that turns those numbers into “that’s a cat”.

What you will learn

  • Explain how an image becomes numbers (pixels)
  • Tell apart classification, detection and recognition
  • Describe at a high level how a CNN sees

Why “seeing” is hard for a computer

You glance at a photo and instantly know it shows a cat. A computer cannot do that directly — it has no eyes and no idea what a cat is. Computer Vision (CV) is the field of AI that lets software make sense of images and video: classify what is in them, find where objects are, and recognise faces.

An image is just a grid of numbers

The first thing to understand: to a computer, a picture is a grid of pixels, and each pixel is stored as one or more numbers. A grayscale image uses one number per pixel for its brightness (in the common 8-bit format, 0 = black and 255 = white); a colour image uses three numbers per pixel (how much Red, Green and Blue). That is all an image really is.

Every image is a grid of pixel numbers — this is what the AI actually receives
# A tiny 3x3 grayscale image as numbers (0 = black, 255 = white)
image = [
    [  0,   0,   0],
    [255, 255, 255],
    [  0,   0,   0],
]

# The middle row is bright (255) — that is a white horizontal stripe.
for row in image:
    print(row)

Note: Output:
[0, 0, 0]
[255, 255, 255]
[0, 0, 0]
A human “sees” a white stripe across a black square. The computer only sees these nine numbers, and computer vision is the maths that turns numbers like these into “stripe” or “cat”.
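The same idea extends to colour. As a minimal sketch (the variable names are illustrative, not taken from any library), each pixel becomes a triple of Red, Green and Blue values instead of a single brightness value:

```python
# A tiny 2x2 colour image: each pixel is (Red, Green, Blue), each value 0-255.
# The names below are illustrative only.
RED = (255, 0, 0)
GREEN = (0, 255, 0)
BLUE = (0, 0, 255)
WHITE = (255, 255, 255)

colour_image = [
    [RED, GREEN],
    [BLUE, WHITE],
]

height = len(colour_image)
width = len(colour_image[0])
channels = len(colour_image[0][0])

# Total count of numbers the computer receives for this picture.
total_numbers = height * width * channels

print(height, width, channels)  # 2 2 3
print(total_numbers)            # 12
```

Even this 2x2 picture is already twelve numbers, which hints at why real photos with millions of pixels need a lot of computation.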

The three core vision tasks

Task | The question it answers | Everyday example
Image classification | “What is in this picture?” (one label) | Is this photo a cat or a dog?
Object detection | “What objects are here, and where?” (labels + boxes) | A self-driving car boxing pedestrians and signs
Face recognition | “Whose face is this?” | Phone face-unlock, photo-app tagging

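In short: classification gives the picture one label, detection draws labelled boxes around several objects, and recognition matches a face to a specific identity. Going down the list, each task asks for a more specific answer: first a single label, then labels plus locations, then one particular person.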
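One way to feel the difference is to look at the shape of each answer. The sketch below uses made-up values for a single street photo; the field names (`label`, `box`, `identity`) are assumptions chosen for illustration, not the output format of any particular library.

```python
# Hypothetical outputs for one street photo. Every value is invented
# to show the shape of each answer, not produced by a real model.

# Classification: one label for the whole picture.
classification = "street"

# Detection: several labels, each with a box (x, y, width, height) in pixels.
detection = [
    {"label": "car", "box": (40, 120, 200, 90)},
    {"label": "pedestrian", "box": (300, 100, 40, 110)},
]

# Face recognition: a face matched to one specific identity.
recognition = {"identity": "phone_owner", "match": True}

print(classification)
for found in detection:
    print(found["label"], found["box"])
print(recognition["identity"])
```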

How a CNN “sees”

The workhorse of computer vision is the CNN (Convolutional Neural Network) you met earlier. Here is the intuition for how it turns that grid of numbers into a label — built up in layers:

  1. It slides small filters (tiny grids of weights) across the image. Each filter lights up when it finds a particular tiny pattern, like an edge or a corner.
  2. The next layer combines those edges into simple shapes — curves, circles, textures.
  3. Deeper layers combine shapes into parts — an eye, an ear, a wheel.
  4. The final layer combines the parts into a whole-object decision — “these two eyes + whiskers + pointy ears → cat.”

So a CNN builds understanding from small to big: pixels → edges → shapes → parts → object. Crucially, it learns which filters to use from thousands of labelled images — nobody programs “a cat has whiskers” by hand.
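Step 1 can be sketched in plain Python. This is a simplified illustration, assuming a hand-written filter and a helper named `slide_filter` (both invented here); a real CNN learns its filter weights and uses an optimised library rather than nested loops.

```python
# A 5x5 grayscale image: black on top, white on the bottom.
image = [
    [0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0],
    [255, 255, 255, 255, 255],
    [255, 255, 255, 255, 255],
    [255, 255, 255, 255, 255],
]

# A hand-written 3x3 filter that responds to "dark above, bright below".
# In a real CNN these weights are learned, not typed in.
edge_filter = [
    [-1, -1, -1],
    [0, 0, 0],
    [1, 1, 1],
]


def slide_filter(image, kernel):
    """Slide the kernel over the image; at each position, multiply the
    overlapping values and add them up."""
    k = len(kernel)
    out_height = len(image) - k + 1
    out_width = len(image[0]) - k + 1
    output = []
    for top in range(out_height):
        row = []
        for left in range(out_width):
            total = 0
            for i in range(k):
                for j in range(k):
                    total += image[top + i][left + j] * kernel[i][j]
            row.append(total)
        output.append(row)
    return output


feature_map = slide_filter(image, edge_filter)
for row in feature_map:
    print(row)
# [765, 765, 765]
# [765, 765, 765]
# [0, 0, 0]
```

The large values appear where the filter sits across the black-to-white boundary, and the zeros appear where it sits over plain white. That is what "lighting up" on an edge means in numbers.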

Using a trained CNN: pixels in, a labelled guess out
# What using a trained vision model looks like (concept, not run here).
# `vision_model` is a placeholder for any pre-trained classifier, not a real library object.
# It takes the pixel grid and returns a label plus a confidence score.
prediction = vision_model.predict(image)

print(prediction)

Note: Output:
{'label': 'cat', 'confidence': 0.97}
The model reports 97% confidence that the image is a cat. Notice we did not write any rules about ears or whiskers: the CNN learned the visual pattern of “cat” from many example photos, just as the earlier spam example learned to recognise “spam”.
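To make the `predict` call concrete, here is a runnable toy stand-in. It is not a CNN: the hypothetical `ToyVisionModel` class simply finds the labelled example whose pixels are closest to the new image, and its confidence is a made-up score rather than a calibrated probability. It still shows the same interface: examples in, then pixels in and a labelled guess out.

```python
def flatten(image):
    """Turn a grid of pixel rows into one flat list of numbers."""
    return [value for row in image for value in row]


def distance(a, b):
    """Sum of absolute pixel differences between two same-sized images."""
    return sum(abs(x - y) for x, y in zip(flatten(a), flatten(b)))


class ToyVisionModel:
    """Hypothetical nearest-example classifier (a stand-in, not a CNN)."""

    def __init__(self):
        self.examples = []  # list of (image, label) pairs

    def fit(self, labelled_images):
        self.examples = list(labelled_images)

    def predict(self, image):
        # Smallest distance to any stored example, per label.
        # Assumes the examples cover at least two different labels.
        best = {}
        for example, label in self.examples:
            d = distance(image, example)
            if label not in best or d < best[label]:
                best[label] = d
        ranked = sorted(best.items(), key=lambda item: item[1])
        (label, nearest), (_, runner_up) = ranked[0], ranked[1]
        total = nearest + runner_up
        # Made-up score: near 1.0 when the winner is much closer.
        confidence = 0.5 if total == 0 else runner_up / total
        return {"label": label, "confidence": round(confidence, 2)}


horizontal = [[0, 0, 0], [255, 255, 255], [0, 0, 0]]
vertical = [[0, 255, 0], [0, 255, 0], [0, 255, 0]]

vision_model = ToyVisionModel()
vision_model.fit([
    (horizontal, "horizontal stripe"),
    (vertical, "vertical stripe"),
])

# A slightly noisy horizontal stripe the model has not seen before.
noisy = [[0, 0, 0], [250, 255, 240], [10, 0, 0]]
prediction = vision_model.predict(noisy)
print(prediction)  # {'label': 'horizontal stripe', 'confidence': 0.97}
```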

Watch out: Computer vision can fail in surprising ways: a model that has only seen daytime photos may misread night scenes, and tiny, deliberate changes to pixels (called adversarial examples) can fool it into seeing the wrong thing. In safety-critical uses like self-driving, this matters a lot.
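The adversarial idea can be illustrated with a deliberately simple toy. The sketch below assumes a hypothetical linear scorer with checkerboard weights, which is far simpler than a CNN; the labels and the `classify` helper are invented for the example. It shows how many tiny per-pixel nudges, each barely visible, can add up to flip a decision.

```python
SIZE = 10  # a 10x10 grayscale image, so 100 pixels

# Hypothetical "learned" weights: +1 and -1 in a checkerboard pattern.
weights = [
    [1 if (r + c) % 2 == 0 else -1 for c in range(SIZE)]
    for r in range(SIZE)
]


def classify(image):
    """Toy linear scorer: weighted sum of pixels, positive means 'cat'."""
    score = sum(
        weights[r][c] * image[r][c]
        for r in range(SIZE)
        for c in range(SIZE)
    )
    return ("cat" if score > 0 else "not cat"), score


# A mid-gray image: every pixel is 1 step above or below 128.
original = [
    [128 + weights[r][c] for c in range(SIZE)]
    for r in range(SIZE)
]

# Nudge every pixel 2 steps (out of 255) against its weight.
STEP = 2
adversarial = [
    [original[r][c] - STEP * weights[r][c] for c in range(SIZE)]
    for r in range(SIZE)
]

largest_change = max(
    abs(adversarial[r][c] - original[r][c])
    for r in range(SIZE)
    for c in range(SIZE)
)

print(classify(original))     # ('cat', 100)
print(classify(adversarial))  # ('not cat', -100)
print(largest_change)         # 2
```

To a person both images look like the same flat gray square, yet the toy scorer changes its answer. Real attacks on real models are more involved, but they rely on the same effect of small changes accumulating across many pixels.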

Tip: Connect it all: CV is just machine learning applied to the pixel grid. The image-as-numbers idea is the bridge — once a picture is numbers, the same “learn the pattern from examples” engine you already know does the rest.

Q. What does a computer actually receive when you give it a photo?

Answer: An image is stored as a grid of pixel numbers (one value for grayscale, three R/G/B values for colour). Computer vision is the maths that turns those numbers into meaning.

✍️ Practice

  1. Write a 3x3 grid of numbers that would look like a vertical white stripe down the middle.
  2. Label each as classification, detection, or recognition: “is this email’s photo a meme?”, “unlock my phone with my face”, “box every car in a street photo”.

🏠 Homework

  1. Find three apps on your phone that use computer vision and write one sentence each on what they do with images.