
Encoding Categories & Feature Engineering

Models only do maths on numbers — turn words like “red” and “Mumbai” into numbers the right way.

What you will learn

Explain why text categories must be encoded
Use one-hot and label encoding correctly
Create a useful new feature from existing ones

Models cannot read words

Every model so far ate numbers. But real data is full of categories — words like red / green / blue, or Mumbai / Delhi / Chennai. A model cannot multiply the word “Mumbai”, so we must turn categories into numbers first. That conversion is called encoding.

The trap: do not just number them

The obvious idea — red = 0, green = 1, blue = 2 — is usually wrong. The model would think blue (2) is “twice” green (1) and “more than” red (0), inventing an order that does not exist. For unordered categories this misleads the model badly.
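To make the invented order concrete, here is a minimal sketch in plain Python. The mapping `naive_codes` and the helper `gap` are hypothetical names used only for illustration; they show the numeric gaps a model would see if it treated the codes as real quantities.

```python
# Hypothetical naive numbering of an unordered category.
naive_codes = {'red': 0, 'green': 1, 'blue': 2}

def gap(a, b):
    """Numeric distance between two colours under the naive coding."""
    return abs(naive_codes[a] - naive_codes[b])

# The codes claim blue is twice as far from red as green is,
# although the colours themselves have no such relationship.
print('red vs green:', gap('red', 'green'))  # red vs green: 1
print('red vs blue:', gap('red', 'blue'))    # red vs blue: 2
```

Nothing about the colours justifies those distances; they come purely from the arbitrary numbering.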

One-hot encoding: one column per category

The fix is one-hot encoding: make a separate 0/1 column for each category. A row gets a 1 in the column for its category and 0 everywhere else — so no fake ordering is implied. Here is the idea as a table for the colours red, green, blue:

Original colour	is_red	is_green	is_blue
red	1	0	0
green	0	1	0
blue	0	0	1
green	0	1	0

Now no colour is “bigger” than another — each is just on or off. scikit-learn’s OneHotEncoder builds these columns for you:

One-hot encoding turns one colour column into three 0/1 columns

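from sklearn.preprocessing import OneHotEncoder

# Each inner list is one row with a single feature: the colour.
colours = [['red'], ['green'], ['blue'], ['green']]

# sparse_output=False returns a plain NumPy array (scikit-learn 1.2+;
# older versions use sparse=False instead).
enc = OneHotEncoder(sparse_output=False)
encoded = enc.fit_transform(colours)

# categories_ holds the learned categories for each input column.
print('Categories:', list(enc.categories_[0]))
for original, row in zip(colours, encoded):
    print(original[0], '->', [int(v) for v in row])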

Note: Output:

Categories: ['blue', 'green', 'red']
red -> [0, 0, 1]
green -> [0, 1, 0]
blue -> [1, 0, 0]
green -> [0, 1, 0]

Each colour became a row of 0s with a single 1 marking its category (the columns are in alphabetical order: blue, green, red). No fake ordering: exactly what we wanted.

When numbering IS fine: ordered categories

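If the categories have a real order, like small < medium < large, then numbering them 0, 1, 2 is correct, because the order is genuine. This is ordinal encoding (often loosely called label encoding), and it is right only when an order truly exists. Make sure the numbers follow the real order: scikit-learn's encoders sort categories alphabetically by default, so low / medium / high would become high = 0, low = 1, medium = 2 unless you specify the order yourself. Rule of thumb: ordered → label-encode; unordered → one-hot.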
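A minimal sketch of encoding an ordered category in plain Python, using the three sizes from the paragraph above. The names `size_order`, `size_to_code`, and `sizes` are assumptions made for this illustration; the key point is that we write the order down ourselves instead of letting it be inferred.

```python
# Explicit order chosen by us, smallest first (illustrative names).
size_order = ['small', 'medium', 'large']

# Map each size to its position in that order.
size_to_code = {size: code for code, size in enumerate(size_order)}

sizes = ['medium', 'small', 'large', 'small']
encoded_sizes = [size_to_code[s] for s in sizes]

print(size_to_code)   # {'small': 0, 'medium': 1, 'large': 2}
print(encoded_sizes)  # [1, 0, 2, 0]
```

scikit-learn's OrdinalEncoder accepts a `categories` argument that serves the same purpose when you want to stay inside the library.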

Feature engineering: building better inputs

Feature engineering is creating new, more useful features from the ones you have. A model can only learn from the columns you give it, so a smart new column often helps more than a fancier algorithm. Examples:

From a date of birth, compute age — usually far more predictive than the raw date.
From height and weight, compute BMI (weight in kg ÷ height in metres, squared), giving one column that captures both.
From a price and a discount %, compute the final price the customer actually pays.

A quick worked example: turn height and weight into a single BMI feature.

Engineering a BMI feature from height and weight

people = [
    {'height_m': 1.70, 'weight_kg': 65},
    {'height_m': 1.80, 'weight_kg': 90},
]
for p in people:
    bmi = p['weight_kg'] / (p['height_m'] ** 2)
    print('height', p['height_m'], 'weight', p['weight_kg'],
          '-> BMI', round(bmi, 1))

Note: Output:

height 1.7 weight 65 -> BMI 22.5
height 1.8 weight 90 -> BMI 27.8

We combined two raw columns into one meaningful feature. For predicting, say, a health risk, BMI may carry the signal far more directly than height and weight separately.
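The other two ideas from the list of examples (age from a date of birth, and final price from a price and a discount %) can be sketched the same way. The function names, the sample values, and the fixed reference date below are assumptions chosen for illustration, not part of the lesson's dataset.

```python
from datetime import date

def age_in_years(date_of_birth, today):
    """Whole years between date_of_birth and today."""
    years = today.year - date_of_birth.year
    # Subtract one if this year's birthday has not happened yet.
    if (today.month, today.day) < (date_of_birth.month, date_of_birth.day):
        years -= 1
    return years

def final_price(price, discount_percent):
    """Price the customer pays after a percentage discount."""
    return price * (1 - discount_percent / 100)

# A fixed reference date keeps the result reproducible.
reference_day = date(2024, 6, 1)
print(age_in_years(date(1990, 8, 15), reference_day))  # 33
print(final_price(200.0, 25))                          # 150.0
```

In real use you would probably pass `date.today()` as the reference date, so that age stays current as the data gets older.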

Watch out: One-hot encoding a column with hundreds of categories (like every postcode) explodes into hundreds of columns and can slow or confuse the model. For very high-cardinality columns, group rare values together or use a smarter encoding.
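One way to group rare values, sketched in plain Python. The helper `group_rare`, its `min_count` threshold, the 'other' label, and the sample postcodes are all assumptions made for this illustration; the right threshold depends on your data.

```python
from collections import Counter

def group_rare(values, min_count=2, other_label='other'):
    """Replace categories seen fewer than min_count times with other_label."""
    counts = Counter(values)
    return [v if counts[v] >= min_count else other_label for v in values]

# Made-up sample column: one postcode appears only once.
postcodes = ['400001', '400001', '110001', '600001', '400001', '110001']
print(group_rare(postcodes))
# ['400001', '400001', '110001', 'other', '400001', '110001']
```

After grouping, the column has three distinct values instead of four, so one-hot encoding it produces fewer columns. On a real postcode column the reduction can be far larger.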

Tip: Good feature engineering often beats a fancier model. Spend time creating columns that capture real-world meaning — age from a date, ratios, totals — before reaching for a more complex algorithm.

Q. Why is one-hot encoding usually better than numbering unordered categories 0, 1, 2?

Answer: Numbering unordered categories makes the model think one is greater than another. One-hot encoding gives each category its own 0/1 column, so no false ordering is introduced.

✍️ Practice

One-hot encode a city column with values Mumbai, Delhi, Mumbai, Chennai and print the resulting rows.
Decide for each whether to one-hot or label-encode: t-shirt size (S/M/L); blood type (A/B/O); satisfaction (low/medium/high).

🏠 Homework

Invent a 5-row dataset with at least one category column and one pair of columns you could combine, then describe how you would encode the category and what new feature you would engineer.