Encoding Categories & Feature Engineering
Models only do maths on numbers — turn words like “red” and “Mumbai” into numbers the right way.
What you will learn
- Explain why text categories must be encoded
- Use one-hot and label encoding correctly
- Create a useful new feature from existing ones
Models cannot read words
Every model so far ate numbers. But real data is full of categories — words like red / green / blue, or Mumbai / Delhi / Chennai. A model cannot multiply the word “Mumbai”, so we must turn categories into numbers first. That conversion is called encoding.
The trap: do not just number them
The obvious idea — red = 0, green = 1, blue = 2 — is usually wrong. The model would think blue (2) is “twice” green (1) and “more than” red (0), inventing an order that does not exist. For unordered categories this misleads the model badly.
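A small sketch of why the numbering misleads, assuming the made-up codes red = 0, green = 1, blue = 2 from the paragraph above. Any arithmetic on the codes produces claims that mean nothing for colours:

```python
# Hypothetical integer codes for an unordered category (the naive approach).
colour_code = {'red': 0, 'green': 1, 'blue': 2}

# A model does arithmetic on its inputs, so it "sees" relationships like these.
average_of_red_and_blue = (colour_code['red'] + colour_code['blue']) / 2
gap_red_green = abs(colour_code['red'] - colour_code['green'])
gap_red_blue = abs(colour_code['red'] - colour_code['blue'])

# The average of red and blue lands exactly on green's code.
print('average of red and blue:', average_of_red_and_blue)
# Red looks twice as far from blue as it is from green.
print('red-green gap:', gap_red_green, '| red-blue gap:', gap_red_blue)
```

Neither result says anything true about colours; both come only from the arbitrary numbers we picked.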
One-hot encoding: one column per category
The fix is one-hot encoding: make a separate 0/1 column for each category. A row gets a 1 in the column for its category and 0 everywhere else — so no fake ordering is implied. Here is the idea as a table for the colours red, green, blue:
| Original colour | is_red | is_green | is_blue |
|---|---|---|---|
| red | 1 | 0 | 0 |
| green | 0 | 1 | 0 |
| blue | 0 | 0 | 1 |
| green | 0 | 1 | 0 |
Now no colour is “bigger” than another — each is just on or off. scikit-learn’s OneHotEncoder builds these columns for you:
from sklearn.preprocessing import OneHotEncoder

# Each inner list is one row with a single column (the colour).
colours = [['red'], ['green'], ['blue'], ['green']]

# sparse_output=False returns a plain array (needs scikit-learn 1.2 or newer).
enc = OneHotEncoder(sparse_output=False)
encoded = enc.fit_transform(colours)

print('Categories:', list(enc.categories_[0]))
for original, row in zip(colours, encoded):
    print(original[0], '->', [int(v) for v in row])
Note: Output:
Categories: ['blue', 'green', 'red']
red -> [0, 0, 1]
green -> [0, 1, 0]
blue -> [1, 0, 0]
green -> [0, 1, 0]
Each colour became a row of 0s with a single 1 marking its category (the columns are in alphabetical order: blue, green, red). No fake ordering is implied, which is exactly what we wanted.
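To see what the encoder is doing, here is a minimal pure-Python sketch of the same idea. The helper name one_hot is made up for illustration, and this is a simplified picture of the technique, not how scikit-learn implements it:

```python
colours = ['red', 'green', 'blue', 'green']

# Sorted unique values become the column order (alphabetical, as above).
categories = sorted(set(colours))

def one_hot(value, categories):
    """Return a list with 1 at the value's category position and 0 elsewhere."""
    return [1 if value == category else 0 for category in categories]

rows = [one_hot(colour, categories) for colour in colours]
for colour, row in zip(colours, rows):
    print(colour, '->', row)
```

Every row contains exactly one 1, so no category is numerically larger than another.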
When numbering IS fine: ordered categories
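If the categories have a real order, like small < medium < large, then numbering them 0, 1, 2 is correct, because the order is genuine. This is ordinal encoding (often loosely called label encoding), and it is right only when an order truly exists. Make sure the numbers follow the real order: codes assigned alphabetically would give large = 0, medium = 1, small = 2, which scrambles it. Rule of thumb: ordered → label-encode; unordered → one-hot.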
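A minimal sketch of encoding an ordered category, assuming a hypothetical size column. The mapping is written out by hand so the codes follow the real order; scikit-learn's OrdinalEncoder accepts an explicit categories list for the same purpose.

```python
# Explicit mapping, written in the real-world order (smallest first).
size_order = {'small': 0, 'medium': 1, 'large': 2}

sizes = ['medium', 'small', 'large', 'small']
encoded_sizes = [size_order[s] for s in sizes]
print(encoded_sizes)

# For comparison: codes handed out alphabetically put 'large' first,
# which no longer matches small < medium < large.
alphabetical = {name: code for code, name in enumerate(sorted(size_order))}
print(alphabetical)
```

The second mapping is the one to avoid: the numbers exist, but they no longer match the order of the sizes.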
Feature engineering: building better inputs
Feature engineering is creating new, more useful features from the ones you have. A model can only learn from the columns you give it, so a smart new column often helps more than a fancier algorithm. Examples:
- From a date of birth, compute age — usually far more predictive than the raw date.
- From height and weight, compute BMI (weight ÷ height²) — one column capturing both.
- From a price and a discount %, compute the final price the customer actually pays.
A quick worked example: turn height and weight into a single BMI feature.
people = [
    {'height_m': 1.70, 'weight_kg': 65},
    {'height_m': 1.80, 'weight_kg': 90},
]
for p in people:
    # BMI = weight in kilograms divided by the square of height in metres
    bmi = p['weight_kg'] / (p['height_m'] ** 2)
    print('height', p['height_m'], 'weight', p['weight_kg'],
          '-> BMI', round(bmi, 1))
Note: Output:
height 1.7 weight 65 -> BMI 22.5
height 1.8 weight 90 -> BMI 27.8
We combined two raw columns into one meaningful feature. For predicting, say, a health risk, BMI may carry the signal far more directly than height and weight separately.
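The other two examples from the list, age from a date of birth and final price from a price and a discount %, can be sketched the same way. The column names, the sample rows, and the fixed reference date are all made up for illustration:

```python
from datetime import date

def age_in_years(date_of_birth, today):
    """Whole years between date_of_birth and today."""
    had_birthday = (today.month, today.day) >= (date_of_birth.month, date_of_birth.day)
    return today.year - date_of_birth.year - (0 if had_birthday else 1)

def final_price(price, discount_percent):
    """Price the customer actually pays after the percentage discount."""
    return round(price * (1 - discount_percent / 100), 2)

# A fixed, made-up reference date keeps the result reproducible.
reference_day = date(2024, 6, 1)

orders = [
    {'date_of_birth': date(1990, 3, 15), 'price': 200.0, 'discount_pct': 25},
    {'date_of_birth': date(2000, 12, 31), 'price': 80.0, 'discount_pct': 10},
]
for order in orders:
    # Add the two engineered features as new columns on each row.
    order['age'] = age_in_years(order['date_of_birth'], reference_day)
    order['final_price'] = final_price(order['price'], order['discount_pct'])
    print('age', order['age'], '| final price', order['final_price'])
```

In real use the reference date would typically be the date the data was collected, so that ages stay consistent across rows.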
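Watch out: One-hot encoding a column with hundreds of categories (like every postcode) explodes into hundreds of mostly-zero columns, which slows training and leaves too few rows per column for the model to learn from. For very high-cardinality columns, group rare values into a single “other” category or use a more compact encoding, such as frequency or target encoding.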
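One way to group rare values, sketched under assumptions: the postcodes below are invented, and the threshold of 2 occurrences is an arbitrary choice that would need tuning on real data.

```python
from collections import Counter

# Made-up postcode column with a few frequent and a few rare values.
postcodes = ['400001', '400001', '110001', '400001', '600001', '110001', '560001']
min_count = 2  # assumed threshold: values seen fewer times are grouped

counts = Counter(postcodes)
grouped = [p if counts[p] >= min_count else 'other' for p in postcodes]

print('before:', len(set(postcodes)), 'categories')
print('after: ', len(set(grouped)), 'categories ->', sorted(set(grouped)))
```

The grouped column can then be one-hot encoded as usual, producing fewer columns than the original would.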
Tip: Good feature engineering often beats a fancier model. Spend time creating columns that capture real-world meaning — age from a date, ratios, totals — before reaching for a more complex algorithm.
Q. Why is one-hot encoding usually better than numbering unordered categories 0, 1, 2?
✍️ Practice
- One-hot encode a city column with values Mumbai, Delhi, Mumbai, Chennai and print the resulting rows.
- Decide for each whether to one-hot or label-encode: t-shirt size (S/M/L); blood type (A/B/O); satisfaction (low/medium/high).
🏠 Homework
- Invent a 5-row dataset with at least one category column and one pair of columns you could combine, then describe how you would encode the category and what new feature you would engineer.