Feature Engineering: Encoding & Scaling
Models only understand numbers, and they learn best when those numbers are on a fair footing, so you turn raw columns into model-ready features by encoding text and scaling numbers.
What you will learn
- Explain why a model needs numeric, scaled inputs
- One-hot encode a category column with get_dummies
- Scale numbers to a common range with StandardScaler
Clean data is not yet model-ready
Cleaning fixes wrong data: blanks, duplicates, bad types. Feature engineering is the next step: reshaping correct data into the form a model can learn from. A feature is just an input column the model uses to predict. In many real projects, better features improve accuracy more than a fancier model does.
Models do maths, so they hit two walls with raw data: (1) they cannot read text categories like “Pune” or “Delhi”, and (2) they get confused when one column is in thousands and another in single digits. We fix the first with encoding and the second with scaling.
Problem 1: a model cannot read text
Take a city column with values like “Pune” and “Delhi”. A model needs numbers. The naive fix (Pune=0, Delhi=1, Mumbai=2) is a trap: it tells the model that Mumbai (2) ranks above Delhi (1) and sits twice as far from Pune (0) as Delhi does, which is nonsense. The right fix is one-hot encoding: make one yes/no (1/0) column per category.
import pandas as pd
df = pd.DataFrame({
    'city': ['Pune', 'Delhi', 'Pune', 'Mumbai'],
    'price': [60, 120, 65, 150]
})
# Replace the text column 'city' with one True/False column per category
encoded = pd.get_dummies(df, columns=['city'])
print(encoded)
Note: Output:
   price  city_Delhi  city_Mumbai  city_Pune
0     60       False        False       True
1    120        True        False      False
2     65       False        False       True
3    150       False         True      False
get_dummies replaced the one city column with three yes/no columns. Row 0 is Pune, so only city_Pune is True (1). No false ranking is implied — each city is just “on or off”. (True/False count as 1/0 for the model.)
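If you would rather see 1/0 than True/False, get_dummies accepts a dtype argument. The sketch below reuses the same example DataFrame; it is one option, not a required step, since the model treats both forms the same way.

```python
import pandas as pd

df = pd.DataFrame({
    'city': ['Pune', 'Delhi', 'Pune', 'Mumbai'],
    'price': [60, 120, 65, 150]
})

# dtype=int asks get_dummies for 1/0 columns instead of True/False
encoded = pd.get_dummies(df, columns=['city'], dtype=int)
print(encoded)
```

Each row has exactly one 1 across the three city columns, which is where the name "one-hot" comes from.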
Problem 2: columns on wildly different scales
Imagine predicting from age (around 20–60) and income (around 20,000–90,000). Many models judge “distance” between rows, and income’s big numbers would drown out age completely — not because age matters less, but because its numbers are smaller. Scaling rescales every column to a comparable range so each gets a fair say.
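To see the "drowning out" in numbers, here is a small sketch with three made-up rows (the values are assumptions chosen for illustration). It measures the straight-line distance between rows, which is how distance-based models compare them.

```python
import numpy as np

# Hypothetical rows in the form [age, income]
a = np.array([20, 50000])
b = np.array([60, 50000])   # 40 years older, same income
c = np.array([20, 50500])   # same age, income only 500 higher

dist_ab = np.linalg.norm(a - b)  # driven only by the age gap
dist_ac = np.linalg.norm(a - c)  # driven only by the income gap
print(dist_ab, dist_ac)
```

On raw values, a 1% change in income looks far bigger than a 40-year age gap, simply because income is measured in larger numbers.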
The most common method is standardisation with StandardScaler: it shifts each column to a mean of 0 and rescales it by its spread, so a value becomes “how many standard deviations above or below average”.
from sklearn.preprocessing import StandardScaler
import pandas as pd
df = pd.DataFrame({
    'age': [20, 40, 60],
    'income': [20000, 50000, 90000]
})
scaler = StandardScaler()
# fit_transform learns each column's mean and spread, then rescales it
scaled = scaler.fit_transform(df)
print(scaled.round(2))
Note: Output:
[[-1.22 -1.16]
 [ 0.   -0.12]
 [ 1.22  1.28]]
Both columns now sit on the same scale: roughly -1.2 to +1.3. The middle row (age 40, income 50000) is near 0, close to average for each column. Income no longer dwarfs age; the model weighs them fairly. The raw 90000 became a modest 1.28.
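The "how many standard deviations above or below average" idea can be checked by hand. This is a minimal sketch using NumPy on the age column only; it assumes the population standard deviation (ddof=0), which is the version StandardScaler uses.

```python
import numpy as np

age = np.array([20, 40, 60])

mean = age.mean()        # 40.0
std = age.std()          # NumPy's default is ddof=0 (population std)
z = (age - mean) / std   # subtract the mean, divide by the spread
print(z.round(2))
```

The result matches the age column from StandardScaler: the youngest row is about 1.22 standard deviations below average, the oldest about 1.22 above.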
A taste of creating new features
Feature engineering also means inventing useful columns from ones you have — especially from dates. A raw date is hard to model; the month or day-of-week hidden inside it often is not.
import pandas as pd
df = pd.DataFrame({'order_date': ['2026-01-15', '2026-07-04', '2026-12-25']})
# Convert the text to real datetime values so the .dt accessor works
df['order_date'] = pd.to_datetime(df['order_date'])
df['month'] = df['order_date'].dt.month
df['weekday'] = df['order_date'].dt.day_name()
print(df)
Note: Output:
  order_date  month   weekday
0 2026-01-15      1  Thursday
1 2026-07-04      7  Saturday
2 2026-12-25     12    Friday
pd.to_datetime turned text into real dates, then .dt.month and .dt.day_name() pulled out new columns a model can use — e.g. to learn that weekend or December orders behave differently. We created information that was hidden in the raw date.
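Going one step further, the weekend and December ideas can become their own yes/no columns. This is a sketch on the same three dates; the column names is_weekend and is_december are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({'order_date': ['2026-01-15', '2026-07-04', '2026-12-25']})
df['order_date'] = pd.to_datetime(df['order_date'])

# dayofweek counts Monday=0 ... Sunday=6, so 5 and 6 are the weekend
df['is_weekend'] = df['order_date'].dt.dayofweek >= 5
df['is_december'] = df['order_date'].dt.month == 12
print(df)
```

Only the 4 July order falls on a weekend, and only the 25 December order is flagged as December.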
| Raw column | Problem | Feature-engineering fix |
|---|---|---|
| Text category (city) | Model cannot read text | One-hot encode (get_dummies) |
| Numbers on huge scales | Big numbers dominate | Scale (StandardScaler) |
| A date string | Hard to model directly | Extract month / weekday |
| Two related columns | Pattern is implicit | Combine into a ratio/total |
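
The last row of the table has no example elsewhere in this lesson, so here is a minimal sketch. The area_sqft column and its values are assumptions made up for illustration; the point is only that dividing one column by another can surface a pattern the model would otherwise have to discover itself.

```python
import pandas as pd

# Hypothetical data: price alongside a made-up area column
df = pd.DataFrame({
    'price': [60, 120, 150],
    'area_sqft': [600, 1000, 1500],
})

# Combine two related columns into a single ratio feature
df['price_per_sqft'] = df['price'] / df['area_sqft']
print(df)
```

The second row has the highest price per square foot even though it does not have the highest price, which is the kind of signal a ratio makes explicit.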
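Tip: Order matters: clean first (fix blanks and types), then engineer features (encode and scale), then model. If you hold out test data, split before scaling and fit the scaler on the training rows only, so statistics from the test rows do not leak into training. Scaling text or encoding a column with blanks will error or mislead.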
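When some rows are held out for testing, a common practice is to learn the scaling from the training rows and reuse it on the test rows. The sketch below assumes a simple manual split (first three rows for training) and made-up values; a real project would usually split at random.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical data, already cleaned
df = pd.DataFrame({
    'age': [20, 40, 60, 30, 50],
    'income': [20000, 50000, 90000, 35000, 70000]
})

train = df.iloc[:3]   # rows used for training (simple manual split)
test = df.iloc[3:]    # rows held out for testing

scaler = StandardScaler()
train_scaled = scaler.fit_transform(train)  # learn mean and spread from training rows only
test_scaled = scaler.transform(test)        # reuse those same statistics on the test rows
print(test_scaled.round(2))
```

Note that the test rows go through transform, not fit_transform, so the scaler never sees their values while learning.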
Watch out: Do not turn an unordered category into a single number column (Pune=0, Delhi=1, Mumbai=2) — the model will read fake order and distance into it. Use one-hot encoding for categories with no natural ranking.
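The fake distance can be made visible with a few lines. This sketch compares the gaps implied by integer codes with the distances between one-hot rows, using the three cities from this lesson.

```python
import numpy as np
import pandas as pd

cities = pd.Series(['Pune', 'Delhi', 'Mumbai'])

# Naive integer codes: Pune=0, Delhi=1, Mumbai=2
codes = cities.map({'Pune': 0, 'Delhi': 1, 'Mumbai': 2}).to_numpy()
gap_pune_delhi = abs(codes[1] - codes[0])
gap_pune_mumbai = abs(codes[2] - codes[0])
print(gap_pune_delhi, gap_pune_mumbai)  # Mumbai looks twice as far from Pune as Delhi

# One-hot rows: every pair of different cities is equally far apart
onehot = pd.get_dummies(cities, dtype=int).to_numpy()
d_pune_delhi = np.linalg.norm(onehot[0] - onehot[1])
d_pune_mumbai = np.linalg.norm(onehot[0] - onehot[2])
print(d_pune_delhi, d_pune_mumbai)
```

With integer codes the gaps differ for no real reason; with one-hot encoding both distances come out the same.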
Q. Why do we one-hot encode a city column instead of just numbering the cities 0, 1, 2?
✍️ Practice
- Take a DataFrame with a category column and one-hot encode it with pd.get_dummies; count how many new columns appear.
- Scale two numeric columns with StandardScaler and confirm each scaled column has roughly mean 0.
🏠 Homework
- On a real dataset, prepare it for modelling: one-hot encode at least one category column and scale at least one numeric column. Briefly note which columns you encoded vs scaled and why.