
Feature Engineering: Encoding & Scaling

Models only understand numbers, and they learn best when those numbers are on a fair footing. So you turn raw columns into model-ready features by encoding text and scaling numbers.

What you will learn

Explain why a model needs numeric, scaled inputs
One-hot encode a category column with get_dummies
Put numbers on a common scale with StandardScaler

Clean data is not yet model-ready

Cleaning fixes wrong data: blanks, duplicates, bad types. Feature engineering is the next step: reshaping correct data into the form a model can learn from. A feature is just an input column the model uses to predict. In many real projects, better features improve accuracy more than a fancier model does.

Models do maths, so they hit two walls with raw data: (1) they cannot read text categories like “Pune” or “Delhi”, and (2) they get confused when one column is in thousands and another in single digits. We fix the first with encoding and the second with scaling.

Problem 1: a model cannot read text

Take a city column with values like “Pune” and “Delhi”. A model needs numbers. The naive fix (Pune=0, Delhi=1, Mumbai=2) is a trap: it tells the model that Mumbai (2) is “twice” Delhi (1) and that Delhi sits “between” the other two, which is nonsense. The right fix is one-hot encoding: make one yes/no (1/0) column per category.

One-hot encode the city column with get_dummies

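import pandas as pd

# Small example table: one text column, one numeric column
df = pd.DataFrame({
    'city':  ['Pune', 'Delhi', 'Pune', 'Mumbai'],
    'price': [60, 120, 65, 150]
})

# Replace 'city' with one yes/no column per distinct city
encoded = pd.get_dummies(df, columns=['city'])
print(encoded)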

Note: Output:

   price  city_Delhi  city_Mumbai  city_Pune
0     60       False        False       True
1    120        True        False      False
2     65       False        False       True
3    150       False         True      False

get_dummies replaced the one city column with three yes/no columns. Row 0 is Pune, so only city_Pune is True (1). No false ranking is implied: each city is just “on or off”. (True/False count as 1/0 for the model.)
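If you would rather see literal 1/0 values than True/False, a minimal sketch is to pass the dtype parameter of get_dummies. It reuses the same example table; the name encoded_int is only illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'city':  ['Pune', 'Delhi', 'Pune', 'Mumbai'],
    'price': [60, 120, 65, 150]
})

# dtype=int asks get_dummies for 1/0 columns instead of True/False
encoded_int = pd.get_dummies(df, columns=['city'], dtype=int)
print(encoded_int)

# Sanity check: every row has exactly one city switched on
city_cols = ['city_Delhi', 'city_Mumbai', 'city_Pune']
print(encoded_int[city_cols].sum(axis=1).tolist())   # [1, 1, 1, 1]
```

The information is the same either way; the integer form is just easier to read, and easier to add up when you want to check the encoding.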

Problem 2: columns on wildly different scales

Imagine predicting from age (around 20–60) and income (around 20,000–90,000). Many models judge “distance” between rows, and income’s big numbers would drown out age completely — not because age matters less, but because its numbers are smaller. Scaling rescales every column to a comparable range so each gets a fair say.

The most common method is standardisation with StandardScaler: it shifts each column to a mean of 0 and rescales it by its spread, so a value becomes “how many standard deviations above or below average”.

Put two columns on a common scale with StandardScaler

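from sklearn.preprocessing import StandardScaler
import pandas as pd

df = pd.DataFrame({
    'age':    [20, 40, 60],
    'income': [20000, 50000, 90000]
})

scaler = StandardScaler()
# fit learns each column's mean and standard deviation; transform applies them
scaled = scaler.fit_transform(df)
# The result is a NumPy array with one row per input row
print(scaled.round(2))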

Note: Output:

[[-1.22 -1.16]
 [ 0.   -0.12]
 [ 1.22  1.28]]

Both columns now sit on the same scale: roughly -1.2 to +1.3. The middle row (age 40, income 50000) is at or near 0, which means close to average for each column. Income no longer dwarfs age; the model weighs them fairly. The raw 90000 became a modest 1.28.
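To see that there is no magic inside StandardScaler, the same result can be reproduced by hand. This is a sketch, not the library's internals. The name manual is illustrative, and ddof=0 is passed because StandardScaler divides by the population standard deviation while pandas defaults to the sample one.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

df = pd.DataFrame({
    'age':    [20, 40, 60],
    'income': [20000, 50000, 90000]
})

# By hand: subtract each column's mean, divide by its standard deviation.
# ddof=0 selects the population standard deviation.
manual = (df - df.mean()) / df.std(ddof=0)

scaled = StandardScaler().fit_transform(df)

print(np.allclose(manual.to_numpy(), scaled))   # do the two versions agree?
print(scaled.mean(axis=0).round(2))             # column means after scaling
print(scaled.std(axis=0).round(2))              # column standard deviations after scaling
```

After scaling, each column has a mean of about 0 and a standard deviation of 1, which is what “how many standard deviations above or below average” means in practice.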

A taste of creating new features

Feature engineering also means inventing useful columns from ones you have — especially from dates. A raw date is hard to model; the month or day-of-week hidden inside it often is not.

Engineer month and weekday features from a date

import pandas as pd

df = pd.DataFrame({'order_date': ['2026-01-15', '2026-07-04', '2026-12-25']})
# Parse the text into real datetime values so the .dt accessor works
df['order_date'] = pd.to_datetime(df['order_date'])

df['month']   = df['order_date'].dt.month        # 1 to 12
df['weekday'] = df['order_date'].dt.day_name()   # e.g. 'Saturday'
print(df)

Note: Output:

  order_date  month   weekday
0 2026-01-15      1  Thursday
1 2026-07-04      7  Saturday
2 2026-12-25     12    Friday

pd.to_datetime turned text into real dates, then .dt.month and .dt.day_name() pulled out new columns a model can use, e.g. to learn that weekend or December orders behave differently. We surfaced information that was hidden in the raw date. (The weekday column is text, so it would itself need one-hot encoding before modelling.)

Raw column	Problem	Feature-engineering fix
Text category (city)	Model cannot read text	One-hot encode (`get_dummies`)
Numbers on huge scales	Big numbers dominate	Scale (`StandardScaler`)
A date string	Hard to model directly	Extract month / weekday
Two related columns	Pattern is implicit	Combine into a ratio/total
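The last row of the table, combining related columns, could look like the sketch below. The homes table and all of its column names are hypothetical, invented only to illustrate one ratio and one total.

```python
import pandas as pd

# Hypothetical listings data, invented for this example
homes = pd.DataFrame({
    'price':     [60, 120, 65, 150],
    'area':      [600, 1000, 650, 1200],
    'bedrooms':  [2, 3, 2, 4],
    'bathrooms': [1, 2, 1, 2]
})

# Ratio: price relative to size, which makes small and large homes comparable
homes['price_per_area'] = homes['price'] / homes['area']

# Total: one combined count instead of two separate ones
homes['total_rooms'] = homes['bedrooms'] + homes['bathrooms']

print(homes)
```

Whether a ratio or a total helps depends on the problem; the point is only that a relationship the model would otherwise have to discover is handed to it as its own column.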

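Tip: Order matters: clean first (fix blanks and types), then engineer features (encode and scale), then model. If you hold out a test set, split before fitting the scaler and fit it on the training rows only, so that test-set statistics do not leak into training. Scaling text will raise an error, and encoding a column that still has blanks can mislead.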
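When a test set is held out, one common way to keep it from influencing the scaling step is to split first and fit the scaler on the training rows only. The sketch below assumes that workflow; the data is invented, and the test_size and random_state values are arbitrary choices.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Hypothetical data, invented for this example
df = pd.DataFrame({
    'age':    [20, 25, 30, 35, 40, 45, 50, 60],
    'income': [20000, 28000, 35000, 42000, 50000, 61000, 72000, 90000]
})

# Hold out a test set before any scaling statistics are computed
train, test = train_test_split(df, test_size=0.25, random_state=0)

scaler = StandardScaler()
train_scaled = scaler.fit_transform(train)   # learn mean and spread from training rows only
test_scaled = scaler.transform(test)         # reuse those same numbers on the test rows

print(train_scaled.mean(axis=0).round(2))    # close to 0 by construction
print(test_scaled.mean(axis=0).round(2))     # usually not exactly 0, which is expected
```

The test rows are transformed with the training mean and standard deviation, so their scaled mean will typically not be 0. That is a sign the test set stayed unseen, not a bug.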

Watch out: Do not turn an unordered category into a single number column (Pune=0, Delhi=1, Mumbai=2) — the model will read fake order and distance into it. Use one-hot encoding for categories with no natural ranking.
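The “fake distance” can be made concrete with a small sketch. It compares the gap between cities under integer codes with the gap under one-hot vectors; the dictionaries are written out by hand purely for illustration.

```python
import numpy as np

# Integer codes: the mapping warned against above
codes = {'Pune': 0, 'Delhi': 1, 'Mumbai': 2}

# One-hot vectors in the column order [Delhi, Mumbai, Pune]
one_hot = {
    'Pune':   np.array([0, 0, 1]),
    'Delhi':  np.array([1, 0, 0]),
    'Mumbai': np.array([0, 1, 0]),
}

pairs = [('Pune', 'Delhi'), ('Delhi', 'Mumbai'), ('Pune', 'Mumbai')]
for a, b in pairs:
    code_gap = abs(codes[a] - codes[b])
    one_hot_gap = np.linalg.norm(one_hot[a] - one_hot[b])
    print(f'{a}-{b}: code gap = {code_gap}, one-hot gap = {one_hot_gap:.2f}')
```

With integer codes, Pune and Mumbai end up twice as far apart (2) as the other pairs (1), purely because of the order the numbers were handed out. With one-hot vectors every pair is the same distance apart (1.41), so no city is treated as closer to another.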

Q. Why do we one-hot encode a city column instead of just numbering the cities 0, 1, 2?

Answer: Numbering unordered categories (Pune=0, Delhi=1, Mumbai=2) tells the model that Delhi is “between” Pune and Mumbai and that Mumbai is “twice” Delhi, which are false relationships. One-hot encoding gives each category an independent yes/no column.

✍️ Practice

Take a DataFrame with a category column and one-hot encode it with pd.get_dummies; count how many new columns appear.
Scale two numeric columns with StandardScaler and confirm each scaled column has roughly mean 0.

🏠 Homework

On a real dataset, prepare it for modelling: one-hot encode at least one category column and scale at least one numeric column. Briefly note which columns you encoded vs scaled and why.