
Exploratory Data Analysis (EDA), Step by Step

Before you chart or model anything, you explore — a deliberate, named routine that turns a fresh dataset into questions worth asking.

What you will learn

Run the four-step EDA routine on a new dataset
Tell univariate from bivariate analysis
Read a correlation heatmap to spot relationships

What EDA is, and why it has a name

You already know the individual moves: head, info, describe, groupby, charts. Exploratory Data Analysis (EDA) is doing them on purpose, in order, as the very first thing on any new dataset. The goal is not a finished answer; it is to understand the data and form questions before you invest in heavy cleaning or build a model. Skipping EDA is how people build charts on broken data and trust models that were doomed from the start.

EDA follows a simple four-step loop. We will walk through each step on a small employee table.

Look at the whole — size, columns, types, missing values (shape, info).
Summarise the numbers — typical values and spread (describe).
Univariate — study one column at a time (its distribution).
Bivariate — study two columns together (do they relate?).

A small dataset to explore

import pandas as pd

df = pd.DataFrame({
    'dept':       ['Eng', 'Eng', 'Sales', 'Sales', 'Eng', 'Sales'],
    'experience': [5, 2, 8, 3, 10, 6],
    'salary':     [90, 60, 85, 45, 130, 70]
})

Note: Output: (No output — just the table. Six employees with a department, years of experience, and salary in thousands. This is what we will explore.)
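Step 1 of the loop (look at the whole) can be sketched on this table as follows. This is a minimal sketch rather than the only way to do it: the variable name missing is introduced here for illustration, and the DataFrame is rebuilt so the snippet runs on its own.

```python
import pandas as pd

df = pd.DataFrame({
    'dept':       ['Eng', 'Eng', 'Sales', 'Sales', 'Eng', 'Sales'],
    'experience': [5, 2, 8, 3, 10, 6],
    'salary':     [90, 60, 85, 45, 130, 70]
})

# Size of the table as (rows, columns)
print(df.shape)

# Column names, dtypes and non-null counts in one listing
df.info()

# Missing values per column (illustrative name: missing)
missing = df.isna().sum()
print(missing)
```

On this table the shape is (6, 3) and every missing count is zero, so there are no gaps to chase before summarising.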

Step 1–2: the big picture and the number summary

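Step 1 is a quick look at the whole table: df.shape for its size and df.info() for column types and missing values. Step 2 is describe(), which gives a one-shot summary of every numeric column: how many values, the average, the spread, the quartiles, and the smallest/largest. It is the fastest way to get a feel for a dataset.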

A statistical summary of the numeric columns

print(df.describe())

Note: Output:

       experience      salary
count    6.000000    6.000000
mean     5.666667   80.000000
std      3.011091   29.495762
min      2.000000   45.000000
25%      3.500000   62.500000
50%      5.500000   77.500000
75%      7.500000   88.750000
max     10.000000  130.000000

In one glance: experience runs 2–10 years (average 5.7), salary 45–130k (average 80, with a wide std of about 29.5, so salaries vary a lot). The 50% row is the median. The salary mean (80) sits above its median (77.5), a first hint that a high value is pulling the average up, so already we suspect salary is spread unevenly.
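The mean-versus-median reading of the summary can be checked directly. The sketch below uses the same six salaries; the names gap and skewness are illustrative, and pandas' skew() is added here as one common asymmetry measure, not something the summary table reports.

```python
import pandas as pd

salary = pd.Series([90, 60, 85, 45, 130, 70], name='salary')

mean = salary.mean()       # the 'mean' row of describe()
median = salary.median()   # the '50%' row of describe()

# A mean above the median suggests a tail of high values
gap = mean - median
print(f"mean={mean}, median={median}, gap={gap}")

# Sample skewness: positive means the long tail is on the high side
skewness = salary.skew()
print(f"skew={skewness:.2f}")
```

A positive gap and positive skew point the same way: a few high salaries stretch the distribution to the right.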

Step 3: univariate analysis (one column at a time)

Univariate means “one variable”. You look at a single column on its own to learn its distribution — where values cluster, and whether any outliers lurk. A histogram is the classic tool.

A histogram of one column — univariate analysis

import matplotlib.pyplot as plt

df['salary'].plot(kind='hist', bins=5, title='Salary distribution')
plt.xlabel('Salary (thousands)')
plt.show()

Note: Output: (A histogram with most bars between 45 and 90, and one lonely bar out near 130.) The lonely high bar at 130 is a possible outlier — one person earns far more than the rest. EDA just turned a hidden fact into a question: who is that, and should they skew our averages?
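To back the visual impression with a number, one common convention (an assumption here, not something this lesson prescribes) is the 1.5 × IQR rule: flag any value more than 1.5 interquartile ranges above the third quartile or below the first. A minimal sketch on the salary column, with illustrative variable names:

```python
import pandas as pd

salary = pd.Series([90, 60, 85, 45, 130, 70], name='salary')

# Quartiles, as in the 25% and 75% rows of describe()
q1 = salary.quantile(0.25)
q3 = salary.quantile(0.75)
iqr = q3 - q1

# Fences of the 1.5 * IQR rule (a convention, not a law)
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

# Values outside the fences are candidates for a closer look
outliers = salary[(salary < lower) | (salary > upper)]
print(f"fences: {lower} to {upper}")
print(outliers.tolist())
```

On this column the upper fence is 128.125, so the 130 salary is the only value flagged. With six rows that is a prompt to investigate, not proof of an error.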

Step 4: bivariate analysis (two columns together)

Bivariate means “two variables”. Now you ask whether two columns relate. The quickest numeric tool is the correlation matrix; the quickest picture is a heatmap, where colour shows the direction and strength of each correlation.

A correlation heatmap — bivariate analysis at a glance

import seaborn as sns
import matplotlib.pyplot as plt

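# Pearson correlation between the two numeric columns
corr = df[['experience', 'salary']].corr()
# vmin/vmax pin the colour scale to -1..1; otherwise it stretches to the data's own range
sns.heatmap(corr, annot=True, cmap='coolwarm', vmin=-1, vmax=1)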
plt.title('Correlation heatmap')
plt.show()

Note: Output: (A 2×2 grid of coloured squares. The experience–salary squares are warm/red with the number 0.87 printed on them; the diagonal shows 1.) annot=True printed the numbers; the warm colour and 0.87 say experience and salary rise together strongly, at least across these six rows. On a wide table with 20 numeric columns, this single picture shows every pair’s relationship at once, which is why heatmaps are the signature EDA chart.
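The same bivariate questions can be asked without a chart. The sketch below computes the single correlation behind the heatmap, then compares salary across departments with a groupby, since corr() only covers numeric columns. The names r and by_dept are illustrative, and select_dtypes is just one way to pick out the numeric columns.

```python
import pandas as pd

df = pd.DataFrame({
    'dept':       ['Eng', 'Eng', 'Sales', 'Sales', 'Eng', 'Sales'],
    'experience': [5, 2, 8, 3, 10, 6],
    'salary':     [90, 60, 85, 45, 130, 70]
})

# Numeric vs numeric: Pearson correlation for one pair
r = df['experience'].corr(df['salary'])
print(round(r, 2))

# Correlation matrix over every numeric column (scales to wide tables)
corr_all = df.select_dtypes('number').corr()
print(corr_all)

# Categorical vs numeric: average salary within each department
by_dept = df.groupby('dept')['salary'].mean()
print(by_dept.round(2))
```

Here r rounds to 0.87, and the Eng average (about 93) sits well above the Sales average (about 67). That second gap partly reflects the single 130 salary in Eng, which is exactly the kind of question EDA is meant to raise.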

EDA step	Question	Tools
Overview	How big? What types? Any gaps?	`shape`, `info`
Summary	What are typical values?	`describe()`
Univariate	How is one column spread?	histogram, `value_counts()`
Bivariate	Do two columns relate?	scatter, `corr()`, heatmap

Tip: EDA is detective work, not report-writing. Print, chart, and ask “does that look right?” at every step. The questions you raise here (outliers? a strong correlation?) become the things you clean, test and model next.

Watch out: Do not skip straight to a model because the data “looks fine”. The 130k salary above would silently pull up averages and confuse a model. EDA catches problems like outliers and skew before they poison everything downstream.
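How much one value moves the averages can be measured rather than guessed. This sketch compares the mean and median of salary with and without the 130 entry; dropping the row here is only a way to gauge its influence, not a recommendation to delete it, and the variable names are illustrative.

```python
import pandas as pd

salary = pd.Series([90, 60, 85, 45, 130, 70], name='salary')

# Same column with the single high value set aside
without_top = salary[salary < 130]

mean_all, mean_rest = salary.mean(), without_top.mean()
median_all, median_rest = salary.median(), without_top.median()

print(f"mean:   {mean_all} with, {mean_rest} without")
print(f"median: {median_all} with, {median_rest} without")
```

One row out of six shifts the mean from 70 to 80, while the median moves from 70 to 77.5. That is why an unexamined outlier can quietly distort any summary or model built on the mean.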

Q. You make a scatter plot of two numeric columns to see if they move together. What kind of analysis is this?

Answer: Looking at two variables together (here, with a scatter plot) is bivariate analysis. Studying one column on its own would be univariate.

✍️ Practice

On any dataset, run describe() and write one sentence about the spread of each numeric column.
Draw a histogram of one column (univariate) and a correlation heatmap of all numeric columns (bivariate), and note one thing each reveals.

🏠 Homework

Take a real CSV and write a short EDA report: shape and types, a describe() summary, one histogram, and a correlation heatmap. End with two questions the exploration raised.