Exploratory Data Analysis (EDA), Step by Step
Before you chart or model anything, you explore — a deliberate, named routine that turns a fresh dataset into questions worth asking.
What you will learn
- Run the four-step EDA routine on a new dataset
- Tell univariate from bivariate analysis
- Read a correlation heatmap to spot relationships
What EDA is, and why it has a name
You already know the individual moves — head, info, describe, groupby, charts. Exploratory Data Analysis (EDA) is doing them on purpose, in order, the very first thing on any new dataset. The goal is not a finished answer; it is to understand the data and form questions before you clean hard or build a model. Skipping EDA is how people build charts on broken data and trust models that were doomed from the start.
EDA follows a simple loop. We will walk each step on a small employee table.
- Look at the whole: size, columns, types, missing values (shape, info).
- Summarise the numbers: typical values and spread (describe).
- Univariate: study one column at a time (its distribution).
- Bivariate: study two columns together (do they relate?).
import pandas as pd
df = pd.DataFrame({
    'dept': ['Eng', 'Eng', 'Sales', 'Sales', 'Eng', 'Sales'],
    'experience': [5, 2, 8, 3, 10, 6],   # years
    'salary': [90, 60, 85, 45, 130, 70]  # thousands
})
Note: Output: (No output; this only builds the table. Six employees with a department, years of experience, and salary in thousands. This is what we will explore.)
Step 1–2: the big picture and the number summary
describe() gives a one-shot summary of every numeric column: the count of values, the mean, the standard deviation, the smallest and largest values, and the quartiles in between. It is the fastest way to get a feel for a dataset.
print(df.describe())
Note: Output:
       experience      salary
count    6.000000    6.000000
mean     5.666667   80.000000
std      3.011091   29.495762
min      2.000000   45.000000
25%      3.500000   62.500000
50%      5.500000   77.500000
75%      7.500000   88.750000
max     10.000000  130.000000
In one glance: experience runs 2–10 years (average 5.7), and salary runs 45–130k (average 80, with a wide std of about 29.5, so salaries vary a lot). The 50% row is the median. The salary mean (80) sits above its median (77.5), so already we suspect salary is spread unevenly.
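The describe() call covers the summary half of this step; the overview half (shape, info) can be sketched as follows. This is a minimal illustration that rebuilds the same employee table; the isna().sum() check is one common way to count missing values and is an addition here, not something the lesson prescribes.

```python
import pandas as pd

# Same six-row employee table used in this lesson (salary in thousands).
df = pd.DataFrame({
    'dept': ['Eng', 'Eng', 'Sales', 'Sales', 'Eng', 'Sales'],
    'experience': [5, 2, 8, 3, 10, 6],
    'salary': [90, 60, 85, 45, 130, 70],
})

# How big is it? shape is a (rows, columns) tuple.
n_rows, n_cols = df.shape
print(f'{n_rows} rows, {n_cols} columns')

# What types, and how many non-null values per column? info() prints directly.
df.info()

# Any gaps? Count missing values in each column.
missing_per_column = df.isna().sum()
print(missing_per_column)
```

On this toy table every count of missing values is zero. On a real CSV, a non-zero count here is usually the first question the exploration raises.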
Step 3: univariate analysis (one column at a time)
Univariate means “one variable”. You look at a single column on its own to learn its distribution — where values cluster, and whether any outliers lurk. A histogram is the classic tool.
import matplotlib.pyplot as plt
df['salary'].plot(kind='hist', bins=5, title='Salary distribution')
plt.xlabel('Salary (thousands)')
plt.show()
Note: Output: (A histogram with most bars between 45 and 90, and one lonely bar out near 130.)
The lonely high bar at 130 is a possible outlier, one person who earns far more than the rest. EDA just turned a hidden fact into a question: who is that, and should they skew our averages?
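One way to follow up on that question numerically is the 1.5 × IQR rule of thumb. The rule is a widely used convention brought in here for illustration, not a method the lesson itself names, and the variable names are hypothetical. The sketch flags any salary that falls outside the fences built from the quartiles.

```python
import pandas as pd

# Salary column from the lesson table, in thousands.
salary = pd.Series([90, 60, 85, 45, 130, 70], name='salary')

# Quartiles and interquartile range (pandas interpolates linearly by default).
q1 = salary.quantile(0.25)
q3 = salary.quantile(0.75)
iqr = q3 - q1

# Assumed convention: values beyond 1.5 * IQR from the quartiles count as outliers.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

outliers = salary[(salary < lower_fence) | (salary > upper_fence)]
print(f'fences: {lower_fence} to {upper_fence}')
print(outliers.tolist())
```

Here the only value flagged is 130, and it clears the upper fence by a small margin. With six rows, treat the flag as a prompt to investigate rather than a verdict.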
Step 4: bivariate analysis (two columns together)
Bivariate means “two variables”. Now you ask whether two columns relate. The quickest numeric tool is the correlation matrix; the quickest picture is a heatmap, where colour shows the strength of each correlation.
import seaborn as sns
import matplotlib.pyplot as plt
corr = df[['experience', 'salary']].corr()
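# Pin the colour scale to the full correlation range so warm always means positive.
sns.heatmap(corr, annot=True, cmap='coolwarm', vmin=-1, vmax=1)
plt.title('Correlation heatmap')
plt.show()
Note: Output:
(A 2×2 grid of coloured squares. The experience–salary squares are warm/red with the number 0.87 printed on them; the diagonal shows 1.)
annot=True printed the numbers; the warm colour and 0.87 say experience and salary rise together strongly. Setting vmin=-1 and vmax=1 matters: without them seaborn scales the colours to the smallest and largest values in the matrix, so on this 2×2 grid the 0.87 squares would be drawn in the coolest colour. On a wide table with 20 columns, this single picture shows every pair’s relationship at once, which is why heatmaps are the signature EDA chart.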
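Bivariate analysis also works without a chart, and it is not limited to two numeric columns. The sketch below is illustrative only, with hypothetical variable names. It selects the numeric columns before calling corr(), which avoids problems with the text column dept on a full table, and then compares a category against a number with groupby.

```python
import pandas as pd

# Same six-row employee table used in this lesson (salary in thousands).
df = pd.DataFrame({
    'dept': ['Eng', 'Eng', 'Sales', 'Sales', 'Eng', 'Sales'],
    'experience': [5, 2, 8, 3, 10, 6],
    'salary': [90, 60, 85, 45, 130, 70],
})

# Numeric vs numeric: keep only numeric columns, then take pairwise correlations.
numeric = df.select_dtypes(include='number')
corr = numeric.corr()                      # Pearson correlation by default
r = corr.loc['experience', 'salary']
print(round(r, 2))

# Category vs numeric: average salary within each department.
mean_salary_by_dept = df.groupby('dept')['salary'].mean()
print(mean_salary_by_dept.round(1))
```

The groupby result raises its own question: the Eng average is higher, but Eng also contains the top earner, so the gap may owe more to one person than to the department.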
| EDA step | Question | Tools |
|---|---|---|
| Overview | How big? What types? Any gaps? | shape, info |
| Summary | What are typical values? | describe() |
| Univariate | How is one column spread? | histogram, value_counts() |
| Bivariate | Do two columns relate? | scatter, corr(), heatmap |
Tip: EDA is detective work, not report-writing. Print, chart, and ask “does that look right?” at every step. The questions you raise here (outliers? a strong correlation?) become the things you clean, test and model next.
Watch out: Do not skip straight to a model because the data “looks fine”. The 130k salary above would silently pull up averages and confuse a model. EDA catches problems like outliers and skew before they poison everything downstream.
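The pull of that one salary on the average can be checked directly. The following is a small sketch, not a recommendation to delete the row: it drops the top value only to compare the mean and median with and without it.

```python
import pandas as pd

# Salary column from the lesson table, in thousands.
salary = pd.Series([90, 60, 85, 45, 130, 70])

# For comparison only: the same column without the single highest value.
without_top = salary[salary < salary.max()]

print(salary.mean(), salary.median())            # all six salaries
print(without_top.mean(), without_top.median())  # top earner left out
```

With all six rows the mean is 80 and the median 77.5; without the 130k row both are 70. The mean moves by 10 while the median moves by 7.5, which is why the median is often the safer "typical value" when a column has a long tail.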
Q. You make a scatter plot of two numeric columns to see if they move together. What kind of analysis is this?
✍️ Practice
- On any dataset, run describe() and write one sentence about the spread of each numeric column.
- Draw a histogram of one column (univariate) and a correlation heatmap of all numeric columns (bivariate), and note one thing each reveals.
🏠 Homework
- Take a real CSV and write a short EDA report: shape and types, a describe() summary, one histogram, and a correlation heatmap. End with two questions the exploration raised.