Group & Aggregate with groupby
Answer “per category” questions — total sales per region, average price per product — by grouping and summarising.
What you will learn
- Group rows by a column
- Aggregate each group (sum, mean, count)
- Read a grouped result
The “per category” question
So many real questions sound like “… per …”: sales per region, average score per class, orders per customer. The tool for all of them is groupby. It splits the table into groups, then summarises each group into one number.
Think of it as three steps — often called split–apply–combine: split the rows into groups, apply a summary to each group, and combine the answers into a small table.
Picture sorting a deck of cards into suits, then counting each pile. Sorting into piles is the split; counting each pile is the apply; writing the four totals on one line is the combine. groupby does all three for you.
import pandas as pd
df = pd.DataFrame({
'region': ['North', 'South', 'North', 'South', 'North'],
'amount': [250, 120, 130, 850, 300]
})
print(df.groupby('region')['amount'].sum())Note: Output: region North 680 South 970 Name: amount, dtype: int64 Pandas split the rows by region, then summed the amount in each. North totals 680 (250+130+300), South totals 970 (120+850).
Other summaries
Swap .sum() for .mean(), .count(), .max() — whatever question you are asking.
print(df.groupby('region')['amount'].mean()) # average per region
print('---')
print(df.groupby('region')['amount'].count()) # how many sales per regionNote: Output: region North 226.666667 South 485.000000 Name: amount, dtype: float64 --- region North 3 South 2 Name: amount, dtype: int64 North averages about 227 across its 3 sales; South averages 485 across 2 sales. Same groups, different question.
| You want… | Aggregation | Question it answers |
|---|---|---|
| Total | .sum() | Sales per region |
| Average | .mean() | Average price per product |
| Count | .count() | How many orders per customer |
| Biggest | .max() | Top sale per region |
Tip: The pattern is always the same: df.groupby(WHAT_TO_GROUP_BY)[WHICH_COLUMN].SUMMARY(). Read it as “for each region, take the amount, and sum it”.
Watch out: A bare df.groupby('region') does nothing visible on its own — you must tell it which column to summarise and how (e.g. ['amount'].sum()), or it just returns a grouping object.
Q. What does df.groupby('region')['amount'].mean() calculate?
✍️ Practice
- Add a
productcolumn to the table and compute the totalamountper product. - Use
groupbyto find the averageamountper region, then the count per region.
🏠 Homework
- On a real CSV, pick a category column and a number column, and answer one “per category” question (e.g. average price per category). Write the result in a sentence.