Cleaning Data: Duplicates & Wrong Types
Two more cleaning jobs: remove duplicate rows that distort your totals, and fix columns stored as the wrong type.
What you will learn
- Find and remove duplicate rows
- Spot a column with the wrong type
- Convert text to numbers with astype
Two more messes to fix
After missing values, the next two cleaning jobs are duplicate rows (the same record twice) and wrong types (a number stored as text). Both quietly ruin your results if you miss them.
Here is the order we will follow in this lesson:
- Check the types of every column with
dtypes, to spot any number that is secretly stored as text. - Remove duplicate rows — count them with
duplicated().sum(), then delete the exact repeats withdrop_duplicates(). - Fix the wrong type — convert the text column to real numbers with
astype(int)so maths works. - Confirm — re-check
dtypes(and your row count) to make sure the table is now clean.
Here is a table with both problems to start from:
import pandas as pd
df = pd.DataFrame({
'order': [1, 2, 2, 3],
'price': ['100', '200', '200', '150'] # prices are text, not numbers!
})
print(df.dtypes)Note: Output:
order int64
price object
dtype: object
dtypes shows each column’s type. order is int64 (numbers), but price is object — and in Pandas object means text. The quotes around the numbers in the data were the clue. (The last line, dtype: object, is just describing the little list of types itself — ignore it; the two lines above it are the column types you care about.) Also notice order 2 appears twice — that is the duplicate row we will remove next.
Remove duplicate rows
Duplicate rows inflate counts, sums and averages. Find them with duplicated().sum() and remove exact repeats with drop_duplicates().
print('duplicates:', df.duplicated().sum())
df = df.drop_duplicates()
print('rows now:', len(df))Note: Output: duplicates: 1 rows now: 3 There was exactly 1 fully repeated row (the second order 2). drop_duplicates removed it, leaving 3 unique rows. Our totals will now be correct.
Fix a wrong type
The price column is text, so price.mean() would fail. Convert it to numbers with astype before doing any maths.
df['price'] = df['price'].astype(int)
print('dtype now:', df['price'].dtype)
print('average price:', df['price'].mean())Note: Output: dtype now: int64 average price: 150.0 price is now int64 (real numbers), so .mean() works and gives 150.0 across the three unique orders. Before the conversion, this line would have raised an error.
| Problem | Find it | Fix it |
|---|---|---|
| Duplicate rows | df.duplicated().sum() | drop_duplicates() |
| Text instead of number | df.dtypes (shows object) | astype(int) or pd.to_numeric() |
| Number instead of text | df.dtypes | astype(str) |
Tip: A good overall cleaning order: info() to spot problems → fix missing values → fix wrong types → remove duplicates → re-run info() to confirm the table is clean.
Watch out: If astype(int) errors, the column probably still has stray text or blanks (like a “–” or an empty cell). Clean those first, or use pd.to_numeric(df[col], errors='coerce') to turn bad values into NaN you can then handle.
Q. A price column shows up as object in df.dtypes. What does that tell you?
✍️ Practice
- Add a fully duplicate row to a small table, confirm it with
df.duplicated().sum(), then remove it. - Make a column of quoted numbers (text), convert it with
astype(int), and print its new.dtype.
🏠 Homework
- On a real CSV, run
df.dtypes, fix at least one column that has the wrong type, then remove any duplicate rows. Report how many duplicates you found.