PandasCore· 35 min read

Cleaning Data: Duplicates & Wrong Types

Two more cleaning jobs: remove duplicate rows that distort your totals, and fix columns stored as the wrong type.

What you will learn

  • Find and remove duplicate rows
  • Spot a column with the wrong type
  • Convert text to numbers with astype

Two more messes to fix

After missing values, the next two cleaning jobs are duplicate rows (the same record twice) and wrong types (a number stored as text). Both quietly ruin your results if you miss them.

Here is the order we will follow in this lesson:

  1. Check the types of every column with dtypes, to spot any number that is secretly stored as text.
  2. Remove duplicate rows — count them with duplicated().sum(), then delete the exact repeats with drop_duplicates().
  3. Fix the wrong type — convert the text column to real numbers with astype(int) so maths works.
  4. Confirm — re-check dtypes (and your row count) to make sure the table is now clean.

Here is a table with both problems to start from:

A table with a duplicate row and a text price column
import pandas as pd

df = pd.DataFrame({
    'order': [1, 2, 2, 3],
    'price': ['100', '200', '200', '150']   # prices are text, not numbers!
})
print(df.dtypes)

Note: Output: order int64 price object dtype: object dtypes shows each column’s type. order is int64 (numbers), but price is object — and in Pandas object means text. The quotes around the numbers in the data were the clue. (The last line, dtype: object, is just describing the little list of types itself — ignore it; the two lines above it are the column types you care about.) Also notice order 2 appears twice — that is the duplicate row we will remove next.

Remove duplicate rows

Duplicate rows inflate counts, sums and averages. Find them with duplicated().sum() and remove exact repeats with drop_duplicates().

Count and remove duplicate rows
print('duplicates:', df.duplicated().sum())

df = df.drop_duplicates()
print('rows now:', len(df))

Note: Output: duplicates: 1 rows now: 3 There was exactly 1 fully repeated row (the second order 2). drop_duplicates removed it, leaving 3 unique rows. Our totals will now be correct.

Fix a wrong type

The price column is text, so price.mean() would fail. Convert it to numbers with astype before doing any maths.

Convert a text column to whole numbers
df['price'] = df['price'].astype(int)
print('dtype now:', df['price'].dtype)
print('average price:', df['price'].mean())

Note: Output: dtype now: int64 average price: 150.0 price is now int64 (real numbers), so .mean() works and gives 150.0 across the three unique orders. Before the conversion, this line would have raised an error.

ProblemFind itFix it
Duplicate rowsdf.duplicated().sum()drop_duplicates()
Text instead of numberdf.dtypes (shows object)astype(int) or pd.to_numeric()
Number instead of textdf.dtypesastype(str)

Tip: A good overall cleaning order: info() to spot problems → fix missing values → fix wrong types → remove duplicates → re-run info() to confirm the table is clean.

Watch out: If astype(int) errors, the column probably still has stray text or blanks (like a “–” or an empty cell). Clean those first, or use pd.to_numeric(df[col], errors='coerce') to turn bad values into NaN you can then handle.

Q. A price column shows up as object in df.dtypes. What does that tell you?

Answer: An object dtype on what should be numbers means the values are stored as text. Convert with astype(int) (or pd.to_numeric) before doing maths.

✍️ Practice

  1. Add a fully duplicate row to a small table, confirm it with df.duplicated().sum(), then remove it.
  2. Make a column of quoted numbers (text), convert it with astype(int), and print its new .dtype.

🏠 Homework

  1. On a real CSV, run df.dtypes, fix at least one column that has the wrong type, then remove any duplicate rows. Report how many duplicates you found.
Want to learn this with a mentor?

CodingClave runs guided, project-based training (28-day, 45-day & 6-month batches).

Explore Training →