
Project: Analyse a Dataset End to End

Put it all together — load a real dataset, clean it, explore it, chart it, and write a clear conclusion.

What you will learn

Run the full data-science workflow on one dataset
Clean and explore data, then visualise a finding
Write a conclusion someone could act on

What you will build

This is the capstone. You will take a dataset of employees — their department, years of experience and salary — and run the whole workflow you have learned: load → clean → explore → visualise → conclude. The question we will answer: “Does experience relate to salary, and which department pays most?”

Use the dataset below (save it as employees.csv), or bring your own and adapt the steps.

employees.csv — note the one blank salary (Meera)

name,department,experience,salary
Asha,Engineering,5,90
Ravi,Engineering,2,60
Meera,Sales,8,
Karan,Sales,3,45
Sara,Engineering,10,130
Dev,Sales,6,70
Nisha,Engineering,1,40
Amit,Sales,4,45

Note: Output: (No output — this is the raw CSV file. Eight employees across two departments. Meera’s salary is blank, which we will handle during cleaning.)
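If you would rather create the file from code than copy and paste it, the following is a minimal sketch that uses only the standard library. It assumes you want employees.csv in the current working directory; the variable names CSV_TEXT and path are illustrative only.

```python
from pathlib import Path

# The eight rows from the lesson; Meera's salary is intentionally left blank.
CSV_TEXT = """name,department,experience,salary
Asha,Engineering,5,90
Ravi,Engineering,2,60
Meera,Sales,8,
Karan,Sales,3,45
Sara,Engineering,10,130
Dev,Sales,6,70
Nisha,Engineering,1,40
Amit,Sales,4,45
"""

# Assumption: the file should live in the current working directory.
path = Path("employees.csv")
path.write_text(CSV_TEXT, encoding="utf-8")

# Sanity check: one header line plus eight data rows.
lines = path.read_text(encoding="utf-8").splitlines()
print(len(lines) - 1, "data rows written to", path)
```

After running this, pd.read_csv('employees.csv') in Step 1 should find the file.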

Step 1 — Load and first look

Read the file and run your standard health check.

Load the data and inspect it

import pandas as pd

df = pd.read_csv('employees.csv')
print(df.shape)
print(df.head())
df.info()

Note: Output:
(8, 4)
    name   department  experience  salary
0   Asha  Engineering           5    90.0
1   Ravi  Engineering           2    60.0
2  Meera        Sales           8     NaN
...
salary    7 non-null    float64

8 rows, 4 columns. info() confirms salary has only 7 non-null values: Meera’s is missing. We found the problem before doing any analysis.
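A health check can also count the gaps directly instead of reading them off info(). The sketch below is one possible way to do that; it inlines the same eight rows so it runs without the file, and the names missing_per_column and rows_with_gaps are illustrative only.

```python
import io
import pandas as pd

# Same eight rows as employees.csv, inlined so the sketch is self-contained.
CSV_TEXT = """name,department,experience,salary
Asha,Engineering,5,90
Ravi,Engineering,2,60
Meera,Sales,8,
Karan,Sales,3,45
Sara,Engineering,10,130
Dev,Sales,6,70
Nisha,Engineering,1,40
Amit,Sales,4,45
"""
df = pd.read_csv(io.StringIO(CSV_TEXT))

# Count missing values in each column.
missing_per_column = df.isnull().sum()
print(missing_per_column)

# Show only the rows that contain at least one missing value.
rows_with_gaps = df[df.isnull().any(axis=1)]
print(rows_with_gaps)
```

On this dataset the only row with a gap is Meera's, and the only column affected is salary.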

Step 2 — Clean

Fill the missing salary with the average salary, so no row is dropped.

Fill the blank salary with the column mean

df['salary'] = df['salary'].fillna(df['salary'].mean())
print(df['salary'].isnull().sum(), 'missing salaries now')

Note: Output:
0 missing salaries now

Meera’s blank is filled with the mean of the other seven salaries (480 / 7 ≈ 68.57), so the column is complete. Zero missing values: ready to analyse.
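Filling is one option; dropping the row is the other. The sketch below compares the two on the lesson's data so you can see what each choice costs. It rebuilds the table inline rather than reading the file, and the names filled and dropped are illustrative only.

```python
import pandas as pd

# The lesson's eight employees; NaN marks Meera's missing salary.
df = pd.DataFrame({
    "name": ["Asha", "Ravi", "Meera", "Karan", "Sara", "Dev", "Nisha", "Amit"],
    "department": ["Engineering", "Engineering", "Sales", "Sales",
                   "Engineering", "Sales", "Engineering", "Sales"],
    "experience": [5, 2, 8, 3, 10, 6, 1, 4],
    "salary": [90, 60, float("nan"), 45, 130, 70, 40, 45],
})

# Option A: fill the gap with the column mean (the choice made in this step).
filled = df.copy()
filled["salary"] = filled["salary"].fillna(filled["salary"].mean())

# Option B: drop any row whose salary is missing.
dropped = df.dropna(subset=["salary"])

print("filled :", len(filled), "rows, mean salary", round(filled["salary"].mean(), 2))
print("dropped:", len(dropped), "rows, mean salary", round(dropped["salary"].mean(), 2))
```

Both options give the same overall mean, because a value placed at the mean cannot move it. The difference is that filling keeps all eight rows while dropping keeps seven.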

Step 3 — Explore with groupby

Answer the “which department pays most?” half of the question.

Average salary per department

print(df.groupby('department')['salary'].mean())

Note: Output:
department
Engineering    80.000000
Sales          57.142857
Name: salary, dtype: float64

Engineering pays more on average (80.0 vs about 57.1). The Sales figure includes Meera’s filled-in salary. One line of groupby answered a real question.
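With only four people per department, a single mean can hide a lot. As a hedged extension of the groupby above, the sketch below uses agg to show the row count, mean and median side by side. It rebuilds the cleaned table inline, and the name summary is illustrative only.

```python
import pandas as pd

# The lesson's eight employees; NaN marks Meera's missing salary.
df = pd.DataFrame({
    "department": ["Engineering", "Engineering", "Sales", "Sales",
                   "Engineering", "Sales", "Engineering", "Sales"],
    "experience": [5, 2, 8, 3, 10, 6, 1, 4],
    "salary": [90, 60, float("nan"), 45, 130, 70, 40, 45],
})

# Same cleaning step as Step 2: fill the gap with the column mean.
df["salary"] = df["salary"].fillna(df["salary"].mean())

# Several statistics per department in one call.
summary = df.groupby("department")["salary"].agg(["count", "mean", "median"])
print(summary.round(2))
```

The count column is a reminder of how few rows sit behind each average, which is worth mentioning when you write the conclusion.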

Step 4 — A statistic and a chart

Now the “does experience relate to salary?” half. Check the correlation, then make it visible with a scatter plot.

Correlation plus a scatter plot of the relationship

import matplotlib.pyplot as plt

print('correlation:', round(df['experience'].corr(df['salary']), 2))

plt.scatter(df['experience'], df['salary'])
plt.title('Experience vs Salary')
plt.xlabel('Years of experience')
plt.ylabel('Salary (thousands)')
plt.show()

Note: Output:
correlation: 0.82
(A scatter plot of dots climbing from lower-left to upper-right.)

A correlation of 0.82 is strong and positive, and the rising dots show it clearly: more experience goes with higher salary in this data.
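To see what corr is doing, you can compute the same number by hand. The sketch below assumes the default method of Series.corr, which is the Pearson correlation: the sum of the products of the deviations from each mean, divided by the square root of the product of the summed squared deviations. The names dx, dy, r_manual and r_pandas are illustrative only.

```python
import pandas as pd

# Experience and salary for the lesson's eight employees (NaN is Meera).
df = pd.DataFrame({
    "experience": [5, 2, 8, 3, 10, 6, 1, 4],
    "salary": [90, 60, float("nan"), 45, 130, 70, 40, 45],
})

# Same cleaning step as Step 2: fill the gap with the column mean.
df["salary"] = df["salary"].fillna(df["salary"].mean())

# Deviations of each value from its column mean.
dx = df["experience"] - df["experience"].mean()
dy = df["salary"] - df["salary"].mean()

# Pearson correlation written out term by term.
r_manual = (dx * dy).sum() / ((dx ** 2).sum() * (dy ** 2).sum()) ** 0.5

# The built-in call used in the lesson, for comparison.
r_pandas = df["experience"].corr(df["salary"])

print("manual:", round(r_manual, 2), "pandas:", round(r_pandas, 2))
```

The two values should agree to within floating-point error, which confirms that the one-line call and the formula describe the same quantity.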

Step 5 — Conclude

The final step is the most important: say what you found, in plain words a manager could act on.

Engineering pays more on average than Sales (80k vs about 57.1k).
Experience and salary are strongly linked (correlation 0.82): more experience tends to come with higher pay.
We filled one missing salary with the average, so treat that single row, and the Sales average that includes it, with mild caution.

Your tasks

Load the CSV and run shape, head and info.
Clean it: handle the missing salary (fill or drop) and confirm zero blanks.
Use groupby to compare a number across a category.
Compute one correlation and draw one chart that shows a finding.
Write a 3–4 line conclusion in plain language.

Tip: Work one step at a time and print after each — load, then look; clean, then check; explore, then chart. A finished, honest analysis of a small dataset beats a fancy half-broken one.

Watch out: Be honest about your cleaning choices. You filled a missing salary with an average — that is a reasonable guess, not a fact. Always note assumptions like this in your conclusion.
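One way to stay honest about a filled value is to flag it before filling, then check how much the finding depends on it. The sketch below is a simple sensitivity check under that idea. The column name salary_was_missing and the variables r_all and r_observed are hypothetical names chosen for illustration.

```python
import pandas as pd

# Experience and salary for the lesson's eight employees (NaN is Meera).
df = pd.DataFrame({
    "name": ["Asha", "Ravi", "Meera", "Karan", "Sara", "Dev", "Nisha", "Amit"],
    "experience": [5, 2, 8, 3, 10, 6, 1, 4],
    "salary": [90, 60, float("nan"), 45, 130, 70, 40, 45],
})

# Record which rows were blank BEFORE filling, so the guess stays traceable.
df["salary_was_missing"] = df["salary"].isnull()
df["salary"] = df["salary"].fillna(df["salary"].mean())

# Correlation using every row, including the filled one.
r_all = df["experience"].corr(df["salary"])

# Correlation using only the salaries that were actually observed.
observed = df[~df["salary_was_missing"]]
r_observed = observed["experience"].corr(observed["salary"])

print("all rows     :", round(r_all, 2))
print("observed only:", round(r_observed, 2))
```

On this dataset the correlation is somewhat higher without the filled row (about 0.90 against 0.82). The direction of the finding is the same either way, which is the kind of detail worth stating in the conclusion.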

Q. In this project, what is the correct order of the workflow steps?

Answer: Load the data, clean it, explore it, visualise a finding, then conclude in plain language. This project follows that order, and most data-science projects do too.

Note: When this works you have completed a full data-science project (load, clean, explore, visualise and conclude), which is the core of the job. Save the notebook, write up your conclusion, and add it to your portfolio!

✍️ Practice

Run the whole project on the employees dataset and produce your scatter plot.
Swap in your own CSV and repeat all five steps, ending with a written conclusion.

🏠 Homework

Write a short report (6–8 lines) on your dataset: the question you asked, what you cleaned, the chart you made, and your conclusion. Include the chart and add the project to your GitHub.