Project: Analyse a Dataset End to End
Put it all together — load a real dataset, clean it, explore it, chart it, and write a clear conclusion.
What you will learn
- Run the full data-science workflow on one dataset
- Clean and explore data, then visualise a finding
- Write a conclusion someone could act on
What you will build
This is the capstone. You will take a dataset of employees — their department, years of experience and salary — and run the whole workflow you have learned: load → clean → explore → visualise → conclude. The question we will answer: “Does experience relate to salary, and which department pays most?”
Use the dataset below (save it as employees.csv), or bring your own and adapt the steps.
name,department,experience,salary
Asha,Engineering,5,90
Ravi,Engineering,2,60
Meera,Sales,8,
Karan,Sales,3,45
Sara,Engineering,10,130
Dev,Sales,6,70
Nisha,Engineering,1,40
Amit,Sales,4,45
Note: This is the raw CSV file, so there is no output. It holds eight employees across two departments. Meera’s salary is blank, which we will handle during cleaning.
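If you would rather not create the file by hand, the sketch below writes the same eight rows to employees.csv from Python. It assumes you want the file in the current working directory; the variable names CSV_TEXT and path are only illustrative.

```python
from pathlib import Path

# The dataset from this lesson, exactly as listed above.
# Meera's salary is deliberately left blank.
CSV_TEXT = """name,department,experience,salary
Asha,Engineering,5,90
Ravi,Engineering,2,60
Meera,Sales,8,
Karan,Sales,3,45
Sara,Engineering,10,130
Dev,Sales,6,70
Nisha,Engineering,1,40
Amit,Sales,4,45
"""

# Write the file into the current working directory (an assumption).
path = Path("employees.csv")
path.write_text(CSV_TEXT, encoding="utf-8")

# Sanity check: 1 header line + 8 data lines.
lines = path.read_text(encoding="utf-8").splitlines()
print(len(lines), "lines written to", path)
```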
Step 1 — Load and first look
Read the file and run your standard health check.
import pandas as pd
df = pd.read_csv('employees.csv')
print(df.shape)
print(df.head())
df.info()
Note: Output:
(8, 4)
    name   department  experience  salary
0   Asha  Engineering           5    90.0
1   Ravi  Engineering           2    60.0
2  Meera        Sales           8     NaN
...
salary    7 non-null    float64
8 rows, 4 columns. info() confirms salary has only 7 non-null values: Meera’s is missing. We found the problem before doing any analysis.
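As an optional extra to the health check, the sketch below counts missing values per column and lists the affected rows directly, rather than reading them off info(). It rebuilds the lesson's data inline with io.StringIO so it runs without the file; the names missing_per_column and rows_with_gaps are illustrative.

```python
import io
import pandas as pd

# Inline copy of the lesson's CSV so this example runs on its own.
CSV_TEXT = """name,department,experience,salary
Asha,Engineering,5,90
Ravi,Engineering,2,60
Meera,Sales,8,
Karan,Sales,3,45
Sara,Engineering,10,130
Dev,Sales,6,70
Nisha,Engineering,1,40
Amit,Sales,4,45
"""
df = pd.read_csv(io.StringIO(CSV_TEXT))

# Count missing values in every column.
missing_per_column = df.isnull().sum()
print(missing_per_column)

# Show only the rows that contain at least one missing value.
rows_with_gaps = df[df.isnull().any(axis=1)]
print(rows_with_gaps)
```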
Step 2 — Clean
Fill the missing salary with the average salary, so no row is dropped.
df['salary'] = df['salary'].fillna(df['salary'].mean())
print(df['salary'].isnull().sum(), 'missing salaries now')
Note: Output:
0 missing salaries now
Meera’s blank is filled with the average of the other seven salaries (480 / 7 ≈ 68.57), so the column is complete. Zero missing values: ready to analyse.
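Filling is one option; dropping the row is the other. The sketch below runs both on copies of the data so you can compare what each choice leaves you with. It is only an illustration on the lesson's eight rows, with the data rebuilt inline; the names filled and dropped are illustrative.

```python
import io
import pandas as pd

# Inline copy of the lesson's CSV so this example runs on its own.
CSV_TEXT = """name,department,experience,salary
Asha,Engineering,5,90
Ravi,Engineering,2,60
Meera,Sales,8,
Karan,Sales,3,45
Sara,Engineering,10,130
Dev,Sales,6,70
Nisha,Engineering,1,40
Amit,Sales,4,45
"""
df = pd.read_csv(io.StringIO(CSV_TEXT))

# Option A: fill the blank with the mean of the known salaries (keeps all rows).
filled = df.copy()
filled["salary"] = filled["salary"].fillna(filled["salary"].mean())

# Option B: drop any row whose salary is missing (loses Meera's row).
dropped = df.dropna(subset=["salary"])

print("rows after fill:", len(filled))
print("rows after drop:", len(dropped))

# The value that was written into Meera's row by option A.
meera_salary = filled.loc[filled["name"] == "Meera", "salary"].iloc[0]
print("Meera's filled salary:", round(meera_salary, 2))
```

Filling keeps the sample size; dropping keeps every remaining value real. Either is defensible on a dataset this small, as long as the choice is stated.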
Step 3 — Explore with groupby
Answer the “which department pays most?” half of the question.
print(df.groupby('department')['salary'].mean())
Note: Output:
department
Engineering    80.000000
Sales          57.142857
Name: salary, dtype: float64
Engineering pays more on average (80.0 vs about 57.1). Note that the Sales figure includes Meera’s filled-in value. One line of groupby answered a real question.
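A mean on its own can hide a lot in four rows per group. As a possible extension, the sketch below asks groupby for several statistics at once using agg and sorts the departments by average salary. The data is rebuilt inline and cleaned the same way as in Step 2; the name summary is illustrative.

```python
import io
import pandas as pd

# Inline copy of the lesson's CSV so this example runs on its own.
CSV_TEXT = """name,department,experience,salary
Asha,Engineering,5,90
Ravi,Engineering,2,60
Meera,Sales,8,
Karan,Sales,3,45
Sara,Engineering,10,130
Dev,Sales,6,70
Nisha,Engineering,1,40
Amit,Sales,4,45
"""
df = pd.read_csv(io.StringIO(CSV_TEXT))

# Same cleaning as Step 2: fill the blank salary with the mean.
df["salary"] = df["salary"].fillna(df["salary"].mean())

# Several statistics per department, highest average first.
summary = (
    df.groupby("department")["salary"]
    .agg(["count", "mean", "min", "max"])
    .sort_values("mean", ascending=False)
)
print(summary.round(2))
```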
Step 4 — A statistic and a chart
Now the “does experience relate to salary?” half. Check the correlation, then make it visible with a scatter plot.
import matplotlib.pyplot as plt
print('correlation:', round(df['experience'].corr(df['salary']), 2))
plt.scatter(df['experience'], df['salary'])
plt.title('Experience vs Salary')
plt.xlabel('Years of experience')
plt.ylabel('Salary (thousands)')
plt.show()
Note: Output:
correlation: 0.82
(A scatter plot of dots climbing from lower-left to upper-right.)
A correlation of 0.82 is strong and positive, and the rising dots show it clearly: more experience goes with higher salary in this data.
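A mean-filled value sits at the average salary whatever the person's experience, so it can pull a correlation towards zero. The sketch below is one way to check how much the cleaning choice matters: it computes the correlation once on the filled data and once with the incomplete row dropped. The data is rebuilt inline; r_filled and r_dropped are illustrative names.

```python
import io
import pandas as pd

# Inline copy of the lesson's CSV so this example runs on its own.
CSV_TEXT = """name,department,experience,salary
Asha,Engineering,5,90
Ravi,Engineering,2,60
Meera,Sales,8,
Karan,Sales,3,45
Sara,Engineering,10,130
Dev,Sales,6,70
Nisha,Engineering,1,40
Amit,Sales,4,45
"""
df = pd.read_csv(io.StringIO(CSV_TEXT))

# Version 1: blank salary filled with the mean (as in Step 2).
filled = df.copy()
filled["salary"] = filled["salary"].fillna(filled["salary"].mean())
r_filled = filled["experience"].corr(filled["salary"])

# Version 2: the row with the blank salary removed.
dropped = df.dropna(subset=["salary"])
r_dropped = dropped["experience"].corr(dropped["salary"])

print("correlation (filled): ", round(r_filled, 2))
print("correlation (dropped):", round(r_dropped, 2))
```

If the two numbers differ noticeably, that is worth a line in your conclusion.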
Step 5 — Conclude
The final step is the most important: say what you found, in plain words a manager could act on.
- Engineering pays more on average than Sales (80k vs about 57k).
- Experience and salary are strongly linked (correlation 0.82): each extra year tends to come with higher pay.
- We filled one missing salary with the average, so treat that single row with mild caution.
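To put a rough number on "each extra year tends to come with higher pay", one option is to fit a straight line through the points. The sketch below uses numpy.polyfit with degree 1; treating the relationship as linear is an assumption, and with only eight rows the slope is a rough guide rather than a firm estimate. The data is rebuilt inline and cleaned as in Step 2.

```python
import io
import numpy as np
import pandas as pd

# Inline copy of the lesson's CSV so this example runs on its own.
CSV_TEXT = """name,department,experience,salary
Asha,Engineering,5,90
Ravi,Engineering,2,60
Meera,Sales,8,
Karan,Sales,3,45
Sara,Engineering,10,130
Dev,Sales,6,70
Nisha,Engineering,1,40
Amit,Sales,4,45
"""
df = pd.read_csv(io.StringIO(CSV_TEXT))

# Same cleaning as Step 2: fill the blank salary with the mean.
df["salary"] = df["salary"].fillna(df["salary"].mean())

# Least-squares straight line: salary = slope * experience + intercept.
slope, intercept = np.polyfit(df["experience"], df["salary"], 1)

print(f"about {slope:.1f}k more salary per extra year of experience")
print(f"fitted starting salary (0 years): about {intercept:.1f}k")
```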
Your tasks
- Load the CSV and run shape, head and info.
- Clean it: handle the missing salary (fill or drop) and confirm zero blanks.
- Use groupby to compare a number across a category.
- Compute one correlation and draw one chart that shows a finding.
- Write a 3–4 line conclusion in plain language.
Tip: Work one step at a time and print after each — load, then look; clean, then check; explore, then chart. A finished, honest analysis of a small dataset beats a fancy half-broken one.
Watch out: Be honest about your cleaning choices. You filled a missing salary with an average — that is a reasonable guess, not a fact. Always note assumptions like this in your conclusion.
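One way to keep that assumption visible in the data itself is to record which rows were filled before you fill them. The sketch below adds a boolean column for this; the column name salary_was_missing is hypothetical, and the data is rebuilt inline so the example runs on its own.

```python
import io
import pandas as pd

# Inline copy of the lesson's CSV so this example runs on its own.
CSV_TEXT = """name,department,experience,salary
Asha,Engineering,5,90
Ravi,Engineering,2,60
Meera,Sales,8,
Karan,Sales,3,45
Sara,Engineering,10,130
Dev,Sales,6,70
Nisha,Engineering,1,40
Amit,Sales,4,45
"""
df = pd.read_csv(io.StringIO(CSV_TEXT))

# Record which salaries are blank BEFORE filling them
# (salary_was_missing is a made-up column name for illustration).
df["salary_was_missing"] = df["salary"].isnull()

# Same cleaning as Step 2: fill the blank salary with the mean.
df["salary"] = df["salary"].fillna(df["salary"].mean())

# Later you can still list exactly which rows hold estimated values.
estimated_rows = df[df["salary_was_missing"]]
print(estimated_rows[["name", "department", "salary"]])
```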
Q. In this project, what is the correct order of the workflow steps?
Note: When this works, you have completed a full data-science project (load, clean, explore, visualise and conclude), which is exactly what the job involves. Save the notebook, write up your conclusion, and add it to your portfolio!
✍️ Practice
- Run the whole project on the employees dataset and produce your scatter plot.
- Swap in your own CSV and repeat all five steps, ending with a written conclusion.
🏠 Homework
- Write a short report (6–8 lines) on your dataset: the question you asked, what you cleaned, the chart you made, and your conclusion. Include the chart and add the project to your GitHub.