Project · Core · 150 min read

Project: Analyse a Dataset End to End

Put it all together — load a real dataset, clean it, explore it, chart it, and write a clear conclusion.

What you will learn

  • Run the full data-science workflow on one dataset
  • Clean and explore data, then visualise a finding
  • Write a conclusion someone could act on

What you will build

This is the capstone. You will take a dataset of employees — their department, years of experience and salary — and run the whole workflow you have learned: load → clean → explore → visualise → conclude. The question we will answer: “Does experience relate to salary, and which department pays most?”

Use the dataset below (save it as employees.csv), or bring your own and adapt the steps.

employees.csv — note the one blank salary (Meera)
name,department,experience,salary
Asha,Engineering,5,90
Ravi,Engineering,2,60
Meera,Sales,8,
Karan,Sales,3,45
Sara,Engineering,10,130
Dev,Sales,6,70
Nisha,Engineering,1,40
Amit,Sales,4,45

Note: This is the raw CSV file, so there is no output to show. It holds eight employees across two departments. Meera’s salary is blank, which we will handle during cleaning.
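If you would rather not create the file by hand, a minimal sketch like the one below writes the same eight rows to employees.csv in the current working directory. The names csv_text and path are illustrative choices, not part of the lesson itself.

```python
from pathlib import Path

# The same rows as the listing above; Meera's salary is left blank on purpose.
csv_text = """name,department,experience,salary
Asha,Engineering,5,90
Ravi,Engineering,2,60
Meera,Sales,8,
Karan,Sales,3,45
Sara,Engineering,10,130
Dev,Sales,6,70
Nisha,Engineering,1,40
Amit,Sales,4,45
"""

# Write the file next to your notebook or script.
path = Path("employees.csv")
path.write_text(csv_text, encoding="utf-8")

# Quick sanity check: one header line plus eight data rows.
lines = path.read_text(encoding="utf-8").splitlines()
print(len(lines), "lines written")
```

Run this once before Step 1 so that pd.read_csv('employees.csv') finds the file.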

Step 1 — Load and first look

Read the file and run your standard health check.

Load the data and inspect it
import pandas as pd

df = pd.read_csv('employees.csv')
print(df.shape)   # (rows, columns)
print(df.head())  # first five rows
df.info()         # column dtypes and non-null counts

Note: Output:
(8, 4)
    name   department  experience  salary
0   Asha  Engineering           5    90.0
1   Ravi  Engineering           2    60.0
2  Meera        Sales           8     NaN
...
salary    7 non-null    float64

8 rows, 4 columns. info() confirms salary has only 7 non-null values: Meera’s is missing. We found the problem before doing any analysis.
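The health check can also be made explicit by counting the gaps per column and listing the affected rows. The sketch below inlines the CSV text so it runs on its own; the names missing_per_column and rows_with_gaps are illustrative.

```python
import io
import pandas as pd

# Same rows as employees.csv, inlined so the sketch is self-contained.
csv_text = """name,department,experience,salary
Asha,Engineering,5,90
Ravi,Engineering,2,60
Meera,Sales,8,
Karan,Sales,3,45
Sara,Engineering,10,130
Dev,Sales,6,70
Nisha,Engineering,1,40
Amit,Sales,4,45
"""
df = pd.read_csv(io.StringIO(csv_text))

# Number of missing values in each column.
missing_per_column = df.isnull().sum()
print(missing_per_column)

# The rows that contain at least one missing value.
rows_with_gaps = df[df.isnull().any(axis=1)]
print(rows_with_gaps)
```

On a larger dataset this tells you which columns need cleaning and which rows are affected, which info() alone does not show.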

Step 2 — Clean

Fill the missing salary with the average salary, so no row is dropped.

Fill the blank salary with the column mean
df['salary'] = df['salary'].fillna(df['salary'].mean())
print(df['salary'].isnull().sum(), 'missing salaries now')

Note: Output:
0 missing salaries now

Meera’s blank is filled with the average of the other seven salaries (480 / 7 ≈ 68.57), so the column is complete. Zero missing values: ready to analyse.
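Filling with the overall mean is only one option. As a rough comparison, the sketch below puts it next to two common alternatives: filling with the mean of the employee's own department, and dropping the row. The CSV text is inlined so the block runs on its own, and the variable names are illustrative.

```python
import io
import pandas as pd

# Same rows as employees.csv, inlined so the sketch is self-contained.
csv_text = """name,department,experience,salary
Asha,Engineering,5,90
Ravi,Engineering,2,60
Meera,Sales,8,
Karan,Sales,3,45
Sara,Engineering,10,130
Dev,Sales,6,70
Nisha,Engineering,1,40
Amit,Sales,4,45
"""
df = pd.read_csv(io.StringIO(csv_text))

# Option A (used in this project): fill with the overall mean salary.
overall_fill = df["salary"].fillna(df["salary"].mean())

# Option B: fill with the mean salary of the employee's own department.
dept_mean = df.groupby("department")["salary"].transform("mean")
dept_fill = df["salary"].fillna(dept_mean)

# Option C: drop the incomplete row entirely.
dropped = df.dropna(subset=["salary"])

# Compare what each option does to Meera's row.
meera = df.index[df["name"] == "Meera"][0]
print("overall-mean fill:   ", round(overall_fill.loc[meera], 2))
print("department-mean fill:", round(dept_fill.loc[meera], 2))
print("rows left after drop:", len(dropped))
```

Which option is best depends on the question. Whichever you pick, record the choice so it can go into your conclusion.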

Step 3 — Explore with groupby

Answer the “which department pays most?” half of the question.

Average salary per department
print(df.groupby('department')['salary'].mean())

Note: Output:
department
Engineering    80.000000
Sales          57.142857
Name: salary, dtype: float64

Engineering pays more on average (80.0 vs about 57.1). The Sales figure includes Meera’s filled-in salary. One line of groupby answered a real question.
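A mean on its own can hide a lot in a group of four people. As a small extension, the sketch below asks groupby for several statistics at once with agg. It inlines the CSV text and repeats the Step 2 fill so it runs on its own; the name summary is illustrative.

```python
import io
import pandas as pd

# Same rows as employees.csv, inlined so the sketch is self-contained.
csv_text = """name,department,experience,salary
Asha,Engineering,5,90
Ravi,Engineering,2,60
Meera,Sales,8,
Karan,Sales,3,45
Sara,Engineering,10,130
Dev,Sales,6,70
Nisha,Engineering,1,40
Amit,Sales,4,45
"""
df = pd.read_csv(io.StringIO(csv_text))

# Same cleaning as Step 2: fill the blank salary with the column mean.
df["salary"] = df["salary"].fillna(df["salary"].mean())

# Several statistics per department in one call.
summary = df.groupby("department")["salary"].agg(
    ["count", "mean", "median", "min", "max"]
)
print(summary.round(2))
```

The count column is a useful reminder of how few rows sit behind each average.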

Step 4 — A statistic and a chart

Now the “does experience relate to salary?” half. Check the correlation, then make it visible with a scatter plot.

Correlation plus a scatter plot of the relationship
import matplotlib.pyplot as plt

print('correlation:', round(df['experience'].corr(df['salary']), 2))

plt.scatter(df['experience'], df['salary'])
plt.title('Experience vs Salary')
plt.xlabel('Years of experience')
plt.ylabel('Salary (thousands)')
plt.show()

Note: Output:
correlation: 0.82
(A scatter plot of dots climbing from lower-left to upper-right.)

A correlation of 0.82 is strong and positive, and the rising dots show it clearly: more experience goes with higher salary in this data.
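The correlation says how tightly the points follow a line, not how steep that line is. To put a rough number on "how much more pay per year", one option is a straight-line fit with NumPy's polyfit. This is a sketch under the assumption that a straight line is a fair summary of eight points; it inlines the CSV text and repeats the Step 2 fill so it runs on its own.

```python
import io
import numpy as np
import pandas as pd

# Same rows as employees.csv, inlined so the sketch is self-contained.
csv_text = """name,department,experience,salary
Asha,Engineering,5,90
Ravi,Engineering,2,60
Meera,Sales,8,
Karan,Sales,3,45
Sara,Engineering,10,130
Dev,Sales,6,70
Nisha,Engineering,1,40
Amit,Sales,4,45
"""
df = pd.read_csv(io.StringIO(csv_text))

# Same cleaning as Step 2: fill the blank salary with the column mean.
df["salary"] = df["salary"].fillna(df["salary"].mean())

# Least-squares straight line: salary ≈ slope * experience + intercept.
x = df["experience"].to_numpy(dtype=float)
y = df["salary"].to_numpy(dtype=float)
slope, intercept = np.polyfit(x, y, 1)

print("extra salary per year of experience (thousands):", round(slope, 2))
print("fitted salary at zero experience (thousands):", round(intercept, 2))
```

On this dataset the slope comes out at roughly 8, that is, about 8k more salary per additional year of experience. With only eight rows, treat that as a rough indication rather than a precise estimate.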

Step 5 — Conclude

The final step is the most important: say what you found, in plain words a manager could act on.

  • Engineering pays more on average than Sales (80k vs about 57.1k).
  • Experience and salary are strongly linked (correlation 0.82): each extra year tends to come with higher pay.
  • We filled one missing salary (Meera, in Sales) with the overall average of about 68.6k, so treat that row, and the Sales average it feeds into, with mild caution.

Your tasks

  1. Load the CSV and run shape, head and info.
  2. Clean it: handle the missing salary (fill or drop) and confirm zero blanks.
  3. Use groupby to compare a number across a category.
  4. Compute one correlation and draw one chart that shows a finding.
  5. Write a 3–4 line conclusion in plain language.

Tip: Work one step at a time and print after each — load, then look; clean, then check; explore, then chart. A finished, honest analysis of a small dataset beats a fancy half-broken one.

Watch out: Be honest about your cleaning choices. You filled a missing salary with an average — that is a reasonable guess, not a fact. Always note assumptions like this in your conclusion.
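One way to back up that honesty with evidence is a quick sensitivity check: rerun the key numbers with the row dropped instead of filled and see how far they move. The sketch below inlines the CSV text so it runs on its own; the variable names are illustrative.

```python
import io
import pandas as pd

# Same rows as employees.csv, inlined so the sketch is self-contained.
csv_text = """name,department,experience,salary
Asha,Engineering,5,90
Ravi,Engineering,2,60
Meera,Sales,8,
Karan,Sales,3,45
Sara,Engineering,10,130
Dev,Sales,6,70
Nisha,Engineering,1,40
Amit,Sales,4,45
"""
df = pd.read_csv(io.StringIO(csv_text))

# Version 1: the missing salary filled with the column mean (as in Step 2).
filled = df.assign(salary=df["salary"].fillna(df["salary"].mean()))

# Version 2: the incomplete row removed instead.
dropped = df.dropna(subset=["salary"])

# Recompute the two headline numbers for each version.
corr_filled = filled["experience"].corr(filled["salary"])
corr_dropped = dropped["experience"].corr(dropped["salary"])
sales_filled = filled.groupby("department")["salary"].mean()["Sales"]
sales_dropped = dropped.groupby("department")["salary"].mean()["Sales"]

print("correlation  filled:", round(corr_filled, 2), " dropped:", round(corr_dropped, 2))
print("Sales mean   filled:", round(sales_filled, 2), " dropped:", round(sales_dropped, 2))
```

If the two versions lead to the same conclusion, say so in your write-up. If they disagree, report both and explain which one you trust more.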

Q. In this project, what is the correct order of the workflow steps?

Answer: Load the data, clean it, explore it, visualise a finding, then conclude in plain language. Most data-science projects follow this same order.

Note: When this works, you have completed a full data-science project: load, clean, explore, visualise and conclude. That is the core of the job. Save the notebook, write up your conclusion, and add it to your portfolio!

✍️ Practice

  1. Run the whole project on the employees dataset and produce your scatter plot.
  2. Swap in your own CSV and repeat all five steps, ending with a written conclusion.

🏠 Homework

  1. Write a short report (6–8 lines) on your dataset: the question you asked, what you cleaned, the chart you made, and your conclusion. Include the chart and add the project to your GitHub.