Build Your First AI Model
Train a real machine-learning model end to end with scikit-learn — and test how good it is.
What you will learn
- Split data into training and test sets
- Train a model with .fit() and predict with .predict()
- Measure accuracy honestly
The five steps of every ML project
- Get data: features (X) and labels (y).
- Split: keep some data aside for testing.
- Train: the model learns from the training data (.fit()).
- Predict: ask the model about new data (.predict()).
- Evaluate: measure how often it is right.
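The Evaluate step, "how often it is right", boils down to a simple fraction: correct predictions divided by total predictions. Here is a minimal sketch using made-up answers (the lists y_true and y_pred are invented for illustration and do not come from a trained model), checked against scikit-learn's accuracy_score helper:

```python
from sklearn.metrics import accuracy_score

# Made-up example values, for illustration only
y_true = [1, 0, 1, 1]   # what really happened
y_pred = [1, 0, 0, 1]   # what a model guessed

# Count the positions where the guess matches the truth
correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
accuracy = correct / len(y_true)

print('By hand:', accuracy)                               # 3 of 4 correct -> 0.75
print('accuracy_score:', accuracy_score(y_true, y_pred))  # same fraction
```

Both lines report the same number, so an accuracy of 0.75 simply means "right 3 times out of 4".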
A complete, tiny example
We will predict whether a student passes from the hours they studied, using a real scikit-learn model instead of our hand-made rule. scikit-learn is a free Python library (a ready-made toolbox of code) for machine learning. Logistic Regression is one of its simplest models: despite the long name, it learns a yes/no boundary, like “below this many hours → fail, above it → pass”.
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
# 1) Data: features (hours) and labels (1 = pass, 0 = fail)
X = [[1],[2],[3],[4],[5],[6],[7],[8]]
y = [ 0, 0, 0, 0, 1, 1, 1, 1]
# 2) Split: 75% to train on, 25% held back to test honestly
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)
# 3) Train
model = LogisticRegression()
model.fit(X_train, y_train)
# 4) Predict for a brand-new student who studied 7 hours
print('Studied 7h ->', model.predict([[7]])[0])
# 5) Evaluate on the held-back test data
print('Accuracy:', model.score(X_test, y_test))
Note: Output:
Studied 7h -> 1
Accuracy: 1.0
The model predicts 1 (pass) for 7 hours, and got 100% of the held-back test examples right. The computer found the “study more → pass” pattern by itself; we never wrote the rule.
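To see the yes/no boundary the model found, you can look at the two numbers it learned. This is a minimal sketch, not part of the lesson's five steps: it assumes we fit on all eight rows (we only want to inspect the rule here, not grade it), and the names w, b and boundary are ours. For this data the boundary should land somewhere between 4 and 5 hours.

```python
from sklearn.linear_model import LogisticRegression

# Same toy data as the example: hours studied and pass (1) / fail (0)
X = [[1], [2], [3], [4], [5], [6], [7], [8]]
y = [0, 0, 0, 0, 1, 1, 1, 1]

# Assumption: fit on every row, because we are inspecting the learned
# rule here, not measuring accuracy.
model = LogisticRegression()
model.fit(X, y)

w = model.coef_[0][0]     # learned weight for "hours"
b = model.intercept_[0]   # learned offset
boundary = -b / w         # hours at which the pass probability is exactly 0.5

print('Weight:', w)
print('Offset:', b)
print('Boundary (hours):', boundary)
print('P(pass) at 7h:', model.predict_proba([[7]])[0][1])
```

Students above the boundary are predicted to pass and students below it to fail. predict_proba shows how confident the model is instead of only the final 0 or 1.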
Why split the data?
If we tested the model on the same rows it learned from, it could just memorise them and look perfect while being useless on new students. Holding back a test set checks it learned the real pattern, not the answers.
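The sketch below illustrates the memorising problem with a different model from the one in this lesson: a 1-nearest-neighbour classifier, which answers by copying the label of the closest training row. The data is made up for illustration and split by hand. The training labels contain two unusual students (one who passed after 3 hours and one who failed after 6), and the test rows follow the usual pattern.

```python
from sklearn.neighbors import KNeighborsClassifier

# Made-up training data with two "noisy" rows (3h passed, 6h failed)
X_train = [[1], [2], [3], [4], [5], [6], [7], [8]]
y_train = [0, 0, 1, 0, 1, 0, 1, 1]

# Made-up new students who follow the usual pattern
X_test = [[1.2], [2.9], [6.1], [7.8]]
y_test = [0, 0, 1, 1]

# With n_neighbors=1 the model simply copies the nearest training row
memoriser = KNeighborsClassifier(n_neighbors=1)
memoriser.fit(X_train, y_train)

train_acc = memoriser.score(X_train, y_train)
test_acc = memoriser.score(X_test, y_test)
print('Accuracy on training rows:', train_acc)   # 1.0, every row matches itself
print('Accuracy on new rows:', test_acc)         # 0.5, the copied noise misleads it
```

Scoring on the training rows makes this model look perfect, while on new students it is right only half the time. Only the held-back score reveals that.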
Watch out: Reporting accuracy on training data is one of the most common beginner mistakes. Always score on data the model has not seen.
Tip: Notice the pattern is always the same: create the model, .fit(X_train, y_train) to learn, .predict(...) to guess, .score(...) to grade. Swap in different data or a different model and these four steps stay identical.
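As a sketch of that swap, the loop below runs the same four calls on two models: the LogisticRegression from this lesson and a DecisionTreeClassifier, another scikit-learn model chosen here only as an example. The dictionary names models and scores are ours.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Same toy data and split as the example above
X = [[1], [2], [3], [4], [5], [6], [7], [8]]
y = [0, 0, 0, 0, 1, 1, 1, 1]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

# Create: two different models, used in exactly the same way
models = {
    'LogisticRegression': LogisticRegression(),
    'DecisionTreeClassifier': DecisionTreeClassifier(random_state=0),
}

scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)                   # learn
    prediction = model.predict([[7]])[0]          # guess
    scores[name] = model.score(X_test, y_test)    # grade
    print(name, '-> 7h:', prediction, '| accuracy:', scores[name])
```

Only the line that creates the model changes; the fit, predict and score calls are the same for both.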
Q. Why do we keep a separate test set?
✍️ Practice
- Change the new student to 2 hours and predict — does the model say fail (0)?
- Add more rows to X and y, re-run, and see if accuracy stays high.
🏠 Homework
- Reuse this five-step pipeline for a different idea, e.g. predict “buy / not buy” from a price number. Write the X and y.