Getting Data with SQL
Most real data lives in a database, not a CSV — so a data scientist pulls it out with SQL, the language for asking databases questions.
What you will learn
- Read a SELECT query with WHERE, ORDER BY and a JOIN
- Group and summarise rows with GROUP BY
- Run a SQL query from Python straight into a DataFrame
Why SQL is the #1 data-scientist skill
CSV files are the practice data of a course. In a real company the data lives in a database — a giant, organised store of tables that many people share. To get data out of it, you do not open a file; you ask a question in a language called SQL (Structured Query Language, said “sequel” or “ess-cue-el”). Almost every data-science job interview tests SQL, because it is how you reach the data in the first place.
The good news: SQL reads almost like English, and you already think this way from Pandas. “Show me the columns I want, from this table, where some condition is true, grouped by a category” — that is one SQL query.
The shape of a SELECT query
Reading data is done with a SELECT statement. It is built from a few keywords, always written in this order. Imagine a sales table with columns product, region and amount.
- SELECT — which columns you want (or
*for all of them). - FROM — which table to read from.
- WHERE — keep only rows that match a condition (this is filtering).
- ORDER BY — sort the result.
- LIMIT — keep only the first few rows.
Here is a query that asks: “give me the product and amount of every sale over 500 in the North region, biggest first.”
SELECT product, amount
FROM sales
WHERE region = 'North' AND amount > 500
ORDER BY amount DESC
LIMIT 5;Note: Output:
product amount
Monitor 900
Laptop 820
Keyboard 560
The database returned only the columns we asked for, only North rows above 500, sorted from biggest to smallest, capped at 5. DESC means descending (high to low); leave it off for low-to-high.
GROUP BY — the “per category” question in SQL
You met groupby in Pandas; SQL has the very same idea, written GROUP BY. You pick a category to group on, then an aggregate function — SUM, AVG, COUNT, MAX — to squeeze each group into one number. This query answers “what is the total amount sold per region?”
SELECT region, SUM(amount) AS total_sales
FROM sales
GROUP BY region
ORDER BY total_sales DESC;Note: Output:
region total_sales
South 970
North 680
East 520
GROUP BY region split the rows into one pile per region, SUM(amount) added up each pile, and AS total_sales gave that new column a friendly name. South sold the most. This is exactly df.groupby('region')['amount'].sum() from the Pandas lesson — same idea, different language.
JOIN — combining two tables
Just like the Pandas merge lesson, real answers often need two tables stitched together on a shared column. SQL does this with JOIN ... ON. Suppose a customers table maps each cust_id to a city; this query brings the city onto every order.
SELECT orders.amount, customers.city
FROM orders
JOIN customers ON orders.cust_id = customers.cust_id
WHERE customers.city = 'Pune';Note: Output:
amount city
250 Pune
300 Pune
JOIN customers ON orders.cust_id = customers.cust_id matched each order to its customer by the shared cust_id, so we could then keep only Pune orders. This is the SQL twin of orders.merge(customers, on='cust_id').
Running SQL from Python into a DataFrame
Here is the punchline that ties SQL to everything you have learned: Pandas can run a SQL query for you and hand back the result as a DataFrame, ready to clean, chart and model. The function is pd.read_sql. (Here we use SQLite, a tiny database that lives in a single file, so the example needs no setup.)
import pandas as pd
import sqlite3
# Connect to a database file (SQLite needs no server)
conn = sqlite3.connect('shop.db')
query = '''
SELECT region, SUM(amount) AS total_sales
FROM sales
GROUP BY region
'''
df = pd.read_sql(query, conn) # run the SQL, get a DataFrame back
print(df)Note: Output:
region total_sales
0 East 520
1 North 680
2 South 970
The database did the grouping, and the answer arrived as an ordinary DataFrame. From here you would chart it, merge it, or feed it to a model — the SQL just replaced read_csv as the way you got the data in.
| SQL | The Pandas equivalent you know | Means |
|---|---|---|
SELECT col | df[['col']] | Pick columns |
WHERE x > 5 | df[df['x'] > 5] | Filter rows |
ORDER BY x | df.sort_values('x') | Sort |
GROUP BY x | df.groupby('x') | Group per category |
JOIN ... ON | df.merge(other, on=...) | Combine tables |
Tip: Learn SQL and Pandas as two accents of the same language. If you can think “SELECT these columns WHERE this is true, GROUP BY that”, you can write it in either one — and real jobs use both, often in the same project.
Watch out: SQL string comparisons are case- and quote-sensitive: WHERE region = 'North' needs single quotes around text and must match the stored value exactly. region = North (no quotes) makes the database look for a column called North and error.
Q. In SQL, which clause keeps only the rows that match a condition (like Pandas filtering)?
✍️ Practice
- Write a SELECT query (on paper or in SQLite) that returns the
productandamountof all sales over 200, sorted from biggest to smallest. - Write a GROUP BY query that returns the average
amountperproduct, then say which Pandas line does the same thing.
🏠 Homework
- Create a small SQLite database file with a
salestable (a few rows), then usepd.read_sqlto run a GROUP BY query and load the result into a DataFrame. Print the DataFrame and write one sentence about what it shows.