Pandas›Extra· 45 min read

Getting Data with SQL

Most real data lives in a database, not a CSV — so a data scientist pulls it out with SQL, the language for asking databases questions.

What you will learn

Read a SELECT query with WHERE, ORDER BY and a JOIN
Group and summarise rows with GROUP BY
Run a SQL query from Python straight into a DataFrame

Why SQL is the #1 data-scientist skill

CSV files are the practice data of a course. In a real company the data lives in a database — a giant, organised store of tables that many people share. To get data out of it, you do not open a file; you ask a question in a language called SQL (Structured Query Language, said “sequel” or “ess-cue-el”). Almost every data-science job interview tests SQL, because it is how you reach the data in the first place.

The good news: SQL reads almost like English, and you already think this way from Pandas. “Show me the columns I want, from this table, where some condition is true, grouped by a category” — that is one SQL query.

The shape of a SELECT query

Reading data is done with a SELECT statement. It is built from a few keywords, always written in this order. Imagine a sales table with columns product, region and amount.

SELECT — which columns you want (or * for all of them).
FROM — which table to read from.
WHERE — keep only rows that match a condition (this is filtering).
ORDER BY — sort the result.
LIMIT — keep only the first few rows.

Here is a query that asks: “give me the product and amount of every sale over 500 in the North region, biggest first.”

A SELECT query with filter, sort and limit

SELECT product, amount
FROM sales
WHERE region = 'North' AND amount > 500
ORDER BY amount DESC
LIMIT 5;

Note: Output: product amount Monitor 900 Laptop 820 Keyboard 560 The database returned only the columns we asked for, only North rows above 500, sorted from biggest to smallest, capped at 5. DESC means descending (high to low); leave it off for low-to-high.

GROUP BY — the “per category” question in SQL

You met groupby in Pandas; SQL has the very same idea, written GROUP BY. You pick a category to group on, then an aggregate function — SUM, AVG, COUNT, MAX — to squeeze each group into one number. This query answers “what is the total amount sold per region?”

Total sales per region with GROUP BY

SELECT region, SUM(amount) AS total_sales
FROM sales
GROUP BY region
ORDER BY total_sales DESC;

Note: Output: region total_sales South 970 North 680 East 520 GROUP BY region split the rows into one pile per region, SUM(amount) added up each pile, and AS total_sales gave that new column a friendly name. South sold the most. This is exactly df.groupby('region')['amount'].sum() from the Pandas lesson — same idea, different language.

JOIN — combining two tables

Just like the Pandas merge lesson, real answers often need two tables stitched together on a shared column. SQL does this with JOIN ... ON. Suppose a customers table maps each cust_id to a city; this query brings the city onto every order.

Join orders to customers, then filter by city

SELECT orders.amount, customers.city
FROM orders
JOIN customers ON orders.cust_id = customers.cust_id
WHERE customers.city = 'Pune';

Note: Output: amount city 250 Pune 300 Pune JOIN customers ON orders.cust_id = customers.cust_id matched each order to its customer by the shared cust_id, so we could then keep only Pune orders. This is the SQL twin of orders.merge(customers, on='cust_id').

Running SQL from Python into a DataFrame

Here is the punchline that ties SQL to everything you have learned: Pandas can run a SQL query for you and hand back the result as a DataFrame, ready to clean, chart and model. The function is pd.read_sql. (Here we use SQLite, a tiny database that lives in a single file, so the example needs no setup.)

Run a SQL query and load the result into Pandas

import pandas as pd
import sqlite3

# Connect to a database file (SQLite needs no server)
conn = sqlite3.connect('shop.db')

query = '''
    SELECT region, SUM(amount) AS total_sales
    FROM sales
    GROUP BY region
'''

df = pd.read_sql(query, conn)   # run the SQL, get a DataFrame back
print(df)

Note: Output: region total_sales 0 East 520 1 North 680 2 South 970 The database did the grouping, and the answer arrived as an ordinary DataFrame. From here you would chart it, merge it, or feed it to a model — the SQL just replaced read_csv as the way you got the data in.

SQL	The Pandas equivalent you know	Means
`SELECT col`	`df[['col']]`	Pick columns
`WHERE x > 5`	`df[df['x'] > 5]`	Filter rows
`ORDER BY x`	`df.sort_values('x')`	Sort
`GROUP BY x`	`df.groupby('x')`	Group per category
`JOIN ... ON`	`df.merge(other, on=...)`	Combine tables

Tip: Learn SQL and Pandas as two accents of the same language. If you can think “SELECT these columns WHERE this is true, GROUP BY that”, you can write it in either one — and real jobs use both, often in the same project.

Watch out: SQL string comparisons are case- and quote-sensitive: WHERE region = 'North' needs single quotes around text and must match the stored value exactly. region = North (no quotes) makes the database look for a column called North and error.

Q. In SQL, which clause keeps only the rows that match a condition (like Pandas filtering)?

Answer: WHERE filters rows to those matching a condition — the SQL twin of df[df['col'] > value] in Pandas. SELECT picks columns, ORDER BY sorts, GROUP BY groups.

✍️ Practice

Write a SELECT query (on paper or in SQLite) that returns the product and amount of all sales over 200, sorted from biggest to smallest.
Write a GROUP BY query that returns the average amount per product, then say which Pandas line does the same thing.

🏠 Homework

Create a small SQLite database file with a sales table (a few rows), then use pd.read_sql to run a GROUP BY query and load the result into a DataFrame. Print the DataFrame and write one sentence about what it shows.