Regular Expressions
Search, validate and extract patterns in text with the re module — the standard tool for cleaning data and checking input.
What you will learn
- Understand what a pattern is and why regex exists
- Search and extract text with re.search and re.findall
- Validate input like emails and phone numbers
What is a regular expression?
A regular expression (often shortened to regex) is a small pattern that describes what text should look like — for example “three digits, a dash, then four digits” for a phone number. Instead of writing lots of if checks by hand, you describe the shape once and let Python find, check, or pull out matching text for you.
Why bother? Two everyday jobs make regex worth learning: validation (does this look like a real email address?) and extraction (pull every price, date, or hashtag out of a block of text). Both come up constantly in form handling and data cleaning.
Python’s regex tool lives in the built-in re module, so every program that uses it starts with import re — no installing needed.
The building blocks of a pattern
A pattern is just text, but a few special symbols stand for “any kind of character like this.” Here are the handful you will reach for most often. Learn these and you can read the vast majority of real-world patterns:
| Pattern | Matches | Example |
|---|---|---|
| \d | any single digit 0-9 | \d\d matches 42 |
| \w | a letter, digit or underscore | \w matches a or 7 |
| \s | a space, tab or newline | a gap between words |
| . | any single character (except a newline) | a.c matches abc |
| + | one or more of the thing before it | \d+ matches 2025 |
| * | zero or more of the thing before it | a* matches an empty string or aaa |
| {n} | exactly n of the thing before it | \d{4} matches 2025 |
| [...] | any one character from the set | [aeiou] matches a vowel |
Tip: The backslash pairs (\d, \w, \s) are the workhorses. Read \d{3} as “exactly three digits” and \w+ as “one or more word characters.” That alone unlocks most patterns you will meet.
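To see the symbols from the table in action, here is a small sketch that tries a few of them with re.search (covered in the next section). The sample strings and the names samples and results are made up for illustration.

```python
import re

# (pattern, text) pairs; the sample strings are made up for illustration
samples = [
    (r"\d\d", "room 42"),       # two digits in a row
    (r"\w+", "hello world"),    # one or more word characters
    (r"a.c", "abc"),            # a, then any one character, then c
    (r"\d{4}", "year 2025"),    # exactly four digits
    (r"[aeiou]", "sky blue"),   # any single vowel
]

results = []
for pattern, text in samples:
    match = re.search(pattern, text)
    # keep the matched text, or None when the pattern is not found
    results.append(match.group() if match else None)
    print(pattern, "->", results[-1])
```

Each line of output shows the pattern next to the first piece of text it matched, which is a quick way to check your reading of a symbol.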
search: is the pattern in there?
The most common function is re.search(pattern, text). It scans the text for the first place the pattern appears. If it finds one it returns a match object (a truthy result you can ask questions of); if not, it returns None. Because None is falsy, you can test it straight inside an if.
import re

text = "Order 4521 shipped on 2026-06-13"
match = re.search(r"\d+", text)  # first run of digits
if match:
    print("Found:", match.group())
else:
    print("No number found")

Step by step: r"\d+" is the pattern, meaning “one or more digits.” (The r before the quotes makes it a raw string, which stops Python from treating backslashes specially. Always use r"..." for patterns.) re.search scans text left to right and stops at the first run of digits it finds, which is 4521. The match object is truthy, so the if runs and match.group() gives back the actual matched text. If no digits existed, search would return None and the else would run.
Note: Output: Found: 4521
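The match object can answer more than “what matched?”. As a small sketch using the same text, the start() and end() methods report where the match sits in the string. The variable names here are chosen only for illustration.

```python
import re

text = "Order 4521 shipped on 2026-06-13"
match = re.search(r"\d+", text)

if match:
    matched_text = match.group()  # the text that matched
    start = match.start()         # index where the match begins
    end = match.end()             # index just past the last matched character
    print(matched_text, start, end)
    print(text[start:end])        # slicing the original text gives the same string
```

Positions are handy when you need to cut the matched part out of the original text or highlight it.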
findall: get every match
When you want all matches, not just the first, use re.findall(pattern, text). It returns a plain list of every piece of text the pattern matched — perfect for pulling all the numbers, dates, or words out of a string at once.
import re

text = "Call 0522-445566 or 011-998877 today"
numbers = re.findall(r"\d+", text)
print(numbers)

The same \d+ pattern is applied across the whole string, and findall collects every run of digits into a list: here, the four separate number groups from the two phone numbers. Notice it gives you a normal Python list, so you can loop over it, count it, or process it like any other list.
Note: Output: ['0522', '445566', '011', '998877']
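Because findall returns an ordinary list, the usual list tools apply. The sketch below counts and loops over the results, then tries a slightly more specific pattern (an assumption added here, not part of the example above) that keeps each phone number in one piece.

```python
import re

text = "Call 0522-445566 or 011-998877 today"
numbers = re.findall(r"\d+", text)

count = len(numbers)  # how many runs of digits were found
for number in numbers:
    print(number, "has", len(number), "digits")

# digits, a dash, then digits: each phone number comes back whole
phones = re.findall(r"\d+-\d+", text)
print(phones)

# when nothing matches, findall returns an empty list rather than None
nothing = re.findall(r"\d+", "no digits here")
print(nothing)
```

The empty-list behaviour means you can loop over the result without checking for None first.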
A worked example: validating an email
Validation is the classic regex job. Below we check whether some text looks like an email address. We anchor the pattern with ^ (start of text) and $ (end of text) so the whole string must match, not just part of it.
import re

pattern = r"^\w+@\w+\.\w+$"

def is_email(text):
    return re.search(pattern, text) is not None

print(is_email("asha@example.com"))  # looks valid
print(is_email("not-an-email"))      # no @ or domain

Reading the pattern in plain English: ^\w+ means “start with one or more word characters” (the name), then a literal @, then \w+ (the domain name), then \. which is a real full stop (the backslash escapes the dot so it means an actual ., not “any character”), then \w+$ (the extension, ending the string). The function returns True when re.search finds a match and False when it returns None. So asha@example.com passes, while not-an-email has no @ and fails.
Note: Output: True, then False on the next line
Watch out: This simple email pattern is fine for teaching and basic checks, but real email validation is famously tricky. For production, validate the obvious shape with a simple regex like this and then confirm by sending a verification email — do not try to write one perfect pattern.
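To get a feel for where the simple pattern is too strict or too loose, it helps to run it on a few edge cases. The sketch below is one way to do that. The test strings are made up, and is_email_strict is a hypothetical variant that uses re.fullmatch, which requires the entire string to match and so does not need the ^ and $ anchors.

```python
import re

pattern = r"^\w+@\w+\.\w+$"

def is_email(text):
    return re.search(pattern, text) is not None

def is_email_strict(text):
    # hypothetical variant: fullmatch only succeeds if the whole string matches
    return re.fullmatch(r"\w+@\w+\.\w+", text) is not None

cases = [
    "asha@example.com",        # the basic shape
    "asha@example",            # no dot and extension
    "first.last@example.com",  # a plausible address with a dot in the name
    "asha@example.com\n",      # trailing newline, e.g. from reading a file
]
for case in cases:
    print(repr(case), is_email(case), is_email_strict(case))
```

Two results are worth noticing. The dotted name is rejected by both functions, because \w does not match a full stop, so the pattern is too strict for some real addresses. The string with a trailing newline passes is_email but not is_email_strict, because $ also matches just before a newline at the very end of the text.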
How to approach any pattern
Faced with a new validation or extraction task, work through it in this order:
- Describe the shape in plain words first, for example “four digits, a dash, two digits, a dash, two digits” for a date.
- Translate each part into symbols using the table above: that date becomes \d{4}-\d{2}-\d{2}.
- Wrap the pattern in a raw string: r"\d{4}-\d{2}-\d{2}".
- Pick the function: re.search to check or find the first, re.findall to collect them all.
- Test it on a few real examples, including ones that should fail, to make sure it is neither too strict nor too loose.
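The steps above can be sketched for the date example as follows. The sample text and the test strings are made up for illustration.

```python
import re

# Steps 1-3: "four digits, a dash, two digits, a dash, two digits" as a raw string
date_pattern = r"\d{4}-\d{2}-\d{2}"

# Step 4: findall collects every date; search would return only the first
text = "Ordered 2026-06-10, shipped 2026-06-13"  # made-up sample text
dates = re.findall(date_pattern, text)
print(dates)

# Step 5: try examples that should match and examples that should fail
should_match = ["2026-06-13", "1999-12-31"]
should_fail = ["13-06-2026", "2026/06/13", "20260613"]
for sample in should_match + should_fail:
    print(sample, re.search(date_pattern, sample) is not None)
```

Keep in mind that this pattern checks only the shape: a string such as 2026-99-99 also matches, so a real date check would need a further step beyond the regex.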
Tip: Regex is powerful but quickly becomes unreadable. If a pattern grows long and cryptic, add a comment explaining it in words — or solve the problem with ordinary string methods (.split(), .startswith(), in) when those are clear enough.
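As one possible way to follow this tip, the re module offers the re.VERBOSE flag, which lets a pattern be spread over several lines with a comment beside each part. The sketch below uses it with re.compile (neither is covered elsewhere in this lesson), and then shows a plain string-method check for comparison. The filename value is made up.

```python
import re

# With re.VERBOSE, whitespace inside the pattern is ignored
# and everything after # on a line is treated as a comment.
date_pattern = re.compile(r"""
    \d{4}   # year: exactly four digits
    -       # a literal dash
    \d{2}   # month: exactly two digits
    -       # a literal dash
    \d{2}   # day: exactly two digits
""", re.VERBOSE)

found = date_pattern.search("shipped on 2026-06-13")
print(found.group() if found else None)

# For simple checks, ordinary string methods can be clearer than a pattern
filename = "report_2026.csv"  # made-up example value
is_csv_report = filename.startswith("report_") and filename.endswith(".csv")
print(is_csv_report)
```

The commented pattern matches the same text as r"\d{4}-\d{2}-\d{2}", but a reader can follow it without decoding the symbols.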
Q. What does the pattern \d+ match?
✍️ Practice
- Use re.findall with \d+ to pull every number out of a sentence and print the list.
- Write an is_phone(text) function that returns True only when the text is ten digits.
🏠 Homework
- Given a paragraph of text, use a regex to extract every hashtag (a # followed by word characters) into a list.