Natural Language Processing: Turning Text into Numbers
Computers cannot read words — NLP is the bag of tricks that turns text into numbers a model can learn from.
What you will learn
- Explain why text must become numbers
- Tokenize a sentence and build a bag-of-words
- Define embeddings and run a tiny sentiment example
The core problem: words are not numbers
Natural Language Processing (NLP) is the field of AI that works with human language — text and speech. It powers spam filters, translation, search, voice assistants and chatbots. But there is one big obstacle: machine-learning models only understand numbers, and language is words. So almost everything in NLP is about one task — turning text into numbers the model can learn from.
Step 1 — tokenizing (chop text into pieces)
Tokenizing means breaking text into small units called tokens — usually words. It is the first step in nearly every NLP system.
sentence = "I love this movie"
tokens = sentence.lower().split() # lowercase, then split on whitespace
print(tokens)
Note: Output: ['i', 'love', 'this', 'movie']
Four words became four tokens. Real tokenizers are smarter (they handle punctuation and split rare words into pieces), but the idea is the same: break text into countable units.
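To see why punctuation matters, here is a minimal sketch of a slightly smarter tokenizer. The function name simple_tokenize and the regular expression are illustrative assumptions, not what any particular library does; the pattern simply keeps runs of letters, digits and apostrophes and drops everything else.

```python
import re

def simple_tokenize(text):
    """Lowercase the text and return runs of letters, digits and apostrophes."""
    return re.findall(r"[a-z0-9']+", text.lower())

sentence = "I love this movie, don't you?"

# Plain split() leaves punctuation glued to the words
print(sentence.lower().split())

# The regex version strips the comma and the question mark
print(simple_tokenize(sentence))
```

With plain split(), "movie," and "movie" would be counted as two different tokens; the regex version treats them as the same word.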
Step 2 — bag-of-words (count the tokens)
The simplest way to turn tokens into numbers is a bag-of-words: list every word in your vocabulary, then for each sentence count how many times each word appears. Order is ignored — you just throw the words in a bag and count.
from sklearn.feature_extraction.text import CountVectorizer
reviews = ["I love this movie", "I hate this movie"]
vec = CountVectorizer()
counts = vec.fit_transform(reviews)
print(vec.get_feature_names_out()) # the vocabulary
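print(counts.toarray()) # each review as word-counts
Note: Output:
['hate' 'love' 'movie' 'this']
[[0 1 1 1]
 [1 0 1 1]]
The vocabulary has 4 words, listed alphabetically, and each column of the count table follows that order. By default CountVectorizer lowercases the text and ignores one-character tokens, which is why “I” does not appear. Review 1 (“I love this movie”) has love=1, hate=0; review 2 flips that. Now each sentence is a row of numbers, exactly what a model can train on.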
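The same counting can be done by hand, which makes it clearer what the vectorizer is doing. This is a rough sketch using only the standard library; the helper names tokenize and bag_of_words are made up for illustration, and the "two or more characters" rule only approximates CountVectorizer's default behaviour.

```python
from collections import Counter

reviews = ["I love this movie", "I hate this movie"]

def tokenize(text):
    # Lowercase, split on whitespace, and keep tokens of 2+ characters
    # (an approximation of CountVectorizer's default, which drops "i")
    return [tok for tok in text.lower().split() if len(tok) >= 2]

# The vocabulary is every distinct token, sorted alphabetically
vocabulary = sorted({tok for review in reviews for tok in tokenize(review)})

def bag_of_words(text):
    # Count the tokens, then read the counts off in vocabulary order
    counts = Counter(tokenize(text))
    return [counts[word] for word in vocabulary]

rows = [bag_of_words(review) for review in reviews]
print(vocabulary)
for row in rows:
    print(row)
```

Each row has one number per vocabulary word, so every review becomes a list of the same length no matter how long the original sentence was.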
Watch out: Bag-of-words throws away word order, so “dog bites man” and “man bites dog” look identical to it. That loss of meaning is a real limitation — and the reason smarter methods (embeddings and transformers) were invented.
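The word-order problem is easy to check for yourself. The sketch below uses a plain Counter as a stand-in for bag-of-words (the helper name bag is an illustrative assumption) and compares the two sentences.

```python
from collections import Counter

def bag(text):
    # A bag-of-words is just a count of each token, with order discarded
    return Counter(text.lower().split())

first = bag("dog bites man")
second = bag("man bites dog")

print(first)
print(second)
print("Same bag?", first == second)
```

The two bags compare as equal even though the sentences mean opposite things, so any model trained on these counts cannot tell them apart.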
Step 3 — embeddings (numbers that capture meaning)
A better trick is word embeddings: each word is turned into a short list of numbers (a vector) learned so that words with similar meanings get similar numbers. “king” and “queen” land close together; “king” and “banana” land far apart. Embeddings are how modern NLP captures meaning, not just word counts — and they are the foundation under translation and LLMs.
# Cartoon embeddings: each word is a tiny vector of "meaning" numbers
embeddings = {
    'king': [0.9, 0.7],
    'queen': [0.9, 0.6],
    'banana': [0.1, 0.2],
}
def distance(a, b):
    # straight-line (Euclidean) distance between two vectors
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5
print('king–queen :', round(distance(embeddings['king'], embeddings['queen']), 2))
print('king–banana:', round(distance(embeddings['king'], embeddings['banana']), 2))
Note: Output:
king–queen : 0.1
king–banana: 0.94
“king” and “queen” are close (distance 0.1) because they mean similar things; “king” and “banana” are far apart (0.94). Real embeddings use hundreds of numbers and are learned from billions of words, but this is exactly the idea.
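Once words have distances, a natural next step is to ask which word is closest to a given one. The sketch below reuses the cartoon embeddings and the same distance formula; the nearest function is a hypothetical helper written for this example, and real systems search over far larger vocabularies.

```python
# Same cartoon embeddings as above (made-up numbers, not learned ones)
embeddings = {
    'king': [0.9, 0.7],
    'queen': [0.9, 0.6],
    'banana': [0.1, 0.2],
}

def distance(a, b):
    # Straight-line (Euclidean) distance between two vectors
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def nearest(word):
    # Compare the word against every other word and keep the closest one
    others = [w for w in embeddings if w != word]
    return min(others, key=lambda w: distance(embeddings[word], embeddings[w]))

print('nearest to king :', nearest('king'))
print('nearest to queen:', nearest('queen'))
```

Here "king" and "queen" pick each other. Note that nearest always returns something, even for a word like "banana" that has no close neighbour in this tiny vocabulary, so a real application would also look at how large the distance is.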
A real NLP task: sentiment analysis
Sentiment analysis decides whether a piece of text is positive or negative. With bag-of-words plus the supervised pipeline from Unit 4, we can build one in a few lines — it is the spam classifier wearing a different hat.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
texts = ['i love this', 'great film', 'i hate this', 'terrible movie']
labels = ['positive', 'positive', 'negative', 'negative']
vec = CountVectorizer()
X = vec.fit_transform(texts)
model = MultinomialNB().fit(X, labels)
print(model.predict(vec.transform(['i love it']))[0])
print(model.predict(vec.transform(['hate it']))[0])
Note: Output:
positive
negative
The model never saw “i love it” or “hate it” exactly, but it learned that words like “love”/“great” signal positive and “hate”/“terrible” signal negative. It is the same five-step ML pipeline; only the data is text.
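To make "it learned which words signal which label" concrete, here is a simplified, hand-rolled word-count classifier in the spirit of Naive Bayes. It is a sketch under stated assumptions, not scikit-learn's implementation: it tokenizes with a plain split() (so "i" is kept), uses add-one smoothing, and the names score and predict are invented for this example.

```python
import math
from collections import Counter

texts = ['i love this', 'great film', 'i hate this', 'terrible movie']
labels = ['positive', 'positive', 'negative', 'negative']
classes = sorted(set(labels))

# Count how often each word appears under each label
word_counts = {label: Counter() for label in classes}
for text, label in zip(texts, labels):
    word_counts[label].update(text.split())

vocabulary = {word for counts in word_counts.values() for word in counts}
label_counts = Counter(labels)

def score(text, label):
    """Log-probability score of `label` for `text`, with add-one smoothing."""
    counts = word_counts[label]
    total = sum(counts.values()) + len(vocabulary)
    # Start from how common the label is, then add evidence word by word
    result = math.log(label_counts[label] / len(labels))
    for word in text.split():
        if word in vocabulary:  # words never seen in training are skipped
            result += math.log((counts[word] + 1) / total)
    return result

def predict(text):
    # Pick the label with the highest score
    return max(classes, key=lambda label: score(text, label))

print(predict('i love it'))
print(predict('hate it'))
```

The word "love" appears only in positive examples, so it raises the positive score more than the negative one; "it" was never seen in training and contributes nothing either way.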
Tip: Remember the through-line: NLP is machine learning applied to language, and the whole trick is turning text into numbers — first by counting (bag-of-words), then by meaning (embeddings). Transformers and LLMs, next, take embeddings much further.
Q. Why must text be converted to numbers before a model can use it?
✍️ Practice
- Tokenize the sentence “AI is fun and useful” and count how many tokens it has.
- Add the reviews “awful acting” (negative) and “loved every minute” (positive) to the sentiment example and re-test.
🏠 Homework
- Explain in your own words why bag-of-words cannot tell “the movie was good, not bad” from “the movie was bad, not good”.