
Natural Language Processing: Turning Text into Numbers

Computers cannot read words — NLP is the bag of tricks that turns text into numbers a model can learn from.

What you will learn

Explain why text must become numbers
Tokenize a sentence and build a bag-of-words
Define embeddings and run a tiny sentiment example

The core problem: words are not numbers

Natural Language Processing (NLP) is the field of AI that works with human language — text and speech. It powers spam filters, translation, search, voice assistants and chatbots. But there is one big obstacle: machine-learning models only understand numbers, and language is words. So almost everything in NLP is about one task — turning text into numbers the model can learn from.

Step 1 — tokenizing (chop text into pieces)

Tokenizing means breaking text into small units called tokens — usually words. It is the first step in nearly every NLP system.

Tokenizing: chop a sentence into a list of word-tokens

sentence = "I love this movie"
tokens = sentence.lower().split()      # split on spaces
print(tokens)

Note: Output:
['i', 'love', 'this', 'movie']
Four words became four tokens. Real tokenizers are smarter (they handle punctuation and split rare words into pieces), but the idea is the same: break text into countable units.
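To see why punctuation matters, here is a minimal sketch of a slightly smarter tokenizer. The `tokenize` helper and its regular expression are illustrative assumptions, not what any particular library does; the sketch only shows that a plain `split()` leaves punctuation glued to words.

```python
import re

def tokenize(text):
    # Lowercase, then keep runs of letters, digits and apostrophes;
    # everything else (spaces, commas, "!") acts as a separator.
    return re.findall(r"[a-z0-9']+", text.lower())

print("I love this movie!".lower().split())   # "movie!" keeps its "!"
print(tokenize("I love this movie!"))         # punctuation is stripped
print(tokenize("Don't stop, it's great."))    # apostrophes stay inside words
```

With plain `split()`, “movie!” and “movie” would count as two different words, which is why real tokenizers separate punctuation first.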

Step 2 — bag-of-words (count the tokens)

The simplest way to turn tokens into numbers is a bag-of-words: list every word in your vocabulary, then for each sentence count how many times each word appears. Order is ignored — you just throw the words in a bag and count.

Bag-of-words: each sentence becomes a row of word-counts

from sklearn.feature_extraction.text import CountVectorizer

reviews = ["I love this movie", "I hate this movie"]
vec = CountVectorizer()
counts = vec.fit_transform(reviews)

print(vec.get_feature_names_out())   # the vocabulary
print(counts.toarray())              # each review as word-counts

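Note: Output:
['hate' 'love' 'movie' 'this']
[[0 1 1 1]
 [1 0 1 1]]
The vocabulary has 4 words, listed alphabetically. “I” is missing because CountVectorizer lowercases the text and, by default, skips one-letter words. Each column matches one vocabulary word: review 1 (“I love this movie”) has hate=0, love=1; review 2 flips that. Now each sentence is a row of numbers, which is exactly what a model can train on.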
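The counting itself can be done by hand, which makes it clearer what the vectorizer is doing. This is a simplified sketch, not scikit-learn's implementation: the `tokenize` helper is an assumption that imitates the default behaviour (lowercase, keep words of two or more characters).

```python
import re

reviews = ["I love this movie", "I hate this movie"]

def tokenize(text):
    # Lowercase and keep words of 2+ characters, so "i" is dropped
    return re.findall(r"\b\w\w+\b", text.lower())

tokenized = [tokenize(review) for review in reviews]

# Vocabulary: every distinct token, in alphabetical order
vocab = sorted({tok for toks in tokenized for tok in toks})

# One row per review: how many times each vocabulary word appears
rows = [[toks.count(word) for word in vocab] for toks in tokenized]

print(vocab)   # ['hate', 'love', 'movie', 'this']
print(rows)    # [[0, 1, 1, 1], [1, 0, 1, 1]]
```

The rows match the CountVectorizer output for these two reviews: a vocabulary, then one count per vocabulary word.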

Watch out: Bag-of-words throws away word order, so “dog bites man” and “man bites dog” look identical to it. That loss of meaning is a real limitation — and the reason smarter methods (embeddings and transformers) were invented.
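A quick check makes the limitation concrete. This small sketch uses Python's built-in `Counter` as a stand-in for a bag-of-words; the variable names are illustrative.

```python
from collections import Counter

a = "dog bites man".split()
b = "man bites dog".split()

print(a == b)                     # False: as ordered lists, they differ
print(Counter(a) == Counter(b))   # True: as bags of counts, they are identical
```

Once the sentences are reduced to counts, nothing is left to say who bit whom.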

Step 3 — embeddings (numbers that capture meaning)

A better trick is word embeddings: each word is turned into a list of numbers (a vector), learned so that words with similar meanings get similar numbers. “king” and “queen” land close together; “king” and “banana” land far apart. Embeddings are how modern NLP captures meaning, not just word counts, and they are the foundation under translation and LLMs.

Embeddings place similar-meaning words close together in number-space

# Cartoon embeddings: each word is a tiny vector of "meaning" numbers
embeddings = {
    'king':   [0.9, 0.7],
    'queen':  [0.9, 0.6],
    'banana': [0.1, 0.2],
}

def distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

print('king–queen :', round(distance(embeddings['king'], embeddings['queen']), 2))
print('king–banana:', round(distance(embeddings['king'], embeddings['banana']), 2))

Note: Output:
king–queen : 0.1
king–banana: 0.94
“king” and “queen” are close (distance 0.1) because we gave them similar numbers, standing in for their similar meanings; “king” and “banana” are far apart (0.94). Real embeddings use hundreds of numbers and are learned from billions of words, but the idea is the same.
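One common use of embeddings is finding the word closest to a given word. The sketch below reuses the same cartoon vectors; the `nearest` helper is a hypothetical name introduced for illustration, and `math.dist` (Python 3.8+) computes the same straight-line distance as the `distance` function in the example.

```python
import math

# Same cartoon embeddings as in the example (made-up numbers)
embeddings = {
    'king':   [0.9, 0.7],
    'queen':  [0.9, 0.6],
    'banana': [0.1, 0.2],
}

def nearest(word):
    # Compare the word against every other word and keep the closest one
    others = (w for w in embeddings if w != word)
    return min(others, key=lambda w: math.dist(embeddings[word], embeddings[w]))

print(nearest('king'))     # queen
print(nearest('banana'))   # queen (closest of a bad lot: about 0.89 away)
```

The second result is a useful caution: “nearest” only means closest among the words available, so with a three-word vocabulary even “banana” has to pick a royal neighbour.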

A real NLP task: sentiment analysis

Sentiment analysis decides whether a piece of text is positive or negative. With bag-of-words plus the supervised pipeline from Unit 4, we can build one in a few lines — it is the spam classifier wearing a different hat.

Sentiment analysis = text → numbers → the same supervised classifier

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

texts  = ['i love this', 'great film', 'i hate this', 'terrible movie']
labels = ['positive',    'positive',   'negative',    'negative']

vec = CountVectorizer()
X = vec.fit_transform(texts)
model = MultinomialNB().fit(X, labels)

print(model.predict(vec.transform(['i love it']))[0])
print(model.predict(vec.transform(['hate it']))[0])

Note: Output:
positive
negative
The model never saw “i love it” or “hate it” exactly, but it learned that words like “love” and “great” signal positive and “hate” and “terrible” signal negative. Words it never saw in training, such as “it”, are simply ignored at prediction time. Same five-step ML pipeline; only the data is text.
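To make “it learned which words signal which label” less of a black box, here is a simplified pure-Python sketch of the kind of arithmetic a naive Bayes classifier does: count words per label, then score each label by multiplying word probabilities (added as logs). It is an illustration under stated assumptions, not scikit-learn's code; `tokenize` and `predict` are hypothetical helpers, and the add-one smoothing is chosen to mirror the library's default setting.

```python
import math
import re
from collections import Counter, defaultdict

texts  = ['i love this', 'great film', 'i hate this', 'terrible movie']
labels = ['positive', 'positive', 'negative', 'negative']

def tokenize(text):
    # Lowercase and keep words of 2+ characters (so "i" is dropped)
    return re.findall(r"\b\w\w+\b", text.lower())

# Count how often each word appears under each label
word_counts = defaultdict(Counter)
for text, label in zip(texts, labels):
    word_counts[label].update(tokenize(text))

vocab = {word for counts in word_counts.values() for word in counts}
label_counts = Counter(labels)

def predict(text):
    scores = {}
    for label, counts in word_counts.items():
        total = sum(counts.values())
        # Start from how common the label is in the training data
        score = math.log(label_counts[label] / len(labels))
        for word in tokenize(text):
            if word in vocab:  # words never seen in training are skipped
                # Add-one smoothing: an unseen (word, label) pair is rare, not impossible
                score += math.log((counts[word] + 1) / (total + len(vocab)))
        scores[label] = score
    # The label with the highest score wins
    return max(scores, key=scores.get)

print(predict('i love it'))   # positive
print(predict('hate it'))     # negative
```

For “i love it”, only “love” is in the vocabulary. It appears once among the four positive words and never among the negative ones, so the positive score comes out higher.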

Tip: Remember the through-line: NLP is machine learning applied to language, and the whole trick is turning text into numbers — first by counting (bag-of-words), then by meaning (embeddings). Transformers and LLMs, next, take embeddings much further.

Q. Why must text be converted to numbers before a model can use it?

Answer: Models learn by doing arithmetic, so every input must be numeric. NLP turns text into numbers via tokenizing, counting (bag-of-words) and embeddings.

✍️ Practice

Tokenize the sentence “AI is fun and useful” and count how many tokens it has.
Add the reviews “awful acting” (negative) and “loved every minute” (positive) to the sentiment example and re-test.

🏠 Homework

Explain in your own words why bag-of-words cannot tell “the movie was good, not bad” from “the movie was bad, not good”.