Artificial Intelligence PL - 25 Aug 2025

How an LLM Reads a Sentence: An In-Depth Look at Tokenization

LLM tokenisation is the fundamental process that determines how an AI reads a sentence. By breaking down text into smaller units, this mechanism allows models to process complex linguistic data efficiently.

Language models such as GPT, BERT, or Llama do not process text the same way humans do. Before a single neuron is activated, a tokeniser must convert the text into numerical units the model can work with: tokens.

Behind this seemingly simple step lies a major challenge for performance, generalisation, and multilingual support. This article provides a technical deep dive into how modern tokenisers work, from Byte-Pair Encoding to SentencePiece.

The Essential Role of LLM Tokenisation

Language models do not manipulate raw text directly; they operate on tokens, the elementary units of text that are mapped to integers. These tokens can represent:

whole words ("cat")
sub-words ("ation" in "ration")
individual characters ("x", "@")
marked sub-word or byte-level pieces ("ĠThe" in byte-level BPE, "##ing" in WordPiece)

Key Takeaway: A token is not necessarily a word; it can be a complete word, a sub-word, a character, or even a sequence of bytes.

Example (illustrative; the exact split depends on the trained vocabulary):

"internationalization" → ["international", "##ization"] (WordPiece)
"internationalization" → ["inter", "national", "ization"] (BPE)

The text is broken down into tokens, and each token is then mapped to an integer ID using a learned vocabulary.
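The mapping from tokens to integer IDs is a plain dictionary lookup. The sketch below uses a hypothetical four-entry vocabulary (the pieces and IDs are invented for illustration); real vocabularies hold tens of thousands of entries learned from a corpus.

```python
# Toy vocabulary: pieces and IDs are hypothetical, chosen only for illustration.
vocab = {"[UNK]": 0, "inter": 1, "national": 2, "ization": 3}
id_to_token = {idx: tok for tok, idx in vocab.items()}

def encode(tokens):
    """Map each token to its ID, falling back to [UNK] for unknown pieces."""
    return [vocab.get(tok, vocab["[UNK]"]) for tok in tokens]

def decode(ids):
    """Map IDs back to their token strings."""
    return [id_to_token[i] for i in ids]

ids = encode(["inter", "national", "ization"])
print(ids)          # [1, 2, 3]
print(decode(ids))  # ['inter', 'national', 'ization']
```

The model itself only ever sees the list of integers; the strings exist purely for the tokeniser and for human inspection.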

Tokenisation: Why do we tokenise text?

To keep the vocabulary bounded, typically tens of thousands of entries (up to a few hundred thousand for multilingual models)
To handle out-of-vocabulary (OOV) words by decomposing them into known pieces
To enable linguistically robust and multilingual processing
To maintain manageable sequence lengths (due to the quadratic cost of the attention mechanism)

Let's look at the following sentence: "I'm testing tokenization algorithms." Typical segmentations are shown below (exact pieces depend on the tokeniser version):

Model/Tokeniser	Result
GPT-2 (BPE)	['I', "'m", ' testing', ' token', 'ization', ' algorithms', '.']
BERT (WordPiece)	['[CLS]', 'i', "'", 'm', 'testing', 'token', '##ization', 'algorithms', '.', '[SEP]']
Llama (SentencePiece)	['▁I', "'", 'm', '▁testing', '▁token', 'ization', '▁algorithms', '.']

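The ▁ character (U+2581, which only looks like an underscore) in SentencePiece marks a preceding space, and therefore the start of a word.
The "##" prefix in WordPiece marks a piece that continues the current word rather than starting a new one.
GPT-2 attaches the leading space to the token (' testing'); its vocabulary displays that space as Ġ.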
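These three marker conventions are what make detokenisation possible. The helper functions below are a minimal sketch with hypothetical names; in particular, the GPT-2 helper only undoes the space marker, whereas a complete decoder would reverse the full byte-to-character table.

```python
def join_wordpiece(tokens):
    """Glue '##' continuation pieces onto the preceding piece."""
    words = []
    for tok in tokens:
        if tok.startswith("##") and words:
            words[-1] += tok[2:]
        else:
            words.append(tok)
    return words

def join_sentencepiece(tokens):
    """Concatenate pieces and turn the U+2581 marker back into spaces."""
    return "".join(tokens).replace("\u2581", " ").lstrip(" ")

def join_gpt2(tokens):
    """Simplified: only maps the U+0120 space marker back to a space."""
    return "".join(tokens).replace("\u0120", " ")

print(join_wordpiece(["testing", "token", "##ization"]))
print(join_sentencepiece(["\u2581testing", "\u2581token", "ization"]))
print(join_gpt2(["The", "\u0120cat", "\u0120sat", "."]))
```

Note that WordPiece only recovers words, not the original spacing, while the other two schemes keep the whitespace inside the tokens themselves.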

Overview of primary tokenisation methods

✅ BPE (Byte-Pair Encoding)

Used by: GPT-2, GPT-3, GPT-J, CLIP…
Principle: Starts with a minimal vocabulary (often characters or byte-level tokens) and learns to merge the most frequent pairs of symbols found in the corpus.

Process:

Decompose each word into characters (or bytes)
Count every adjacent pair of symbols across the corpus
Merge the most frequent pair into a single new token
Repeat until the target vocabulary size is reached (typically 30k to 50k for the models above; newer models often use 100k or more)

Simplified example with "lower":
Initial: ['l', 'o', 'w', 'e', 'r']
Merge: ['l', 'ow', 'e', 'r']
Final merge: ['l', 'ow', 'er']
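The training loop described above can be sketched in a few lines of plain Python. This is a minimal illustration on a hypothetical toy corpus (the word counts are invented), not the implementation used by GPT-2; production tokenisers add byte-level handling, pre-tokenisation, and far faster data structures.

```python
from collections import Counter

def train_bpe(word_freqs, num_merges):
    """Learn `num_merges` BPE merges from a {word: count} dictionary."""
    corpus = {tuple(word): freq for word, freq in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        # Count every adjacent symbol pair, weighted by word frequency.
        pairs = Counter()
        for symbols, freq in corpus.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # ties: first pair encountered
        merges.append(best)
        # Rewrite each word, replacing the best pair with one merged symbol.
        new_corpus = {}
        for symbols, freq in corpus.items():
            out, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    out.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            key = tuple(out)
            new_corpus[key] = new_corpus.get(key, 0) + freq
        corpus = new_corpus
    return merges, corpus

toy_corpus = {"low": 5, "lower": 2, "newest": 6, "widest": 3}
merges, segmented = train_bpe(toy_corpus, num_merges=3)
print(merges)  # [('e', 's'), ('es', 't'), ('l', 'o')]
```

The ordered list of merges is the trained model: to tokenise new text, the same merges are replayed in the same order.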

Advantages

Excellent trade-off between granularity and compactness
Robust against neologisms, typos, and proper nouns
Stable vocabulary once trained

Disadvantages

Dependent on the training corpus
Lacks linguistic intuition
Segmentation can sometimes be counter-intuitive

✅ WordPiece (BERT)

Used by: BERT, DistilBERT, ELECTRA… (RoBERTa uses byte-level BPE and ALBERT uses SentencePiece)
Principle: Inspired by BPE, but merges are chosen by a likelihood criterion rather than by raw frequency.

Unlike BPE, which merges the most frequent pair, WordPiece merges the pair that most increases the likelihood of the training corpus under a language model over the pieces.
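One commonly described formulation of this criterion scores each candidate pair as freq(ab) / (freq(a) × freq(b)), which favours pairs whose parts rarely occur apart. The sketch below assumes that formulation and a hypothetical toy corpus; it is not a reproduction of the original WordPiece trainer.

```python
from collections import Counter

def wordpiece_pair_scores(word_freqs):
    """Return ({pair: score}, {pair: count}) for a {word: count} corpus."""
    symbol_freq = Counter()
    pair_freq = Counter()
    for word, freq in word_freqs.items():
        # First character stands alone; the rest carry the '##' prefix.
        symbols = [word[0]] + ["##" + ch for ch in word[1:]]
        for sym in symbols:
            symbol_freq[sym] += freq
        for a, b in zip(symbols, symbols[1:]):
            pair_freq[(a, b)] += freq
    scores = {
        (a, b): count / (symbol_freq[a] * symbol_freq[b])
        for (a, b), count in pair_freq.items()
    }
    return scores, pair_freq

toy_corpus = {"hug": 10, "pug": 5, "pun": 12, "bun": 4, "hugs": 5}
scores, pair_freq = wordpiece_pair_scores(toy_corpus)

print(max(pair_freq, key=pair_freq.get))  # ('##u', '##g'): most frequent pair
print(max(scores, key=scores.get))        # ('##g', '##s'): best likelihood score
```

On this corpus, frequency alone would merge ('##u', '##g'), whereas the score prefers ('##g', '##s') because '##s' almost never appears without '##g' in front of it.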

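Comparative example (illustrative; a large pretrained vocabulary may keep frequent words such as these as single tokens):

Word	WordPiece tokens
Playing	['play', '##ing']
Football	['foot', '##ball']
Again	['again']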

Practical implementation with Hugging Face

Python Example:

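from transformers import BertTokenizer

# Downloads the pretrained WordPiece vocabulary on first use
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

# tokenize() returns the sub-word strings, without [CLS] and [SEP]
tokens = tokenizer.tokenize("Unbelievable results!")
print(tokens)
# Expected shape of the output (exact pieces depend on the vocabulary):
# ['un', '##bel', '##iev', '##able', 'results', '!']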
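At inference time, WordPiece splits each word greedily, always taking the longest piece found in the vocabulary. The sketch below shows that longest-match-first strategy with a hypothetical six-entry vocabulary; the function name and vocabulary are illustrative and not part of the transformers API.

```python
def wordpiece_tokenize(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first split of a single word."""
    tokens, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        # Shrink the candidate from the right until it is in the vocabulary.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces are prefixed
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk]  # no piece fits: the whole word becomes unknown
        tokens.append(piece)
        start = end
    return tokens

toy_vocab = {"un", "##bel", "##iev", "##able", "results", "!"}
print(wordpiece_tokenize("unbelievable", toy_vocab))
# ['un', '##bel', '##iev', '##able']
```

Because the search is greedy, the result is not guaranteed to be the most probable split, only the one built from the longest available prefixes.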

BERT Specifics

[CLS]: Special token added at the beginning of a sequence (its final hidden state is used for classification)
[SEP]: Separator token placed after each segment, for example between the two sentences in Next Sentence Prediction
Casing: Uncased models convert text to lowercase and remove accents; cased models preserve them

✅ SentencePiece

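Used by: Llama 1 and 2, T5, mT5, XLNet, ALBERT, MarianMT, etc. (ByT5 is token-free: it reads raw UTF-8 bytes and needs no SentencePiece model)
Principle: SentencePiece is a self-contained tokeniser trained directly on raw text, with no language-specific pre-tokenisation. It does not assume that words are separated by spaces.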

How it works

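It makes no assumptions about space separation (essential for languages written without spaces between words, such as Chinese, Japanese, or Thai)
It treats the input as a sequence of Unicode characters, with whitespace escaped as ▁; an optional byte fallback covers characters missing from the vocabulary
It implements two main sub-word algorithms:
- Unigram Language Model (the SentencePiece default, used by T5)
- BPE (selected with model_type=bpe)
Unigram training starts from a large candidate vocabulary and repeatedly prunes the pieces whose removal reduces the corpus likelihood the least, until the target size is reached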
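With a Unigram model, tokenising a string means finding the split whose pieces have the highest combined probability, which is a small dynamic-programming (Viterbi) search. The sketch below assumes a hypothetical vocabulary with made-up probabilities; it illustrates the idea and is not SentencePiece's own code.

```python
import math

def unigram_segment(text, log_probs):
    """Return the most probable split of `text` into known pieces."""
    n = len(text)
    # best[i] = (best log-probability of text[:i], start of its last piece)
    best = [(-math.inf, 0)] * (n + 1)
    best[0] = (0.0, 0)
    for end in range(1, n + 1):
        for start in range(end):
            piece = text[start:end]
            if piece in log_probs and best[start][0] > -math.inf:
                score = best[start][0] + log_probs[piece]
                if score > best[end][0]:
                    best[end] = (score, start)
    if best[n][0] == -math.inf:
        raise ValueError("text cannot be segmented with this vocabulary")
    # Follow the back-pointers from the end of the string.
    pieces, end = [], n
    while end > 0:
        start = best[end][1]
        pieces.append(text[start:end])
        end = start
    return pieces[::-1]

# Hypothetical piece probabilities, for illustration only.
probs = {"\u2581token": 0.05, "ization": 0.03, "\u2581to": 0.04,
         "ken": 0.01, "iz": 0.005, "ation": 0.03}
log_probs = {piece: math.log(p) for piece, p in probs.items()}

print(unigram_segment("\u2581tokenization", log_probs))
# ['▁token', 'ization']
```

Here the split into two pieces wins because its probability (0.05 × 0.03) is higher than that of any three-piece alternative.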

Example with "The cat sat.":

['▁The', '▁cat', '▁sat', '.']

The ▁ character represents a space integrated into the token, allowing for the exact reconstruction of the original text.

Multilingual Example

Japanese phrase: "私はAIが好きです" ("I like AI")
→ ['▁私', 'は', 'AI', 'が', '好', 'き', 'です'] (illustrative; the actual pieces depend on the trained model)

Advantages

Language-independent, regardless of writing conventions
Handles languages without spaces, with no separate word segmenter
Lossless round trip from tokens back to text (after normalisation)

Disadvantages

Tokens are less human-readable
Segmentation can sometimes be counter-intuitive
Unigram LM is more computationally expensive to train

Practical implementation with SentencePiece

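import sentencepiece as spm

# "m.model" is a placeholder path to a trained SentencePiece model
sp = spm.SentencePieceProcessor()
sp.load("m.model")

# out_type=str returns the pieces; out_type=int returns their IDs
tokens = sp.encode("The cat sat.", out_type=str)
print(tokens)
# Possible output (depends on the model):
# ['▁The', '▁cat', '▁sat', '.']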

Typical Use Cases

Model	Method Used	Characteristics
T5	Unigram LM	Full text-to-text, vocabulary of about 32k pieces
Llama 1 and 2	BPE in SentencePiece	32k vocabulary with byte fallback
mT5	Unigram LM	101 languages, vocabulary of about 250k pieces
ByT5	Byte-level (no SentencePiece model)	Operates directly on UTF-8 bytes (maximum robustness, longer sequences)
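The byte-level row is the simplest scheme to reproduce: every UTF-8 byte becomes one token. The sketch below assumes an offset of 3 to leave room for three reserved special-token IDs, which mirrors the convention ByT5 is understood to use; treat the offset and the function names as assumptions.

```python
def byte_encode(text, offset=3):
    """One ID per UTF-8 byte; `offset` reserves the lowest IDs (assumed: 3)."""
    return [b + offset for b in text.encode("utf-8")]

def byte_decode(ids, offset=3):
    """Inverse of byte_encode."""
    return bytes(i - offset for i in ids).decode("utf-8")

ids = byte_encode("cat")
print(ids)               # [102, 100, 119]
print(byte_decode(ids))  # cat

# Non-ASCII text costs several tokens per character.
print(len(byte_encode("私はAI")))  # 8
```

Nothing can ever be out of vocabulary, but the price is visibly longer sequences, especially for scripts whose characters take three bytes each.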

Edge Cases in Tokenisation

Consider a variable identifier or a rare name (the segmentations below are indicative):

text = "z3r0C00l@#"

GPT-2:

['z', '3', 'r', '0', 'C', '00', 'l', '@', '#']

BERT:

['z', '##3', '##r', '##0', '##c', '##00', '##l', '@', '#']

The more unfamiliar a string is, the more it is fragmented, which affects:

Sequence length (and consequently the quadratic cost of the attention mechanism)
Semantic reasoning capabilities of the model
Cross-attention in encoder-decoder architectures

Impact on Performance

Overly granular tokenisation: Leads to longer sequences and increased memory consumption
Overly coarse tokenisation: Results in loss of generalisation and an increase in OOV (Out-Of-Vocabulary) instances
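A quick calculation makes the cost of overly granular tokenisation concrete. The sketch below compares character-level tokens with an assumed sub-word count of 8 (the length of the SentencePiece segmentation of the example sentence earlier in the article) and counts the query-key pairs one self-attention head has to score.

```python
def attention_pairs(seq_len: int) -> int:
    """Number of query-key pairs scored by one self-attention head."""
    return seq_len * seq_len

sentence = "I'm testing tokenization algorithms."
char_level = len(sentence)  # one token per character: 36
subword_level = 8           # assumed sub-word token count for the same sentence

print(attention_pairs(char_level))     # 1296
print(attention_pairs(subword_level))  # 64
print(attention_pairs(char_level) / attention_pairs(subword_level))  # 20.25
```

A sequence that is 4.5 times longer therefore costs about 20 times more attention computation, before accounting for memory.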

Newer tokenisers, such as those shipped in tiktoken (OpenAI's fast BPE library), tune this trade-off by combining byte-level BPE with Unicode-aware pre-splitting rules and larger vocabularies.
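Pre-splitting means cutting the text into chunks with a regular expression before any BPE merge is applied, so that merges never cross a word or punctuation boundary. The sketch below approximates the GPT-2 style pattern with the standard re module; the original relies on Unicode property classes (\p{L}, \p{N}) from the third-party regex package, so this version is an approximation and may differ on unusual characters.

```python
import re

# Approximation of a GPT-2 style pre-tokenisation pattern using stdlib `re`.
PRETOKENIZE = re.compile(
    r"'s|'t|'re|'ve|'m|'ll|'d"   # common English contractions
    r"| ?[^\W\d_]+"              # optional space + letters
    r"| ?\d+"                    # optional space + digits
    r"| ?(?:[^\s\w]|_)+"         # optional space + punctuation/symbols
    r"|\s+(?!\S)|\s+"            # remaining whitespace
)

def pretokenize(text):
    """Split text into chunks; BPE merges are then applied inside each chunk."""
    return PRETOKENIZE.findall(text)

print(pretokenize("I'm testing tokenization algorithms."))
# ['I', "'m", ' testing', ' tokenization', ' algorithms', '.']
```

The leading space stays attached to the following word, which is why GPT-2 tokens such as ' testing' carry their own whitespace.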

Tokenisation – Python Example

from transformers import GPT2TokenizerFast

# Loads GPT-2's byte-level BPE vocabulary and merge rules
tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
tokens = tokenizer.tokenize("The cat sat.")
print(tokens)
# ['The', 'Ġcat', 'Ġsat', '.']

The Ġ character is how GPT-2's byte-level BPE displays a leading space, so 'Ġcat' stands for " cat". The merges in the vocabulary were learned from frequent byte sequences in the training corpus.
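The Ġ symbol comes from a fixed table that assigns a printable character to each of the 256 byte values, so that spaces and control bytes can be stored in a text vocabulary file. The sketch below rebuilds such a table in the style of GPT-2's byte-to-unicode mapping; it is a reconstruction for illustration and should be checked against the reference code before being relied on.

```python
def bytes_to_unicode():
    """Map each byte value (0-255) to a printable Unicode character."""
    # Bytes that are already printable keep their own character.
    printable = (
        list(range(ord("!"), ord("~") + 1))
        + list(range(ord("¡"), ord("¬") + 1))
        + list(range(ord("®"), ord("ÿ") + 1))
    )
    byte_values = printable[:]
    code_points = printable[:]
    shift = 0
    for b in range(256):
        if b not in printable:
            # Remaining bytes are moved to code points 256, 257, ...
            byte_values.append(b)
            code_points.append(256 + shift)
            shift += 1
    return {b: chr(cp) for b, cp in zip(byte_values, code_points)}

table = bytes_to_unicode()
print(table[ord(" ")])                                     # Ġ
print("".join(table[b] for b in " cat".encode("utf-8")))   # Ġcat
```

The space byte (32) is the 33rd non-printable value, so it lands on code point 288, which is Ġ.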

Conclusion

Tokenisation is far more than mere text pre-processing: it is the fundamental gateway to a language model. The choice of algorithm (BPE, WordPiece, SentencePiece) directly influences how the model perceives and encodes information.