LLM tokenisation is the fundamental process that determines how an AI reads a sentence. By breaking down text into smaller units, this mechanism allows models to process complex linguistic data efficiently.
Language models such as GPT, BERT, or Llama do not process text the same way humans do. Before a single neuron is activated, they must convert text into understandable numerical units: tokens.
Behind this seemingly simple step lies a major challenge for performance, generalisation, and multilingual support. This article provides a technical deep dive into how modern tokenisers work, from Byte-Pair Encoding to SentencePiece.
The Essential Role of LLM Tokenisation
Language models do not manipulate raw text directly; they operate on tokens—the elementary units of text that are converted into integers. These tokens can represent:
- whole words (“cat”)
- sub-words (“ation” in “ration”)
- individual characters (“x”, “@”)
- tokens carrying a boundary marker (“ĠThe” in byte-level BPE, “##ing” in WordPiece)
Example:
“internationalization” → [“inter”, “national”, “ization”] (BPE)
The text is broken down into tokens, and each token is then mapped to an integer ID using a learned vocabulary.
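The token-to-ID mapping described above can be sketched with a plain dictionary. This is a minimal illustration, not how any particular library stores its vocabulary: the vocabulary entries, the `<unk>` fallback, and the `encode`/`decode` helper names are all hypothetical.

```python
# Toy vocabulary: every entry is hypothetical and chosen for illustration only.
vocab = {"<unk>": 0, "inter": 1, "national": 2, "ization": 3}
id_to_token = {i: t for t, i in vocab.items()}

def encode(tokens):
    """Map each token string to its integer ID, falling back to <unk>."""
    return [vocab.get(tok, vocab["<unk>"]) for tok in tokens]

def decode(ids):
    """Map integer IDs back to token strings and concatenate them."""
    return "".join(id_to_token[i] for i in ids)

ids = encode(["inter", "national", "ization"])
print(ids)          # [1, 2, 3]
print(decode(ids))  # internationalization
```

The model itself only ever sees the integer list; the strings exist for the benefit of the tokeniser and the reader.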
Tokenisation: Why do we tokenise text?
- To reduce vocabulary size to a few tens of thousands of entries
- To handle out-of-vocabulary (OOV) words effectively
- To enable linguistically robust and multilingual processing
- To maintain manageable sequence lengths (due to the quadratic cost of the attention mechanism)
Comparison on the sentence "I'm testing tokenization algorithms.":
| Model/Tokeniser | Result |
|---|---|
| GPT-2 (BPE) | ['I', "'m", ' testing', ' token', 'ization', ' algorithms', '.'] |
| BERT (WordPiece) | ['[CLS]', 'i', "'", 'm', 'testing', 'token', '##ization', 'algorithms', '.', '[SEP]'] |
| Llama (SentencePiece) | ['▁I', "'", 'm', '▁testing', '▁token', 'ization', '▁algorithms', '.'] |
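The “##” prefix in WordPiece marks a piece that continues the current word rather than starting a new one.
GPT-2 attaches the preceding space to the token itself (‘ testing’, displayed as ‘Ġtesting’ in the raw vocabulary), so word boundaries survive tokenisation.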
Overview of primary tokenisation methods
BPE (Byte-Pair Encoding)
Used by: GPT-2, GPT-3, GPT-J, CLIP…
Principle: Starts with a minimal vocabulary (often characters or byte-level tokens) and learns to merge the most frequent pairs of symbols found in the corpus.
Process:
- Decompose each word into characters (or bytes)
- Count every adjacent pair of symbols across the corpus
- Merge the most frequent pair into a single new token
- Repeat until the target vocabulary size is reached (typically between 30k and 50k)
Initial: [‘l’, ‘o’, ‘w’, ‘e’, ‘r’]
Merge: [‘l’, ‘ow’, ‘e’, ‘r’]
Final merge: [‘l’, ‘ow’, ‘er’]
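The merge loop can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the GPT-2 implementation: the toy corpus is invented so that the first two merges reproduce the `lower` example above, and the helper names (`get_pair_counts`, `merge_pair`, `train_bpe`) are hypothetical.

```python
from collections import Counter

def get_pair_counts(words):
    """Count adjacent symbol pairs, weighted by how often each word occurs."""
    counts = Counter()
    for symbols, freq in words.items():
        for pair in zip(symbols, symbols[1:]):
            counts[pair] += freq
    return counts

def merge_pair(words, pair):
    """Rewrite every word, fusing each occurrence of `pair` into one symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged

def train_bpe(corpus, num_merges):
    """Learn `num_merges` merge rules from a whitespace-separated corpus."""
    # Start from characters: each word becomes a tuple of single symbols.
    words = {tuple(w): f for w, f in Counter(corpus.split()).items()}
    merges = []
    for _ in range(num_merges):
        counts = get_pair_counts(words)
        if not counts:
            break
        best = max(counts, key=counts.get)  # most frequent adjacent pair
        merges.append(best)
        words = merge_pair(words, best)
    return merges, words

# Invented toy corpus; real training uses a large text collection.
merges, words = train_bpe("low lower lower show show newer newer", num_merges=2)
print(merges)  # [('o', 'w'), ('e', 'r')]
```

After these two merges the word `lower` is stored as `('l', 'ow', 'er')`. A real trainer keeps going until the vocabulary reaches its target size and records the merge order, because encoding new text replays the merges in that same order.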
Advantages:
- Excellent trade-off between granularity and compactness
- Robust against neologisms, typos, and proper nouns
- Stable vocabulary once trained
Limitations:
- Dependent on the training corpus
- Lacks linguistic intuition: merges follow frequency, not morphology
- Segmentation can sometimes be counter-intuitive
WordPiece (BERT)
Used by: BERT, DistilBERT, ELECTRA, MobileBERT…
Principle: Inspired by BPE, but instead of merging the most frequent pair, it merges the pair that most increases the likelihood of the training corpus under the current vocabulary.
Illustrative example (actual splits depend on the trained vocabulary):
| Word | WordPiece tokens |
|---|---|
| Playing | [‘play’, ‘##ing’] |
| Football | [‘foot’, ‘##ball’] |
| Again | [‘again’] |
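At inference time, WordPiece is usually described as a greedy longest-match-first search over the vocabulary. The sketch below follows that description with a tiny hypothetical vocabulary chosen to reproduce the table above; the function name `wordpiece_tokenize` is an assumption, not a library API.

```python
def wordpiece_tokenize(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first split of a single lowercased word."""
    tokens, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        # Try the longest remaining substring first, then shrink it.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # mark a continuation piece
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk]  # no piece fits: the whole word becomes unknown
        tokens.append(piece)
        start = end
    return tokens

# Hypothetical vocabulary, far smaller than a real one.
vocab = {"play", "foot", "again", "##ing", "##ball"}

for word in ["Playing", "Football", "Again"]:
    print(word, wordpiece_tokenize(word.lower(), vocab))
# Playing ['play', '##ing']
# Football ['foot', '##ball']
# Again ['again']
```

With a real vocabulary of around 30k entries, frequent words such as these are often stored whole, which is why the table is best read as an illustration of the mechanism.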
Practical implementation with Hugging Face
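from transformers import BertTokenizer
# Load the pretrained uncased WordPiece vocabulary (downloaded on first use)
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
# tokenize() returns sub-word strings; special tokens are not added here
tokens = tokenizer.tokenize("Unbelievable results!")
print(tokens)
# Example output; the exact pieces depend on the vocabulary:
# ['un', '##bel', '##iev', '##able', 'results', '!']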
BERT Specifics
- [CLS]: Special token added at the beginning of a sequence (used for classification)
- [SEP]: Separator token between two sentences (used for Next Sentence Prediction)
- Casing: Uncased models convert text to lowercase and remove accents; cased models preserve them
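The three points above can be sketched without any library. This is a rough approximation under stated assumptions: `normalize_uncased` and `build_pair` are hypothetical helper names, and the segment IDs (0 for the first sentence, 1 for the second) reflect the usual BERT convention rather than anything stated in this article.

```python
import unicodedata

def normalize_uncased(text):
    """Approximate uncased preprocessing: lowercase, then strip accents."""
    decomposed = unicodedata.normalize("NFD", text.lower())
    # Drop combining marks (category Mn), which carry the accents after NFD.
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")

def build_pair(tokens_a, tokens_b):
    """Wrap two token lists with [CLS] and [SEP] and tag each segment."""
    tokens = ["[CLS]"] + tokens_a + ["[SEP]"] + tokens_b + ["[SEP]"]
    segment_ids = [0] * (len(tokens_a) + 2) + [1] * (len(tokens_b) + 1)
    return tokens, segment_ids

print(normalize_uncased("Café Olé"))  # cafe ole

tokens, segment_ids = build_pair(["the", "cat"], ["it", "sat"])
print(tokens)       # ['[CLS]', 'the', 'cat', '[SEP]', 'it', 'sat', '[SEP]']
print(segment_ids)  # [0, 0, 0, 0, 1, 1, 1]
```

In practice the Hugging Face tokeniser performs both steps internally when it is called on a sentence pair, so this code is only useful for understanding what the special tokens look like in the final sequence.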
SentencePiece
Used by: Llama (1 and 2), T5, mT5, XLNet, ALBERT, MarianMT, etc.
Principle: SentencePiece is a self-contained tokeniser trained directly on raw text. It treats whitespace as an ordinary symbol rather than assuming that words are separated by spaces.
How it works
- It makes no assumptions about space separation (essential for languages written without spaces, such as Japanese, Chinese, or Thai)
- It operates on Unicode characters, with optional byte fallback for characters missing from the vocabulary
- It implements two algorithms:
  - Unigram Language Model (the library default, used by T5)
  - BPE (optional, selected at training time)
- The Unigram model starts from a large candidate vocabulary and repeatedly prunes the pieces whose removal costs the least corpus likelihood, until the target vocabulary size is reached
Example output for “The cat sat.”:
['▁The', '▁cat', '▁sat', '.']
The ▁ character represents a space integrated into the token, allowing for the exact reconstruction of the original text.
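The reversibility property can be demonstrated with a toy round trip. This sketch only splits at spaces, whereas a trained SentencePiece model splits much further; the names `META`, `to_pieces`, and `from_pieces` are hypothetical, and the leading ▁ on the first piece mirrors the library's default dummy-prefix behaviour.

```python
META = "\u2581"  # the '▁' character used as a visible space marker

def to_pieces(text):
    """Toy segmentation: replace spaces with ▁ and cut before each marker."""
    marked = META + text.replace(" ", META)
    pieces = []
    for ch in marked:
        if ch == META or not pieces:
            pieces.append(ch)   # a marker always opens a new piece
        else:
            pieces[-1] += ch    # any other character extends the current piece
    return pieces

def from_pieces(pieces):
    """Detokenise: concatenate, turn ▁ back into spaces, drop the first one."""
    return "".join(pieces).replace(META, " ")[1:]

pieces = to_pieces("The cat sat.")
print(pieces)               # ['▁The', '▁cat', '▁sat.']
print(from_pieces(pieces))  # The cat sat.
```

Because the space is stored inside the pieces, detokenisation is a plain string concatenation with no language-specific rules, and even repeated spaces survive the round trip.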
Multilingual Example
“私はAIが好きです” (“I like AI”) →
['▁私', 'は', 'AI', 'が', '好', 'き', 'です']
Advantages:
- Language-independent, regardless of writing conventions
- Handles languages without spaces natively
- Reversible: the original text can be rebuilt from the tokens
Limitations:
- Tokens are less human-readable
- Segmentation can sometimes be counter-intuitive
- Unigram LM is more computationally expensive to train
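To make the Unigram model more concrete: once each piece has a probability, segmenting a string amounts to finding the split with the highest total probability, which a Viterbi-style dynamic programme does efficiently. The sketch below assumes an invented table of piece probabilities and a hypothetical function name `viterbi_segment`; it covers segmentation only, not training.

```python
import math

def viterbi_segment(text, log_probs):
    """Return the segmentation of `text` with the highest summed log-probability."""
    n = len(text)
    best = [-math.inf] * (n + 1)  # best[i]: best score for text[:i]
    back = [0] * (n + 1)          # back[i]: start index of the last piece
    best[0] = 0.0
    for end in range(1, n + 1):
        for start in range(end):
            piece = text[start:end]
            if piece in log_probs:
                score = best[start] + log_probs[piece]
                if score > best[end]:
                    best[end] = score
                    back[end] = start
    if best[n] == -math.inf:
        return None  # the text cannot be covered by the known pieces
    pieces, end = [], n
    while end > 0:
        start = back[end]
        pieces.append(text[start:end])
        end = start
    return pieces[::-1]

# Invented probabilities, for illustration only.
probs = {"▁token": 0.05, "ization": 0.04, "▁to": 0.06,
         "ken": 0.01, "iza": 0.005, "tion": 0.03}
log_probs = {piece: math.log(p) for piece, p in probs.items()}

print(viterbi_segment("▁tokenization", log_probs))  # ['▁token', 'ization']
```

The two-piece split wins because its product of probabilities (0.05 × 0.04) is larger than that of any three-piece alternative in this table.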
Practical implementation with SentencePiece
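import sentencepiece as spm
# "m.model" is a placeholder path to a trained SentencePiece model file
sp = spm.SentencePieceProcessor()
sp.load("m.model")
# out_type=str returns the pieces as strings rather than integer IDs
tokens = sp.encode("The cat sat.", out_type=str)
print(tokens)
# Example output; the exact pieces depend on the trained model:
# ['▁The', '▁cat', '▁sat', '.']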
Typical Use Cases
| Model | Method Used | Characteristics |
|---|---|---|
| T5 | Unigram LM | Full text-to-text, position agnostic |
| Llama (1 and 2) | BPE in SentencePiece | Long context, multilingual |
| mT5 | Unigram | 101 languages, balanced performance |
| ByT5 | Byte-level (no learned sub-word vocabulary) | Operates directly on UTF-8 bytes (maximum robustness) |
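The byte-level row can be illustrated with the standard library alone. This is a sketch of the general idea, not the ByT5 tokeniser itself: the `OFFSET` of 3, meant to leave room for a few special tokens, is an assumption, and `byte_encode`/`byte_decode` are hypothetical names.

```python
OFFSET = 3  # assumption: the lowest IDs are reserved for special tokens

def byte_encode(text):
    """One token per UTF-8 byte, shifted past the reserved IDs."""
    return [b + OFFSET for b in text.encode("utf-8")]

def byte_decode(ids):
    """Undo the shift and decode the bytes back into text."""
    return bytes(i - OFFSET for i in ids).decode("utf-8")

ids = byte_encode("héllo")
print(len("héllo"), len(ids))  # 5 6  ('é' takes two bytes in UTF-8)
print(byte_decode(ids))        # héllo
```

Nothing is ever out of vocabulary with this scheme, but every non-ASCII character costs several tokens, so sequences grow noticeably longer than with sub-word vocabularies.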
Edge Cases in Tokenisation
Consider a variable identifier or a rare name:
input = "z3r0C00l@#"
GPT-2:
['z', '3', 'r', '0', 'C', '00', 'l', '@', '#']
BERT:
['z', '##3', '##r', '##0', '##c', '##00', '##l', '@', '#']
The more unfamiliar a word is, the more it will be fragmented—which impacts:
- Sequence length (and consequently the quadratic cost of the attention mechanism)
- Semantic reasoning capabilities of the model
- Cross-attention in encoder-decoder architectures
Impact on Performance
- Overly granular tokenisation: Leads to longer sequences and increased memory consumption
- Overly coarse tokenisation: Results in loss of generalisation and an increase in OOV (Out-Of-Vocabulary) instances
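A quick back-of-the-envelope calculation shows why granularity matters. The sketch below compares two extreme segmentations of one invented sentence and counts the token pairs that self-attention would score; `attention_pairs` is a hypothetical helper, and real costs also depend on model width, number of heads, and implementation details.

```python
def attention_pairs(num_tokens):
    """Number of query-key pairs scored by full self-attention."""
    return num_tokens ** 2

text = "tokenisation shapes model cost"
char_level = list(text)    # one token per character
word_level = text.split()  # one token per whitespace-separated word

for name, tokens in [("characters", char_level), ("words", word_level)]:
    print(name, len(tokens), attention_pairs(len(tokens)))
# characters 30 900
# words 4 16
```

Sub-word tokenisers sit between these two extremes: they keep sequences far shorter than character-level input while avoiding the unbounded vocabulary that whole-word tokens would require.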
Tokenisation – Python Example
from transformers import GPT2TokenizerFast
tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
tokens = tokenizer.tokenize("The cat sat.")
print(tokens)
# ['The', 'Ġcat', 'Ġsat', '.']
The Ġ character encodes a space. The vocabulary is trained to recognise frequent patterns within the original corpus.
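The origin of Ġ is easier to see in code. Byte-level BPE as used by GPT-2 maps each of the 256 byte values to a printable Unicode character before merging, and the space byte happens to land on Ġ. The function below is a reconstruction of that mapping written from memory of its published logic, so treat it as a sketch rather than a copy of the reference code.

```python
def bytes_to_unicode():
    """Map every byte value (0-255) to a distinct printable character."""
    # Bytes that already correspond to printable characters keep their identity.
    printable = (
        list(range(ord("!"), ord("~") + 1))
        + list(range(ord("¡"), ord("¬") + 1))
        + list(range(ord("®"), ord("ÿ") + 1))
    )
    mapping = {b: chr(b) for b in printable}
    shift = 0
    for b in range(256):
        if b not in mapping:
            # Whitespace and control bytes are moved to code points above 255.
            mapping[b] = chr(256 + shift)
            shift += 1
    return mapping

mapping = bytes_to_unicode()
visible = "".join(mapping[b] for b in " cat".encode("utf-8"))
print(visible)  # Ġcat
```

Since the mapping is one-to-one, decoding simply inverts it, which is how a byte-level tokeniser can represent any input string without an unknown token.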
Conclusion
Tokenisation is far more than mere text pre-processing: it is the fundamental gateway to a language model. The choice of algorithm (BPE, WordPiece, SentencePiece) directly influences how the model perceives and encodes information.
Understanding how a tokeniser works helps you to:
- Better interpret LLM outputs
- Optimise training and inference
- Adapt the processing pipeline to domain-specific constraints
It is also a critical step for any fine-tuning project or custom LLM pre-training initiative.