AI solutions - 25 Aug 2025

How an LLM Reads a Sentence: An In-Depth Look at Tokenisation

LLM tokenisation is the fundamental process that determines how an AI reads a sentence. By breaking down text into smaller units, this mechanism allows models to process complex linguistic data efficiently.

Language models such as GPT, BERT, or Llama do not process text the same way humans do. Before a single neuron is activated, they must convert text into understandable numerical units: tokens.

Behind this seemingly simple step lies a major challenge for performance, generalisation, and multilingual support. This article provides a technical deep dive into how modern tokenisers work, from Byte-Pair Encoding to SentencePiece.

The Essential Role of LLM Tokenisation

Language models do not manipulate raw text directly; they operate on tokens, the elementary units of text that are converted into integers. These tokens can represent:

  • whole words ("cat")
  • sub-words ("ation" in "ration")
  • individual characters ("x", "@")
  • pieces carrying tokeniser-specific markers ("ĠThe" in byte-level BPE, "##ing" in WordPiece)

Key Takeaway: A token is not necessarily a word; it can be a complete word, a sub-word, a character, or even a combination of bytes.

Example (illustrative splits; the exact pieces depend on the trained vocabulary):

"internationalization" → ["international", "##ization"] (WordPiece)
"internationalization" → ["inter", "national", "ization"] (BPE)

The text is broken down into tokens, and each token is then mapped to an integer ID using a learned vocabulary.
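
As a minimal sketch of that last step, the lookup below maps token strings to integer IDs and back. The four-entry vocabulary and the `[UNK]` fallback are assumptions made for illustration; a real vocabulary is learned from a corpus and is far larger.

```python
# Toy vocabulary (hypothetical); real vocabularies are learned from data.
vocab = {"[UNK]": 0, "inter": 1, "national": 2, "ization": 3}
id_to_token = {idx: tok for tok, idx in vocab.items()}

def encode(tokens):
    """Map each token string to its integer ID, falling back to [UNK]."""
    return [vocab.get(tok, vocab["[UNK]"]) for tok in tokens]

def decode(ids):
    """Map integer IDs back to token strings."""
    return [id_to_token[i] for i in ids]

ids = encode(["inter", "national", "ization"])
print(ids)          # [1, 2, 3]
print(decode(ids))  # ['inter', 'national', 'ization']
```

The model itself only ever sees the integer list; the strings exist purely for the tokeniser and for human inspection.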

Tokenisation: Why do we tokenise text?

  • To reduce vocabulary size to a few tens of thousands of entries
  • To handle out-of-vocabulary (OOV) words effectively
  • To enable linguistically robust and multilingual processing
  • To maintain manageable sequence lengths (due to the quadratic cost of the attention mechanism)
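
To make the sequence-length point concrete, here is a small sketch that compares how many attention score entries a single self-attention head would compute for the same sentence under two extreme granularities. The function name `attention_pairs` and the sample sentence are illustrative assumptions, and the count ignores batching, heads, and layers.

```python
def attention_pairs(seq_len):
    """Number of query-key score entries for one self-attention head."""
    return seq_len * seq_len

sentence = "The cat sat on the mat."

char_len = len(sentence)          # character-level: one token per character
word_len = len(sentence.split())  # whitespace words: one token per word

print(char_len, attention_pairs(char_len))  # 23 529
print(word_len, attention_pairs(word_len))  # 6 36
```

Sub-word tokenisers sit between these two extremes: shorter sequences than characters, without the unbounded vocabulary that whole words would require.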

Let's look at the following sentence: "I'm testing tokenization algorithms."

| Model/Tokeniser | Result |
| --- | --- |
| GPT-2 (BPE) | ['I', "'m", ' testing', ' token', 'ization', ' algorithms', '.'] |
| BERT (WordPiece) | ['[CLS]', 'i', "'", 'm', 'testing', 'token', '##ization', 'algorithms', '.', '[SEP]'] |
| Llama (SentencePiece) | ['▁I', "'", 'm', '▁testing', '▁token', 'ization', '▁algorithms', '.'] |

The ▁ symbol in SentencePiece (U+2581, which only resembles an underscore) marks a word boundary, that is, a preceding space.
The "##" prefix in WordPiece marks a piece that continues the current word.
GPT-2 keeps the leading space inside the token (' testing') to preserve segmentation; its tokeniser displays that space as Ġ. Its pre-tokenisation rules also keep common contractions such as "'m" together.

Overview of primary tokenisation methods

✅ BPE (Byte-Pair Encoding)

Used by: GPT-2, GPT-3, GPT-J, CLIP…
Principle: Starts with a minimal vocabulary (often characters or byte-level tokens) and learns to merge the most frequent pairs of symbols found in the corpus.

Process:

  • Decompose each word into characters (or bytes)
  • Count every adjacent pair of symbols and find the most frequent one
  • Merge that pair into a single new token wherever it occurs
  • Repeat until the target vocabulary size is reached (typically between 30k and 50k)

Simplified example with "lower":
Initial: ['l', 'o', 'w', 'e', 'r']
Merge: ['l', 'ow', 'e', 'r']
Final merge: ['l', 'ow', 'er']
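
The loop described above can be sketched in a few lines. This is a toy character-level trainer, not the code of any particular library; the corpus, its word frequencies, and the helper names (`get_pair_counts`, `merge_pair`, `train_bpe`) are assumptions made for illustration. Ties are broken by first occurrence.

```python
from collections import Counter

def get_pair_counts(words):
    """Count adjacent symbol pairs, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(pair, words):
    """Replace every occurrence of `pair` with the concatenated symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = merged.get(tuple(out), 0) + freq
    return merged

def train_bpe(word_freqs, num_merges):
    """Learn `num_merges` merge rules from a word-frequency dictionary."""
    words = {tuple(word): freq for word, freq in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        pairs = get_pair_counts(words)
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # most frequent pair, first one on ties
        merges.append(best)
        words = merge_pair(best, words)
    return merges, words

# Hypothetical toy corpus: word -> frequency
corpus = {"low": 5, "lower": 2, "newest": 6, "widest": 3}
merges, words = train_bpe(corpus, num_merges=3)
print(merges)  # [('e', 's'), ('es', 't'), ('l', 'o')]
```

The ordered list of merges is the trained model: encoding new text means replaying those merges in the same order.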
Advantages
  • Excellent trade-off between granularity and compactness
  • Robust against neologisms, typos, and proper nouns
  • Stable vocabulary once trained
Disadvantages
  • Dependent on the training corpus
  • Lacks linguistic intuition
  • Segmentation can sometimes be counter-intuitive

✅ WordPiece (BERT)

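Used by: BERT, DistilBERT, ELECTRA… (RoBERTa uses byte-level BPE and ALBERT uses SentencePiece.)
Principle: Inspired by BPE, but merges are chosen by a likelihood criterion rather than by raw pair frequency.

Unlike BPE, which always merges the most frequent pair, WordPiece merges the pair that most increases the probability of the training corpus according to a language model. In practice this favours pairs whose parts rarely occur apart.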
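
A common way to describe this criterion (for example in the Hugging Face course) is the score count(ab) / (count(a) × count(b)). The sketch below applies that formula to a toy corpus and contrasts it with BPE's choice; the formula as a stand-in for the original training procedure, the corpus, and the function name `count_symbols_and_pairs` are all assumptions for illustration.

```python
from collections import Counter

def count_symbols_and_pairs(words):
    """Return frequency-weighted counts of symbols and of adjacent pairs."""
    symbol_counts, pair_counts = Counter(), Counter()
    for symbols, freq in words.items():
        for s in symbols:
            symbol_counts[s] += freq
        for a, b in zip(symbols, symbols[1:]):
            pair_counts[(a, b)] += freq
    return symbol_counts, pair_counts

# Hypothetical toy corpus: word split into WordPiece symbols -> frequency
words = {
    ("h", "##u", "##g"): 10,
    ("p", "##u", "##g"): 5,
    ("p", "##u", "##n"): 12,
    ("b", "##u", "##n"): 4,
    ("h", "##u", "##g", "##s"): 5,
}

symbol_counts, pair_counts = count_symbols_and_pairs(words)

# WordPiece-style score: pair frequency normalised by the frequency of its parts
scores = {
    (a, b): count / (symbol_counts[a] * symbol_counts[b])
    for (a, b), count in pair_counts.items()
}

bpe_choice = max(pair_counts, key=pair_counts.get)
wordpiece_choice = max(scores, key=scores.get)
print(bpe_choice)        # ('##u', '##g')
print(wordpiece_choice)  # ('##g', '##s')
```

BPE picks the pair that is simply most frequent, while the normalised score prefers `##g` + `##s`, because `##s` almost never appears without `##g` before it.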

Comparative example:

| Word | WordPiece tokens |
| --- | --- |
| Playing | ['play', '##ing'] |
| Football | ['foot', '##ball'] |
| Again | ['again'] |
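
At encoding time, BERT-style WordPiece segments each word greedily, always taking the longest piece that exists in the vocabulary. The sketch below reproduces the table above with a hypothetical five-entry vocabulary; the function name `wordpiece_tokenize` and the vocabulary are assumptions, and lowercasing is applied by hand to mimic an uncased model.

```python
def wordpiece_tokenize(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first segmentation of a single word."""
    tokens, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        # Try the longest remaining substring first, then shrink it.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation marker
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk]  # no piece fits: the whole word becomes unknown
        tokens.append(piece)
        start = end
    return tokens

# Hypothetical toy vocabulary
vocab = {"play", "foot", "again", "##ing", "##ball"}

for word in ["Playing", "Football", "Again"]:
    print(word, wordpiece_tokenize(word.lower(), vocab))
# Playing ['play', '##ing']
# Football ['foot', '##ball']
# Again ['again']
```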

Practical implementation with Hugging Face

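Python Example:
from transformers import BertTokenizer

# Downloads the pretrained vocabulary on first use
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
# tokenize() returns sub-word strings without [CLS] or [SEP]
tokens = tokenizer.tokenize("Unbelievable results!")
print(tokens)
# Illustrative output; the exact split depends on the vocabulary:
# ['un', '##bel', '##iev', '##able', 'results', '!']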

BERT Specifics

  • [CLS]: Special token added at the beginning of a sequence (used for classification)
  • [SEP]: Separator token placed at the end of each segment, so a sentence pair reads [CLS] A [SEP] B [SEP] (used for Next Sentence Prediction)
  • Casing: Uncased models convert text to lowercase and remove accents; cased models preserve them
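
The following sketch shows how these conventions fit together for a sentence pair. It is a hand-written approximation rather than the Hugging Face implementation; the helper names `normalize_uncased` and `build_pair_input` and the sample tokens are assumptions.

```python
import unicodedata

def normalize_uncased(text):
    """Lowercase and strip accents, roughly what uncased models do."""
    decomposed = unicodedata.normalize("NFD", text.lower())
    # Drop combining marks (category Mn), which carry the accents after NFD
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")

def build_pair_input(tokens_a, tokens_b):
    """Assemble [CLS] A [SEP] B [SEP] plus segment IDs (0 for A, 1 for B)."""
    tokens = ["[CLS]"] + tokens_a + ["[SEP]"] + tokens_b + ["[SEP]"]
    segment_ids = [0] * (len(tokens_a) + 2) + [1] * (len(tokens_b) + 1)
    return tokens, segment_ids

print(normalize_uncased("Café"))  # cafe

tokens, segment_ids = build_pair_input(["the", "cat", "sat"], ["it", "purr", "##ed"])
print(tokens)
# ['[CLS]', 'the', 'cat', 'sat', '[SEP]', 'it', 'purr', '##ed', '[SEP]']
print(segment_ids)
# [0, 0, 0, 0, 0, 1, 1, 1, 1]
```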

✅ SentencePiece

Used by: Llama (1 and 2), T5, mT5, XLNet, ALBERT, MarianMT, etc.
Principle: SentencePiece is a self-contained tokeniser trained directly on raw text. It does not assume that words are separated by spaces.

How it works

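  • It makes no assumptions about space separation (essential for languages written without spaces, such as Japanese or Chinese)
  • It operates on Unicode characters, with optional byte fallback for characters missing from the vocabulary
  • It implements two sub-word algorithms:
    • Unigram Language Model (the SentencePiece default, used by T5)
    • BPE (selected with model_type=bpe)
  • The Unigram model is trained by starting from a large candidate vocabulary and repeatedly pruning the pieces whose removal least reduces the corpus likelihood, until the target vocabulary size is reached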
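
Once a Unigram model is trained, segmenting a string means finding the split whose pieces have the highest combined probability. The sketch below does this with a small Viterbi search; the piece probabilities are invented for illustration, and the function name `viterbi_segment` is an assumption, not a SentencePiece API.

```python
import math

def viterbi_segment(text, log_probs):
    """Return the segmentation of `text` with the highest total log-probability."""
    n = len(text)
    best_score = [0.0] + [-math.inf] * n  # best score for text[:i]
    back = [0] * (n + 1)                  # start index of the last piece
    for end in range(1, n + 1):
        for start in range(end):
            piece = text[start:end]
            if piece not in log_probs:
                continue
            score = best_score[start] + log_probs[piece]
            if score > best_score[end]:
                best_score[end] = score
                back[end] = start
    if best_score[n] == -math.inf:
        raise ValueError("text cannot be segmented with this vocabulary")
    pieces, end = [], n
    while end > 0:
        pieces.append(text[back[end]:end])
        end = back[end]
    return pieces[::-1]

# Hypothetical piece probabilities
probs = {"▁cat": 0.10, "s": 0.10, "▁c": 0.02, "at": 0.05,
         "a": 0.08, "t": 0.08, "▁": 0.05, "c": 0.03}
log_probs = {piece: math.log(p) for piece, p in probs.items()}

print(viterbi_segment("▁cats", log_probs))  # ['▁cat', 's']
```

Here `['▁cat', 's']` wins because its probability (0.10 × 0.10) is far higher than any split into smaller pieces.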

Example with "The cat sat.":
['▁The', '▁cat', '▁sat', '.']

The ▁ character (U+2581) represents a space integrated into the token, allowing for the exact reconstruction of the original text.
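
Because the space lives inside the pieces, detokenisation is plain string concatenation. The sketch below illustrates the idea; it ignores the Unicode normalisation that SentencePiece applies before tokenising, and the helper names are assumptions.

```python
SPACE_MARK = "\u2581"  # the ▁ symbol

def to_marked_text(text):
    """Replace spaces with the marker and add the leading marker."""
    return SPACE_MARK + text.replace(" ", SPACE_MARK)

def detokenize(pieces):
    """Concatenate pieces and turn markers back into spaces."""
    text = "".join(pieces).replace(SPACE_MARK, " ")
    return text[1:] if text.startswith(" ") else text

pieces = ["▁The", "▁cat", "▁sat", "."]
print(to_marked_text("The cat sat."))  # ▁The▁cat▁sat.
print(detokenize(pieces))              # The cat sat.
```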

Multilingual Example

Japanese phrase: "私はAIが好きです"
['▁私', 'は', 'AI', 'が', '好', 'き', 'です']

Advantages
  • Language-independent, regardless of writing conventions
  • Handles languages without spaces without needing a separate word segmenter
  • Accurate representation of raw text (exact reversibility)
Disadvantages
  • Tokens are less human-readable
  • Segmentation can sometimes be counter-intuitive
  • Unigram LM is more computationally expensive to train

Practical implementation with SentencePiece

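import sentencepiece as spm

# "m.model" is a placeholder path to a trained SentencePiece model file
sp = spm.SentencePieceProcessor()
sp.load("m.model")

# out_type=str returns the pieces as strings instead of integer IDs
tokens = sp.encode("The cat sat.", out_type=str)
print(tokens)
# Example output (depends on the trained model):
# ['▁The', '▁cat', '▁sat', '.']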

Typical Use Cases

| Model | Method Used | Characteristics |
| --- | --- | --- |
| T5 | Unigram LM | Full text-to-text, position agnostic |
| Llama | BPE in SentencePiece | Long context, multilingual |
| mT5 | Unigram | 101 languages, balanced performance |
| ByT5 | Byte-level | Directly on UTF-8 (maximum robustness) |
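
The byte-level row is the easiest to illustrate, since no learned vocabulary is involved. The sketch below treats each UTF-8 byte as a token; it omits the handful of IDs that a real byte-level model reserves for special tokens, and the function name `utf8_byte_tokens` is an assumption.

```python
def utf8_byte_tokens(text):
    """Treat every UTF-8 byte as one token (values 0-255)."""
    return list(text.encode("utf-8"))

for text in ["The cat sat.", "私はAIが好きです"]:
    ids = utf8_byte_tokens(text)
    print(len(text), "characters ->", len(ids), "byte tokens")
# 12 characters -> 12 byte tokens
# 9 characters -> 23 byte tokens

# The mapping is lossless: decoding the bytes restores the text
assert bytes(utf8_byte_tokens("私はAIが好きです")).decode("utf-8") == "私はAIが好きです"
```

The robustness comes at a price: scripts whose characters take three bytes each produce much longer sequences than a sub-word vocabulary would.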

Edge Cases in Tokenisation

Consider a variable identifier or a rare name:

text = "z3r0C00l@#"

GPT-2 (illustrative split):

['z', '3', 'r', '0', 'C', '00', 'l', '@', '#']

BERT, uncased (illustrative split):

['z', '##3', '##r', '##0', '##c', '##00', '##l', '@', '#']

The more unfamiliar a word is, the more it will be fragmented—which impacts:

  • Sequence length (and consequently the quadratic cost of the attention mechanism)
  • Semantic reasoning capabilities of the model
  • Cross-attention in encoder-decoder architectures
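
One simple way to watch for this effect is to measure how many tokens a tokeniser produces per word. The sketch below applies that ratio to two token lists taken from this article; the function name `tokens_per_word` is an assumption, and whitespace splitting is only a rough definition of a word.

```python
def tokens_per_word(tokens, text):
    """Average number of tokens per whitespace-separated word."""
    words = text.split()
    return len(tokens) / len(words)

examples = {
    "The cat sat.": ["The", "Ġcat", "Ġsat", "."],
    "z3r0C00l@#": ["z", "3", "r", "0", "C", "00", "l", "@", "#"],
}

for text, tokens in examples.items():
    print(f"{text!r}: {tokens_per_word(tokens, text):.2f} tokens per word")
# 'The cat sat.': 1.33 tokens per word
# 'z3r0C00l@#': 9.00 tokens per word
```

A ratio that climbs sharply on domain-specific text is a hint that the vocabulary fits that domain poorly.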

Impact on Performance

  • Overly granular tokenisation: Leads to longer sequences and increased memory consumption
  • Overly coarse tokenisation: Results in loss of generalisation and an increase in OOV (Out-Of-Vocabulary) instances
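
Newer tokenisers, such as tiktoken (OpenAI), manage this trade-off by combining byte-level BPE with a regex pre-tokenisation step that splits text along Unicode character classes (letters, digits, punctuation, whitespace) before any merges are applied.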
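
As a rough sketch of what such a pre-tokenisation step does, the pattern below splits text into runs of letters, digits, punctuation, or whitespace, keeping one leading space attached to the following chunk. It is a simplified stand-in written with Python's standard `re` module, not the pattern used by GPT-2 or tiktoken (for instance, it has no rule for contractions); `PRETOKENIZE` and `pretokenize` are hypothetical names.

```python
import re

# Letters | digits | punctuation and symbols (including "_") | other whitespace
PRETOKENIZE = re.compile(r" ?[^\W\d_]+| ?\d+| ?(?:[^\s\w]|_)+|\s+")

def pretokenize(text):
    """Split text into chunks; merges are later applied within each chunk."""
    return PRETOKENIZE.findall(text)

print(pretokenize("The cat sat."))
# ['The', ' cat', ' sat', '.']
print(pretokenize("z3r0C00l@#"))
# ['z', '3', 'r', '0', 'C', '00', 'l', '@#']
```

Because merges never cross chunk boundaries, letters and digits are never fused into one token, which is one reason the identifier above fragments so heavily.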

Tokenisation – Python Example

from transformers import GPT2TokenizerFast

# Loads GPT-2's byte-level BPE vocabulary and merge rules
tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
tokens = tokenizer.tokenize("The cat sat.")
print(tokens)
# ['The', 'Ġcat', 'Ġsat', '.']

The Ġ character is the printable stand-in for a space in GPT-2's byte-level vocabulary. The merge rules themselves are learned from frequent patterns in the original training corpus.
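
Where Ġ comes from can be shown with a short sketch modelled on the byte-to-unicode table in OpenAI's GPT-2 encoder: printable bytes keep their own character, and every other byte (including the space, byte 32) is shifted to a code point from 256 upwards. Treat it as an approximation of that table for illustration; `to_gpt2_symbols` is a hypothetical helper.

```python
def bytes_to_unicode():
    """Map each of the 256 byte values to a printable Unicode character."""
    printable = (
        list(range(ord("!"), ord("~") + 1))
        + list(range(ord("¡"), ord("¬") + 1))
        + list(range(ord("®"), ord("ÿ") + 1))
    )
    byte_values = printable[:]
    code_points = printable[:]
    shift = 0
    for b in range(256):
        if b not in printable:
            # Non-printable byte: assign the next code point above 255
            byte_values.append(b)
            code_points.append(256 + shift)
            shift += 1
    return dict(zip(byte_values, map(chr, code_points)))

byte_encoder = bytes_to_unicode()

def to_gpt2_symbols(text):
    """Show text the way a GPT-2 style vocabulary displays it."""
    return "".join(byte_encoder[b] for b in text.encode("utf-8"))

print(byte_encoder[32])         # Ġ
print(to_gpt2_symbols(" cat"))  # Ġcat
```

The space is the 33rd non-printable byte (index 32), so it lands on code point 256 + 32 = 288, which is Ġ.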

Conclusion

Tokenisation is far more than mere text pre-processing: it is the fundamental gateway to a language model. The choice of algorithm (BPE, WordPiece, SentencePiece) directly influences how the model perceives and encodes information.

A solid understanding of tokenisation allows you to:

  • Better interpret LLM outputs
  • Optimise training and inference
  • Adapt the processing pipeline to domain-specific constraints

It is also a critical step for any fine-tuning project or custom LLM pre-training initiative.
