NLP - Beginner - 10 min

Learn Tokenization & Preprocessing

Last updated: 2026-05-13.

Computers don't read words — they read numbers. Before a model can process text, the text must be split into discrete units called tokens, and each token must be mapped to an integer ID. This step is called tokenization, and the choice of strategy quietly determines a model's vocabulary size, generalisation, and ability to handle new words.
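
As a rough illustration of the token-to-ID mapping described above, here is a minimal sketch. The sentence, the whitespace split, and the first-seen ID assignment are assumptions chosen for the example, not how any particular model builds its vocabulary.

```python
# Toy vocabulary: each distinct token gets the next free integer ID.
# The sentence and the whitespace split are illustrative assumptions.
text = "the cat sat on the mat"
tokens = text.split()

vocab = {}
for tok in tokens:
    if tok not in vocab:
        vocab[tok] = len(vocab)

ids = [vocab[tok] for tok in tokens]

print(tokens)  # ['the', 'cat', 'sat', 'on', 'the', 'mat']
print(ids)     # [0, 1, 2, 3, 0, 4]
```

The repeated word 'the' maps to the same ID both times, which is what lets a model treat it as the same unit wherever it appears.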

Three Tokenization Strategies

  • Word-level: split on whitespace and punctuation. Simple, but the vocabulary explodes: every plural, typo, and conjugation becomes a new token. Out-of-vocabulary words are replaced by <UNK>, so their information is lost.
  • Character-level: every character is a token. The vocabulary is tiny (~100 symbols), but sequences are 5-8× longer, and the model has to learn for itself that the sequence 'c', 'a', 't' forms the word 'cat'. Training is slower and harder.
  • Subword (BPE / WordPiece / SentencePiece): the modern default. Common words stay whole ('the', 'cat'), while rare words split into known fragments ('un' + 'happy' + 'ness'). The vocabulary stays around 30k-100k entries while keeping all words representable.
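
The three strategies can be compared side by side with a small sketch. Everything here is an assumption made for illustration: the vocabularies are hand-written, and the subword function uses greedy longest-match lookup, which is a simplification (closer to WordPiece-style lookup than to BPE merging).

```python
import re

def word_tokenize(text):
    # Words and single punctuation marks become separate tokens.
    return re.findall(r"\w+|[^\w\s]", text.lower())

def char_tokenize(text):
    # Every character is its own token.
    return list(text)

# Hypothetical word-level vocabulary from a tiny "training" set.
word_vocab = {"<UNK>": 0, "the": 1, "cat": 2, "saw": 3, ".": 4}

def word_ids(text):
    unk = word_vocab["<UNK>"]
    return [word_vocab.get(tok, unk) for tok in word_tokenize(text)]

# Hypothetical subword vocabulary (hand-picked fragments).
subword_vocab = {"the", "cat", "saw", "un", "kind", "ness"}

def subword_tokenize(word):
    pieces, start = [], 0
    while start < len(word):
        # Try the longest remaining substring first.
        for end in range(len(word), start, -1):
            if word[start:end] in subword_vocab:
                pieces.append(word[start:end])
                start = end
                break
        else:
            # No fragment matched: fall back to one character.
            pieces.append(word[start])
            start += 1
    return pieces

print(word_ids("The cat saw unkindness."))  # [1, 2, 3, 0, 4]
print(len(char_tokenize("unkindness")))     # 10
print(subword_tokenize("unkindness"))       # ['un', 'kind', 'ness']
```

The word-level encoder collapses 'unkindness' to the <UNK> ID, the character-level one needs ten tokens for it, and the subword one keeps it representable in three known pieces.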

Byte-Pair Encoding (BPE) — the GPT approach

Training BPE on a corpus:

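  1. Start with a character-level vocabulary
  2. Count all adjacent symbol pairs in the corpus (initially these are character pairs)
  3. Merge the most frequent pair into one token
  4. Add the merged token to the vocabulary
  5. Repeat from step 2 until the vocabulary reaches its target size (e.g. ~30,000 merges)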

Example (small corpus):
  Initial:  'l', 'o', 'w', '#'  (# = end-of-word)
  Pair counts: 'l o' = 5, 'o w' = 4, 'w #' = 3 ...
  Merge 'l o' → 'lo'
  Pair counts: 'lo w' = 4, 'o w' = 0 (consumed) ...
  Continue merging until vocab is full.
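
The training loop can be sketched in a few lines of Python. This is a minimal illustration, not a production tokenizer: the function names are hypothetical, and the toy corpus is an assumption chosen so that the first pair counts match the example above ('l o' = 5, 'o w' = 4, 'w #' = 3).

```python
from collections import Counter

def pair_counts(words):
    """Count adjacent symbol pairs, weighted by word frequency."""
    counts = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            counts[(a, b)] += freq
    return counts

def merge_pair(words, pair):
    """Replace every occurrence of `pair` with one merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged

def train_bpe(word_freqs, num_merges):
    # Start from characters, with '#' as the end-of-word marker.
    words = {tuple(w) + ("#",): f for w, f in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        counts = pair_counts(words)
        if not counts:
            break
        best = max(counts, key=counts.get)  # ties: first pair seen wins
        merges.append(best)
        words = merge_pair(words, best)
    return merges, words

# Hypothetical toy corpus: word -> frequency.
corpus = {"low": 3, "lower": 1, "lot": 1}
merges, words = train_bpe(corpus, num_merges=3)

print(merges)  # [('l', 'o'), ('lo', 'w'), ('low', '#')]
```

A real tokenizer would run tens of thousands of merges over a large corpus and usually operates on bytes rather than characters, but the loop has the same shape.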

At inference, apply the same merges to new text, in the order they were learned.
Result: common letter sequences like 'th', 'ing', 'tion' become single tokens.
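
Encoding new text with a trained merge list might look like the sketch below. The merge list is a hypothetical one, hard-coded here so the example runs on its own, and the function name is an assumption rather than any library's API.

```python
def bpe_encode(word, merges):
    """Apply learned merges, in the order they were learned, to one word."""
    symbols = list(word) + ["#"]  # '#' marks end-of-word
    for a, b in merges:
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and symbols[i] == a and symbols[i + 1] == b:
                out.append(a + b)
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        symbols = out
    return symbols

# Hypothetical merge list, as a trained tokenizer might store it.
merges = [("l", "o"), ("lo", "w"), ("low", "#")]

print(bpe_encode("low", merges))     # ['low#']
print(bpe_encode("lowest", merges))  # ['low', 'e', 's', 't', '#']
print(bpe_encode("slow", merges))    # ['s', 'low#']
```

Words never seen in training ('lowest', 'slow') still encode without an <UNK> token: the known fragment is reused and the rest falls back to single characters.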

BPE: bottom-up merging until vocab budget is full · used by GPT-2/3/4, RoBERTa

Preprocessing — what happens before tokenization

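  • Unicode normalisation: NFKC normalisation collapses 'é' as a single code point (U+00E9) and 'e' followed by a combining acute accent (U+0301) into the same form, so strings that look identical also tokenize identically.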
  • Lowercasing: standard in older NLP pipelines (e.g. bert-base-uncased). Modern LLMs (GPT-4, LLaMA) are case-sensitive because case carries meaning ('Apple' vs 'apple').
  • Punctuation handling: punctuation can be split into separate tokens or kept attached to the surrounding word. Subword tokenizers such as BPE usually handle this automatically.
  • Special tokens: [CLS] (classification), [SEP] (separator), [MASK] (BERT-style masking), <BOS>/<EOS> (start/end of sequence), <PAD> (padding short sequences in a batch).
  • Truncation: sequences longer than the model's maximum length (512 tokens for BERT; 8k to 128k for GPT-4, depending on the variant) must be cut to fit.
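
Several of these steps can be sketched with the Python standard library alone. The special-token IDs, the function names, and the choice to keep room for <BOS> and <EOS> when truncating are assumptions for illustration; real tokenizers define their own IDs and truncation rules.

```python
import unicodedata

def normalise(text, lowercase=False):
    # NFKC: compose characters and fold compatibility forms.
    text = unicodedata.normalize("NFKC", text)
    return text.lower() if lowercase else text

composed = "\u00e9"     # 'é' as one code point
decomposed = "e\u0301"  # 'e' followed by a combining acute accent
print(composed == decomposed)                        # False
print(normalise(composed) == normalise(decomposed))  # True
print(normalise("Apple", lowercase=True))            # apple

# Hypothetical special-token IDs.
PAD, BOS, EOS = 0, 1, 2

def prepare_batch(sequences, max_len):
    """Add <BOS>/<EOS>, truncate to max_len, then pad with <PAD>.

    Assumes max_len >= 2 so both special tokens always fit.
    """
    batch = []
    for ids in sequences:
        ids = [BOS] + ids[: max_len - 2] + [EOS]
        batch.append(ids + [PAD] * (max_len - len(ids)))
    return batch

batch = prepare_batch([[5, 6, 7, 8, 9, 10], [11, 12]], max_len=6)
print(batch)  # [[1, 5, 6, 7, 8, 2], [1, 11, 12, 2, 0, 0]]
```

The first sequence loses its last two tokens to truncation, while the second is padded so that both rows have the same length and can be stacked into one batch.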

Practice questions

  1. Why is subword tokenization (BPE/WordPiece) the dominant approach in modern LLMs?
  2. Approximately how many tokens does an English paragraph of 100 words occupy in a typical BPE tokenizer?
  3. What's the key idea behind the BPE training algorithm?
  4. Why must you use the same tokenizer that was used during pre-training when fine-tuning a model?
