NLP - Advanced - 18 min

Learn BERT — Bidirectional Encoder


Last updated: 2026-05-13.

BERT (Bidirectional Encoder Representations from Transformers, Google, 2018) made pre-training followed by task-specific fine-tuning the standard approach in NLP. The trick: train a transformer encoder by randomly masking tokens in a sentence and asking it to predict them, using context from BOTH sides. This pre-training task is general enough that the resulting model can be fine-tuned in a few hours, and at release it set new state-of-the-art results on eleven NLP tasks, including GLUE and SQuAD.

Architecture: Encoder-Only Transformer

BERT = a stack of transformer encoder blocks (no decoder).

BERT-base:    12 encoder layers, 768-dim, 12 heads → 110M params
BERT-large:   24 encoder layers, 1024-dim, 16 heads → 340M params
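
The parameter counts above can be roughly reproduced from the layer sizes. The sketch below is a back-of-the-envelope count, not an official formula. It assumes a 30,522-entry WordPiece vocabulary, 512 positions, 2 segments, a feed-forward width of 4x the hidden size, and a dense "pooler" layer on top of [CLS]; none of these values are stated in this lesson. Note that the number of attention heads does not change the count, because the heads split the same projection matrices.

```python
def bert_param_count(layers, hidden, vocab=30522, max_pos=512, segments=2):
    """Count weights in a BERT-style encoder (sizes are assumptions, see text)."""
    ffn = 4 * hidden          # assumed feed-forward width
    layer_norm = 2 * hidden   # scale + bias

    # Token, position and segment embedding tables, plus one LayerNorm.
    embeddings = (vocab + max_pos + segments) * hidden + layer_norm

    # Self-attention: query, key, value and output projections (weights + biases).
    attention = 4 * (hidden * hidden + hidden) + layer_norm
    # Feed-forward: hidden -> ffn -> hidden (weights + biases).
    feed_forward = (hidden * ffn + ffn) + (ffn * hidden + hidden) + layer_norm

    # Dense layer applied to the final [CLS] vector.
    pooler = hidden * hidden + hidden

    return embeddings + layers * (attention + feed_forward) + pooler


base = bert_param_count(layers=12, hidden=768)
large = bert_param_count(layers=24, hidden=1024)
print(f"BERT-base:  {base / 1e6:.1f}M")
print(f"BERT-large: {large / 1e6:.1f}M")
```

Under these assumptions the totals come out at about 109.5M and 335.1M, in line with the rounded figures quoted above.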

Key design choices:
  • All attention is bidirectional — every token attends to every other token (no causal mask)
  • Input format: [CLS] sentence_A [SEP] sentence_B [SEP]
      where [CLS] aggregates info for classification
      [SEP] separates sentence pairs (for entailment, QA tasks)
  • Each token has 3 embeddings summed: token + position + segment (sentence A vs B)
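
To make the input format concrete, here is a minimal sketch of how a sentence pair could be packed and how the three embeddings combine. It assumes whitespace-split words instead of BERT's real WordPiece tokenizer, and the helper names `pack_pair` and `embed` are hypothetical.

```python
def pack_pair(tokens_a, tokens_b=None):
    """Build BERT-style input tokens, segment ids and position ids."""
    tokens = ["[CLS]"] + tokens_a + ["[SEP]"]
    segment_ids = [0] * len(tokens)  # segment A covers [CLS] and the first [SEP]
    if tokens_b:
        tokens += tokens_b + ["[SEP]"]
        segment_ids += [1] * (len(tokens_b) + 1)  # segment B and the closing [SEP]
    position_ids = list(range(len(tokens)))
    return tokens, segment_ids, position_ids


def embed(token_vec, position_vec, segment_vec):
    """Input embedding = element-wise sum of the three vectors."""
    return [t + p + s for t, p, s in zip(token_vec, position_vec, segment_vec)]


tokens, segment_ids, position_ids = pack_pair(["the", "cat", "sat"], ["it", "purred"])
print(tokens)        # ['[CLS]', 'the', 'cat', 'sat', '[SEP]', 'it', 'purred', '[SEP]']
print(segment_ids)   # [0, 0, 0, 0, 0, 1, 1, 1]
```

In a real model each id would index a learned embedding table, and `embed` would be applied to the three looked-up vectors for every position.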

BERT uses only the encoder half of the original Transformer architecture.
A decoder is unnecessary because BERT does not generate text; it only encodes its input.

Encoder = bidirectional context · output: a rich vector per input token + a sentence vector at [CLS]
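
One way to picture "bidirectional" is through the attention mask. The sketch below builds the all-ones mask an encoder like BERT uses and, for contrast, the lower-triangular causal mask of a left-to-right decoder. The function name `attention_mask` is hypothetical, and padding is ignored for simplicity.

```python
def attention_mask(seq_len, causal=False):
    """mask[i][j] == 1 means token i may attend to token j."""
    return [
        [1 if (not causal or j <= i) else 0 for j in range(seq_len)]
        for i in range(seq_len)
    ]


bidirectional = attention_mask(4)        # encoder: every token sees every token
causal = attention_mask(4, causal=True)  # decoder: token i sees only tokens 0..i

for row in bidirectional:
    print(row)
print()
for row in causal:
    print(row)
```

With the causal mask, the first token can attend only to itself; with the bidirectional mask, it can attend to the whole sentence.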

Pre-training Objective 1: Masked Language Modelling (MLM)

For each training example (sentence from corpus):

  1. Randomly select 15% of tokens.
  2. Of those:
       • 80% replaced with [MASK]
       • 10% replaced with a random token from the vocabulary
       • 10% kept unchanged
  3. Run sentence through BERT.
  4. At each masked position, predict the original token via softmax over vocab.
  5. Loss = cross-entropy on masked positions only.

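Why 80/10/10? If every selected token became [MASK], the model would see [MASK] at every position it has to predict during pre-training but never during fine-tuning, and that mismatch hurts transfer. Replacing some selected tokens with random tokens and leaving others unchanged forces the model to keep a useful representation of every input token, because it cannot tell which ones it will be asked to predict.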
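
The masking procedure can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not the original implementation: it works on word strings instead of WordPiece ids, the function name `mask_tokens` and the toy vocabulary are hypothetical, and it always selects at least one token.

```python
import random

SPECIAL = {"[CLS]", "[SEP]"}


def mask_tokens(tokens, vocab, mask_prob=0.15, rng=None):
    """Return (corrupted tokens, {position: original token}) for MLM training."""
    rng = rng or random.Random()
    candidates = [i for i, t in enumerate(tokens) if t not in SPECIAL]
    n_select = min(len(candidates), max(1, round(len(candidates) * mask_prob)))

    corrupted = list(tokens)
    targets = {}
    for i in rng.sample(candidates, n_select):
        targets[i] = tokens[i]          # the model must predict this token
        roll = rng.random()
        if roll < 0.8:
            corrupted[i] = "[MASK]"     # 80%: replace with [MASK]
        elif roll < 0.9:
            corrupted[i] = rng.choice(vocab)  # 10%: replace with a random token
        # remaining 10%: leave the token unchanged
    return corrupted, targets


rng = random.Random(0)  # fixed seed so the sketch is repeatable
tokens = ["[CLS]", "the", "cat", "sat", "on", "the", "mat", "[SEP]"]
toy_vocab = ["dog", "ran", "blue", "tree"]  # hypothetical stand-in vocabulary
corrupted, targets = mask_tokens(tokens, toy_vocab, rng=rng)
print(corrupted)
print(targets)
```

The loss is later computed only at the positions stored in `targets`, including those whose token was left unchanged.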

Example:
  Original:  'The cat sat on the mat.'
  Masked:    'The [MASK] sat on the mat.'
  Target:    cat
  Loss:      −log P(cat | context)
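
The loss in this example can be worked through numerically. The sketch below uses a made-up four-word vocabulary and made-up scores at the [MASK] position, so the numbers are purely illustrative; the helper names `softmax` and `mlm_loss` are hypothetical.

```python
import math


def softmax(logits):
    """Turn raw scores into probabilities (shifted by the max for stability)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def mlm_loss(logits_per_position, target_ids):
    """Mean cross-entropy over the masked positions only."""
    losses = []
    for pos, target in target_ids.items():
        probs = softmax(logits_per_position[pos])
        losses.append(-math.log(probs[target]))
    return sum(losses) / len(losses)


vocab = ["cat", "dog", "mat", "sat"]     # toy vocabulary (assumption)
logits = {1: [2.0, 1.0, 0.1, -1.0]}      # made-up scores at position 1, the [MASK]
loss = mlm_loss(logits, {1: vocab.index("cat")})
print(f"P(cat | context) = {softmax(logits[1])[0]:.3f}")
print(f"loss = {loss:.3f}")
```

The higher the probability the model assigns to 'cat', the closer the loss gets to zero.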

Predicting the missing word forces deep understanding of bidirectional context

Pre-training Objective 2: Next Sentence Prediction (NSP)

Given two sentences A and B:
  • 50% of the time, B truly follows A in the corpus → label = IsNext
  • 50% of the time, B is a random sentence       → label = NotNext
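
The 50/50 pair sampling can be sketched as follows. This is a simplified illustration: the corpus is assumed to be a list of documents, each a list of sentences, at least two documents are required, and the negative sentence is drawn from a different document so it cannot be the true next sentence. The name `make_nsp_example` is hypothetical.

```python
import random


def make_nsp_example(documents, rng=None):
    """Return (sentence_a, sentence_b, label) for next sentence prediction."""
    rng = rng or random.Random()
    # Sentence A must have a successor, so pick a document with 2+ sentences.
    doc_idx = rng.choice([i for i, d in enumerate(documents) if len(d) >= 2])
    doc = documents[doc_idx]
    i = rng.randrange(len(doc) - 1)
    sentence_a = doc[i]

    if rng.random() < 0.5:
        return sentence_a, doc[i + 1], "IsNext"

    # Negative example: a random sentence from some other document.
    other_idx = rng.choice([j for j in range(len(documents)) if j != doc_idx])
    return sentence_a, rng.choice(documents[other_idx]), "NotNext"


documents = [
    ["The cat sat on the mat.", "It purred loudly.", "Then it fell asleep."],
    ["Stocks fell on Monday.", "Analysts blamed rising rates."],
]
rng = random.Random(0)  # fixed seed so the sketch is repeatable
for _ in range(3):
    print(make_nsp_example(documents, rng=rng))
```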

Classification is done from the [CLS] token's final embedding via a small binary classification head.

Motivation: many downstream tasks (entailment, QA) involve sentence pairs.
NSP gives BERT an explicit pre-training signal for inter-sentence relationships.

Note: subsequent work (RoBERTa) showed that NSP is not essential: removing it and training longer with MLM alone matched or improved downstream performance. NSP is nonetheless part of the original BERT recipe.

MLM + NSP together = the original BERT recipe

Fine-tuning: One Pretrained Model, Many Tasks

  • Sentence classification (sentiment, topic): use [CLS] embedding → linear layer → softmax. Fine-tune all weights for 2-4 epochs on small task-specific data.
  • Token-level tasks (NER, POS): use the per-token output embedding → linear layer → softmax. Each token gets its own predicted label.
  • Sentence pair tasks (entailment, similarity): pack both sentences with [SEP], use [CLS] embedding for the pair-level prediction.
  • Question answering (SQuAD): pack [CLS] question [SEP] passage [SEP]. Predict start and end token indices in the passage that bracket the answer.
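
For the question answering case, one common way to turn per-token start and end scores into an answer is to pick the span with the highest combined score, subject to start <= end. The sketch below assumes the model has already produced one start score and one end score per token; the scores are made up, and `best_span`, `passage_start`, `passage_end` and `max_answer_len` are hypothetical names and settings.

```python
def best_span(start_scores, end_scores, passage_start, passage_end, max_answer_len=30):
    """Return (start, end, score) for the highest-scoring span with start <= end.

    Only positions in [passage_start, passage_end) are considered, so the
    answer cannot land in the question or on a special token.
    """
    best = (None, None, float("-inf"))
    for s in range(passage_start, passage_end):
        last = min(s + max_answer_len, passage_end)
        for e in range(s, last):
            score = start_scores[s] + end_scores[e]
            if score > best[2]:
                best = (s, e, score)
    return best


tokens = ["[CLS]", "who", "sat", "?", "[SEP]", "the", "cat", "sat", "[SEP]"]
start_scores = [0.1, 0.0, 0.0, 0.0, 0.0, 2.0, 1.0, 0.2, 0.0]  # made-up values
end_scores = [0.1, 0.0, 0.0, 0.0, 0.0, 0.3, 2.5, 0.4, 0.0]    # made-up values

start, end, score = best_span(start_scores, end_scores, passage_start=5, passage_end=8)
print(tokens[start:end + 1])  # ['the', 'cat']
```

The classification and token-level cases are simpler: a single linear layer plus softmax on the [CLS] vector or on each token vector.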

Practice questions

  1. What is the key task BERT is pre-trained on?
  2. Why is BERT called 'bidirectional', and how does this differ from GPT?
  3. What does the [CLS] token in BERT do?
  4. Why was BERT's release so significant for the NLP field?
