BERT (Bidirectional Encoder Representations from Transformers, Google 2018) made pre-train-then-fine-tune the standard recipe for NLP. The trick: train a transformer encoder by randomly masking words in a sentence and asking it to predict them, using context from BOTH sides. This pre-training task is general enough that the resulting model can be fine-tuned in a few hours on task-specific data, and at release fine-tuned BERT set state-of-the-art results on eleven NLP tasks.
Architecture: Encoder-Only Transformer
BERT = a stack of transformer encoder blocks (no decoder).
BERT-base: 12 encoder layers, 768-dim, 12 heads → 110M params
BERT-large: 24 encoder layers, 1024-dim, 16 heads → 340M params
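The rounded parameter counts above can be roughly reproduced from the layer sizes. The sketch below is an estimate under stated assumptions that are not given in the text: a 30,522-token WordPiece vocabulary, 512 positions, a feed-forward width of 4x the hidden size, and a dense pooling layer on top of [CLS]. The function name bert_param_count is hypothetical.

```python
def bert_param_count(layers, hidden, vocab=30522, max_pos=512, segments=2):
    """Approximate weight + bias count for a BERT-style encoder.

    vocab, max_pos and the 4x feed-forward width are assumptions,
    not values stated in the surrounding text.
    """
    ffn = 4 * hidden                    # assumed feed-forward width
    layer_norm = 2 * hidden             # scale + shift vectors
    embeddings = (vocab + max_pos + segments) * hidden + layer_norm
    attention = 4 * (hidden * hidden + hidden)   # Q, K, V and output projections
    feed_forward = (hidden * ffn + ffn) + (ffn * hidden + hidden)
    per_layer = attention + layer_norm + feed_forward + layer_norm
    pooler = hidden * hidden + hidden   # dense layer applied to the [CLS] vector
    return embeddings + layers * per_layer + pooler


base = bert_param_count(layers=12, hidden=768)
large = bert_param_count(layers=24, hidden=1024)
print(f"BERT-base:  {base / 1e6:.1f}M parameters")
print(f"BERT-large: {large / 1e6:.1f}M parameters")
```

The number of attention heads does not appear in the count: the heads split the hidden dimension rather than adding to it. Under these assumptions the estimates land close to the rounded 110M and 340M figures.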
Key design choices:
• All attention is bidirectional — every token attends to every other token (no causal mask)
• Input format: [CLS] sentence_A [SEP] sentence_B [SEP]
  where [CLS] aggregates info for classification
  and [SEP] separates sentence pairs (for entailment, QA tasks)
• Each token has 3 embeddings summed: token + position + segment (sentence A vs B)
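The input format and the three summed embeddings can be illustrated with a small sketch. It uses whitespace-level toy tokens instead of a real WordPiece tokenizer, and random lists in place of learned embedding tables; pack_pair, random_table and embed are hypothetical helper names.

```python
import random


def pack_pair(tokens_a, tokens_b=None):
    """Build [CLS] A [SEP] (B [SEP]) plus segment and position ids."""
    tokens = ["[CLS]"] + list(tokens_a) + ["[SEP]"]
    segment_ids = [0] * len(tokens)               # segment A, incl. [CLS] and first [SEP]
    if tokens_b is not None:
        tokens += list(tokens_b) + ["[SEP]"]
        segment_ids += [1] * (len(tokens_b) + 1)  # segment B, incl. final [SEP]
    position_ids = list(range(len(tokens)))
    return tokens, segment_ids, position_ids


def random_table(rows, dim, rng):
    """Toy stand-in for a learned embedding table."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(dim)] for _ in range(rows)]


def embed(token_ids, segment_ids, position_ids, tok_table, seg_table, pos_table):
    """Input embedding = token + segment + position, summed element-wise."""
    return [
        [t + s + p for t, s, p in zip(tok_table[ti], seg_table[si], pos_table[pi])]
        for ti, si, pi in zip(token_ids, segment_ids, position_ids)
    ]


tokens, segment_ids, position_ids = pack_pair(["the", "cat", "sat"], ["it", "purred"])
vocab = {tok: i for i, tok in enumerate(sorted(set(tokens)))}  # toy vocabulary
token_ids = [vocab[t] for t in tokens]

rng = random.Random(0)
dim = 4  # toy embedding size; BERT-base uses 768
vectors = embed(
    token_ids, segment_ids, position_ids,
    random_table(len(vocab), dim, rng),
    random_table(2, dim, rng),
    random_table(len(tokens), dim, rng),
)
print(tokens)
print(segment_ids)
```

Each row of vectors is the sum of three lookups, so the same word gets a different input vector depending on where it sits and which sentence it belongs to.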
BERT uses only the encoder half of the original Transformer architecture.
The decoder is unnecessary because BERT does not generate text; it only encodes.
Encoder = bidirectional context · output: a rich vector per input token + a sentence vector at [CLS]
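To make "no causal mask" concrete, the sketch below builds the table of which positions each token may attend to, once for a bidirectional encoder and once for a causal decoder. It is an illustration only; attention_mask is a hypothetical helper, and real implementations also mask padding tokens.

```python
def attention_mask(seq_len, causal=False):
    """mask[q][k] is True when query position q may attend to key position k."""
    return [
        [(not causal) or (k <= q) for k in range(seq_len)]
        for q in range(seq_len)
    ]


bidirectional = attention_mask(4)           # BERT-style encoder
causal = attention_mask(4, causal=True)     # decoder-style, for comparison

for name, mask in (("bidirectional", bidirectional), ("causal", causal)):
    print(name)
    for row in mask:
        print("  " + " ".join("x" if allowed else "." for allowed in row))
```

In the bidirectional case every row is fully populated, which is what lets a masked position use words on both its left and its right.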
Pre-training Objective 1: Masked Language Modelling (MLM)
For each training example (sentence from corpus):
1. Randomly select 15% of tokens.
2. Of those:
• 80% replaced with [MASK]
• 10% replaced with a random token from the vocabulary
• 10% kept unchanged
3. Run sentence through BERT.
4. At each masked position, predict the original token via softmax over vocab.
5. Loss = cross-entropy on masked positions only.
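Why 80/10/10? [MASK] never appears during fine-tuning, so a model trained only on [MASK] inputs would face a mismatch between pre-training and fine-tuning. Mixing in random and unchanged tokens means the model cannot tell which positions it will be asked to predict, so it has to keep a good contextual representation of every token.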
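The selection and 80/10/10 corruption steps can be sketched in a few lines. This is a minimal illustration, not the original preprocessing code: mask_tokens is a hypothetical name, the toy vocabulary is made up, and skipping [CLS] and [SEP] when choosing positions is an assumption added here.

```python
import random


def mask_tokens(tokens, vocab, rng, select_prob=0.15, special=("[CLS]", "[SEP]")):
    """Return (corrupted tokens, labels); labels are None at unselected positions."""
    candidates = [i for i, tok in enumerate(tokens) if tok not in special]
    corrupted = list(tokens)
    labels = [None] * len(tokens)
    if not candidates:
        return corrupted, labels

    # Step 1: pick about 15% of the positions (at least one).
    n_select = min(len(candidates), max(1, round(len(candidates) * select_prob)))
    for i in rng.sample(candidates, n_select):
        labels[i] = tokens[i]              # the model must recover the original token
        roll = rng.random()
        if roll < 0.8:                     # 80%: replace with [MASK]
            corrupted[i] = "[MASK]"
        elif roll < 0.9:                   # 10%: replace with a random token
            corrupted[i] = rng.choice(vocab)
        # remaining 10%: leave the token unchanged
    return corrupted, labels


toy_vocab = ["the", "cat", "sat", "on", "mat", "dog", "ran"]  # hypothetical vocabulary
tokens = "[CLS] the cat sat on the mat [SEP]".split()
corrupted, labels = mask_tokens(tokens, toy_vocab, random.Random(0))
print(corrupted)
print(labels)
```

Only positions with a non-None label contribute to the loss, which matches step 5 above. A random replacement can occasionally pick the original token again; this sketch does not special-case that.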
Example:
Original: 'The cat sat on the mat.'
Masked: 'The [MASK] sat on the mat.'
Target: cat
Loss: −log P(cat | context)
Predicting the missing word forces deep understanding of bidirectional context
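The loss in the example is ordinary cross-entropy at the masked position. The sketch below computes it from a handful of made-up scores; the four-word vocabulary and the logit values are hypothetical, chosen only to show the arithmetic, and masked_lm_loss is an illustrative name.

```python
import math


def masked_lm_loss(logits, target):
    """Cross-entropy at one masked position: -log softmax(logits)[target].

    logits maps each vocabulary token to an unnormalised score.
    """
    top = max(logits.values())
    # Log of the softmax denominator, computed stably by subtracting the max.
    log_z = top + math.log(sum(math.exp(v - top) for v in logits.values()))
    return -(logits[target] - log_z)


# Hypothetical scores the model might assign at the [MASK] position.
logits = {"cat": 4.0, "dog": 3.0, "car": 0.5, "mat": 0.0}
loss = masked_lm_loss(logits, "cat")
print(f"loss = {loss:.3f}")
```

The loss shrinks toward zero as the score for the correct token pulls ahead of the rest, and equals log(vocabulary size) when all scores are tied.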
Pre-training Objective 2: Next Sentence Prediction (NSP)
Given two sentences A and B:
• 50% of the time, B truly follows A in the corpus → label = IsNext
• 50% of the time, B is a random sentence → label = NotNext
Classification is done from the [CLS] token's final embedding via a small classification head (a dense layer with tanh, then a linear layer with two outputs).
Motivation: many downstream tasks (entailment, QA) involve sentence pairs.
NSP gives BERT explicit pretraining signal for inter-sentence relationships.
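Building NSP training pairs from a corpus can be sketched as follows. This is an illustration under assumptions: make_nsp_examples is a hypothetical name, the corpus is a toy list of documents, and drawing the random sentence from a different document (so it cannot accidentally be the true next sentence) is a choice made for this sketch.

```python
import random


def make_nsp_examples(documents, rng):
    """documents: list of documents, each a list of sentences (needs at least two)."""
    examples = []
    for d, doc in enumerate(documents):
        other_docs = [j for j in range(len(documents)) if j != d]
        for i in range(len(doc) - 1):
            if rng.random() < 0.5:
                # B really follows A in the corpus.
                examples.append((doc[i], doc[i + 1], "IsNext"))
            else:
                # B is a random sentence from another document.
                random_doc = documents[rng.choice(other_docs)]
                examples.append((doc[i], rng.choice(random_doc), "NotNext"))
    return examples


documents = [  # toy corpus, made up for illustration
    ["The cat sat on the mat.", "It purred loudly.", "Then it fell asleep."],
    ["Stocks rose on Monday.", "Analysts were surprised.", "Bonds were flat."],
]
examples = make_nsp_examples(documents, random.Random(0))
for sentence_a, sentence_b, label in examples:
    print(label, "|", sentence_a, "|", sentence_b)
```

Each (A, B) pair would then be packed as [CLS] A [SEP] B [SEP] and the label predicted from the [CLS] position.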
Note: subsequent work (RoBERTa) showed NSP wasn't essential: removing it and training longer on MLM only worked equally well or better. But NSP is part of original BERT.
MLM + NSP together = the original BERT recipe
Fine-tuning: One Pretrained Model, Many Tasks
- Sentence classification (sentiment, topic): use [CLS] embedding → linear layer → softmax. Fine-tune all weights for 2-4 epochs on small task-specific data.
- Token-level tasks (NER, POS): use the per-token output embedding → linear layer → softmax. Each token gets its own predicted label.
- Sentence pair tasks (entailment, similarity): pack both sentences with [SEP], use [CLS] embedding for the pair-level prediction.
- Question answering (SQuAD): pack [CLS] question [SEP] passage [SEP]. Predict start and end token indices in the passage that bracket the answer.
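For the question answering case, turning per-token start and end scores into an answer span can be sketched like this. It assumes the fine-tuned model has already produced one start score and one end score per token; the scores below are made up, best_span is a hypothetical helper, and the max_answer_len cap is an assumption added here.

```python
def best_span(start_scores, end_scores, passage_start, passage_end, max_answer_len=30):
    """Pick (start, end) inside the passage maximising start score + end score.

    passage_end is exclusive; the end index may not precede the start index.
    """
    best_start, best_end, best_score = None, None, float("-inf")
    for s in range(passage_start, passage_end):
        for e in range(s, min(passage_end, s + max_answer_len)):
            score = start_scores[s] + end_scores[e]
            if score > best_score:
                best_start, best_end, best_score = s, e, score
    return best_start, best_end


tokens = ["[CLS]", "who", "sat", "?", "[SEP]",
          "the", "cat", "sat", "on", "the", "mat", "[SEP]"]
passage_start, passage_end = 5, 11      # passage tokens, excluding the final [SEP]

# Hypothetical model outputs, one score per token.
start_scores = [0.0] * len(tokens)
end_scores = [0.0] * len(tokens)
start_scores[5] = 2.0                   # "the" looks like an answer start
start_scores[8] = 1.0
end_scores[6] = 3.0                     # "cat" looks like an answer end
end_scores[4] = 5.0                     # high score outside the passage is ignored

start, end = best_span(start_scores, end_scores, passage_start, passage_end)
print(tokens[start:end + 1])
```

Restricting the search to passage positions and to end >= start is what keeps the prediction a valid span; the classification and token-level cases above need only an argmax over the softmax output.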