Learn Fine-tuning LLMs

A pretrained LLM is a generalist trained on a large slice of the internet, but it doesn't speak your specific domain, follow your tone, or perform your exact task out of the box. Fine-tuning takes the pretrained model and continues training it briefly on your task-specific data. The challenge: a 7B-parameter model has 28GB of weights in FP32, and training needs several times that in memory, so full fine-tuning costs serious money. Modern parameter-efficient methods (LoRA, prompt tuning) let you adapt huge models while training well under 1% of the original parameter count.

Full Fine-Tuning (the 2018-2020 default)

Procedure:
  1. Load pretrained weights into a fresh trainer.
  2. Add (or replace) a task-specific head (linear layer for classification,
     or just keep the existing language-model head for generation).
  3. Train ALL weights end-to-end on your task data.
     Use a small learning rate (1e-5 to 5e-5) — much smaller than pretraining
     because you don't want to destroy what's already learned.
  4. Run for 1-5 epochs — overfitting is real with small datasets.
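
The procedure above can be sketched with a toy stand-in. This is only an illustration under loud assumptions: a two-weight linear model plays the role of the pretrained network, step 2 (the task head) is skipped, and the names `predict`, `mse`, and `fine_tune`, the data, and the learning rate of 0.05 are all hypothetical (a real LLM run would use a deep-learning framework and the much smaller rates quoted above).

```python
def predict(w, b, x):
    return sum(wi * xi for wi, xi in zip(w, x)) + b

def mse(w, b, data):
    return sum((predict(w, b, x) - y) ** 2 for x, y in data) / len(data)

def fine_tune(w, b, data, lr=0.05, epochs=3):
    """Continue training ALL parameters with plain SGD on squared error."""
    w = list(w)  # copy so the pretrained weights are left untouched
    for _ in range(epochs):
        for x, y in data:
            err = predict(w, b, x) - y
            # Every parameter receives a gradient update (full fine-tuning)
            w = [wi - lr * 2 * err * xi for wi, xi in zip(w, x)]
            b = b - lr * 2 * err
    return w, b

# Step 1: "pretrained" parameters (toy values, not from any real model)
pretrained_w, pretrained_b = [1.0, -2.0], 0.5

# Task data generated by a slightly different linear rule
inputs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, -0.5]]
task_data = [(x, predict([1.2, -1.7], 0.5, x)) for x in inputs]

# Steps 3-4: train everything for a few epochs
tuned_w, tuned_b = fine_tune(pretrained_w, pretrained_b, task_data)

loss_before = mse(pretrained_w, pretrained_b, task_data)
loss_after = mse(tuned_w, tuned_b, task_data)
print(f"loss before: {loss_before:.4f}, after: {loss_after:.4f}")
```

The tuned parameters move only a short distance from the pretrained ones, which is the point of the small learning rate and the low epoch count.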

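Cost analysis (LLaMA-7B example, everything kept in FP32):
  • 7B params × 4 bytes (FP32) = 28GB just for weights
  • + gradients (one value per weight) = 28GB
  • + optimiser states (Adam keeps two values per weight) = 56GB
  • + activations, which grow with batch size and sequence length
  → Over 112GB before activations, more than a single A100 80GB can hold,
    so training is usually sharded across several GPUs, often 4-8
  → Hours to days of training time
  → Each task = one full fine-tuned model copy = 28GB of weights to store and serve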
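
The arithmetic behind these figures is easy to reproduce. The sketch below is a rough estimate only: it assumes FP32 everywhere and plain Adam, counts one gradient and two optimiser values per weight, and ignores activations entirely. The function name `full_finetune_memory_gb` is made up for this example.

```python
GB = 1e9  # decimal gigabytes, matching 7B × 4 bytes = 28GB

def full_finetune_memory_gb(n_params, bytes_per_param=4):
    """Rough memory for full fine-tuning with Adam, excluding activations."""
    weights = n_params * bytes_per_param
    gradients = n_params * bytes_per_param        # one gradient per weight
    adam_states = 2 * n_params * bytes_per_param  # first and second moment per weight
    return {
        "weights": weights / GB,
        "gradients": gradients / GB,
        "adam_states": adam_states / GB,
        "total": (weights + gradients + adam_states) / GB,
    }

mem = full_finetune_memory_gb(7e9)
for name, gb in mem.items():
    print(f"{name:>12}: {gb:6.1f} GB")
```

For 7B parameters this gives 28GB of weights, 28GB of gradients, and 56GB of Adam states, so about 112GB before any activations are stored.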

Updates every weight · maximum capacity · prohibitive cost for many users

LoRA — Low-Rank Adaptation (2021)

Insight: when fine-tuning, the WEIGHT CHANGES (ΔW) tend to be low-rank.
Decompose them as a product of two small matrices:

  ΔW = B · A   where A: (r × d), B: (d × r), r ≪ d

For a typical attention weight matrix W (d × d) in a 7B model:
  full ΔW:  d × d = 4096 × 4096 = 16.7M params
  LoRA ΔW:  with r=16 → 4096×16 + 16×4096 = 131K params  (128× smaller!)
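
A quick way to check these counts for other shapes and ranks. The helper name `lora_param_counts` is hypothetical; it simply counts the entries of a dense ΔW versus the two factors A and B.

```python
def lora_param_counts(d_in, d_out, r):
    """Return (dense ΔW params, LoRA params) for one weight matrix."""
    full = d_in * d_out           # dense update, same shape as W
    lora = r * d_in + d_out * r   # A is (r × d_in), B is (d_out × r)
    return full, lora

full, lora = lora_param_counts(4096, 4096, 16)
print(f"full: {full:,}  lora: {lora:,}  ratio: {full // lora}x")
```

Halving r halves the LoRA count, so the ratio scales as d / (2r) for a square matrix.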

During fine-tuning:
  • W (pretrained) stays FROZEN
  • Only A and B are trained
  • Effective forward pass: y = (W + B·A) · x
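
The forward pass above can be sketched in NumPy. This is a minimal illustration, not a training setup: the sizes are tiny, there are no gradients, and the scaling factor that LoRA implementations usually apply to B·A is left out. Initialising B to zero (so the adapted model starts out identical to the base model) follows common LoRA practice and is an assumption here, as are all variable names.

```python
import numpy as np

rng = np.random.default_rng(0)
d, r = 8, 2                          # tiny sizes for illustration

W = rng.normal(size=(d, d))          # pretrained weight, kept frozen
A = rng.normal(size=(r, d)) * 0.01   # trainable, small random init
B = np.zeros((d, r))                 # trainable, zero init so ΔW = 0 at the start
x = rng.normal(size=d)

def lora_forward(W, A, B, x):
    # Same result as (W + B @ A) @ x, without building the d × d update
    return W @ x + B @ (A @ x)

# Before any training the adapter is a no-op
y_start = lora_forward(W, A, B, x)
assert np.allclose(y_start, W @ x)

# Stand-in for B after some training steps (random values, not a real result)
B = rng.normal(size=(d, r))
y_adapted = lora_forward(W, A, B, x)

# For serving, the adapter can be folded into the base weight once
W_merged = W + B @ A
assert np.allclose(y_adapted, W_merged @ x)
```

Computing B·(A·x) costs two thin matrix-vector products, while merging trades that for a single dense product with no extra latency at inference time.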

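Results are often within 1-2% of full fine-tuning while training only about 0.5% as many parameters. The forward and backward passes still run through the full frozen model, so the savings come mainly from gradient memory, optimiser memory, and storage rather than raw compute.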
LoRA adapters are ~30MB instead of ~28GB — easy to store, swap, and stack.
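
One set of assumptions that lands near the ~30MB figure: r=16, 32 transformer layers with d=4096 (the LLaMA-7B shape), adapters on two square attention projections per layer, stored in FP32. Which matrices get adapters is a configuration choice, so treat the numbers as a sketch; the helper `lora_adapter_size_mb` is hypothetical.

```python
def lora_adapter_size_mb(d_model, r, n_layers, matrices_per_layer, bytes_per_param=4):
    """Total adapter parameters and their size in decimal megabytes."""
    params_per_matrix = r * d_model + d_model * r   # A plus B for one d × d weight
    total_params = params_per_matrix * matrices_per_layer * n_layers
    return total_params, total_params * bytes_per_param / 1e6

params, size_mb = lora_adapter_size_mb(
    d_model=4096, r=16, n_layers=32, matrices_per_layer=2
)
print(f"{params:,} adapter params, about {size_mb:.0f} MB")
```

Adapting more matrices per layer or raising r grows the adapter proportionally, but it stays orders of magnitude below the 28GB base model.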

Train tiny rank-r matrices · keep base model frozen · near-full performance with ~1% of the parameters trainable

Prompt Tuning / Prefix Tuning

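Even more aggressive: don't change ANY of the model's weights. Just learn a few extra 'soft prompt' vectors prepended to the input. (Prefix tuning extends the same idea by learning such vectors for every layer, not only the input.)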

Normal input:           [tok1, tok2, tok3, ...]
Prompt-tuned input:     [P1, P2, P3, ..., Pk, tok1, tok2, ...]
  where P1...Pk are LEARNED VECTORS (not real words),
  trained via gradient descent on task data.

Typical k = 10-100 soft prompts × 4096 dims ≈ 41K-410K trainable params.
Even at k = 100, that's under 0.006% of a 7B model.
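
The prepending step is just a concatenation along the sequence axis. The sketch below uses random arrays in place of real embeddings and a hypothetical k = 20; in an actual setup only `soft_prompt` would receive gradients, while the embedding table and the rest of the model stay frozen.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, k, seq_len = 4096, 20, 5

# Stand-in for embeddings looked up from the frozen embedding table
token_embeddings = rng.normal(size=(seq_len, d_model))

# The only trainable parameters: k learned vectors, not tied to any real token
soft_prompt = rng.normal(size=(k, d_model)) * 0.02

# [P1, ..., Pk, tok1, tok2, ...]
model_input = np.concatenate([soft_prompt, token_embeddings], axis=0)

print(model_input.shape)   # (25, 4096)
print(soft_prompt.size)    # 81920 trainable values
```

The model sees a sequence that is k positions longer, so soft prompts also use up a little of the context window.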

Works surprisingly well for many tasks at huge model scale (>10B params).
Weaker for smaller models — they need more parameter capacity to adapt.

No weight updates at all · a few soft tokens steer the entire frozen model

Comparison Table

Full FT: 100% of params trainable, ~28GB ΔW per task, highest quality, highest cost. Use when: small to medium model, lots of compute, task very different from pretraining.
LoRA: 0.5-1% trainable, ~30MB per task, near-full quality, low cost. Default modern choice. Stack multiple LoRA adapters for multi-task models.
QLoRA (LoRA + 4-bit base): same as LoRA but base model is quantised to 4-bit weights. Fits a 65B model on a single 48GB GPU, or roughly a 33B model on 24GB (a 70B model needs about 35GB for its 4-bit weights alone). Small accuracy hit.
Prompt tuning: 0.005-0.05% trainable, ~few MB, decent quality at 10B+ scale. Use when: many tasks to support, extreme storage constraints.
Frozen + classifier head: 0.1% trainable, smallest setup, lowest quality but cheapest. Use when: simple classification on a representation model.
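
The storage and memory figures in this comparison mostly come down to bytes per weight. The sketch below estimates the size of the base weights alone at a few precisions; it ignores quantisation metadata, adapters, activations, and optimiser state, and the table `BYTES_PER_PARAM` and the helper `weight_memory_gb` are made up for the example.

```python
BYTES_PER_PARAM = {"fp32": 4.0, "fp16": 2.0, "int8": 1.0, "4-bit": 0.5}

def weight_memory_gb(n_params, precision):
    """Memory for the base weights only, in decimal gigabytes."""
    return n_params * BYTES_PER_PARAM[precision] / 1e9

for n in (7e9, 33e9, 65e9):
    row = ", ".join(
        f"{p}: {weight_memory_gb(n, p):.1f}GB" for p in BYTES_PER_PARAM
    )
    print(f"{n / 1e9:.0f}B -> {row}")
```

Going from FP32 to 4-bit shrinks the frozen base by 8×, which is what lets a LoRA-style adapter be trained on top of a much larger model on the same hardware.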
