Deep Learning - Intermediate - 12 min

Learn Transfer Learning

Last updated: 2026-05-13.

Training a deep network from scratch on a small dataset is like starting school with no prior knowledge: you need thousands of examples just to learn basic concepts. Transfer learning inverts this. You start with a model already trained on millions of images, then adapt it to your specific task. The network has already learned to detect edges, textures, shapes, and objects, so you only need to teach it what those features mean for your problem.

Why Transfer Learning Works

Deep CNNs learn hierarchical features. Early layers learn near-universal low-level patterns (horizontal and vertical edges, colour blobs, simple textures) that are useful for almost any image task. Middle layers combine these into shapes and object parts. Only the deepest layers are strongly task-specific. Because the low-level features generalise across tasks, they can be reused as-is, and only the later layers need retraining.

Three Strategies

  • Feature Extraction: freeze all backbone layers, replace only the final classification head, train the head on your data. Best when your dataset is small (< 1,000 samples) and your domain is similar to the pre-training domain (e.g., natural images → product photos).
  • Fine-tuning: freeze early layers (edges/textures), unfreeze later layers + head, train both on your data with a small learning rate. Best when your dataset is moderate (1,000–100,000 samples) or your domain is somewhat different.
  • Full Fine-tuning: unfreeze and train all layers. Best when you have a large dataset (> 100,000 samples) and sufficient compute. Use a very small learning rate for early layers to avoid destroying the pre-trained representations.
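
The three strategies differ only in which parameters are allowed to update. The sketch below illustrates that idea in PyTorch under loud assumptions: `make_backbone` is a hypothetical stand-in for a pre-trained CNN (it has random weights here; in practice you would load real pre-trained weights), and `apply_strategy`, `count_trainable`, and the choice of which block counts as "early" are illustrative names and choices, not part of any library.

```python
import torch
from torch import nn


def make_backbone() -> nn.Sequential:
    """Hypothetical stand-in for a pre-trained CNN backbone (random weights)."""
    return nn.Sequential(
        nn.Conv2d(3, 8, kernel_size=3, padding=1),   # index 0: "early" block
        nn.ReLU(),
        nn.Conv2d(8, 16, kernel_size=3, padding=1),  # index 2: "later" block
        nn.ReLU(),
        nn.AdaptiveAvgPool2d(1),
        nn.Flatten(),
    )


def apply_strategy(backbone: nn.Sequential, strategy: str) -> None:
    """Set requires_grad on backbone parameters for the chosen strategy."""
    valid = {"feature_extraction", "fine_tuning", "full_fine_tuning"}
    if strategy not in valid:
        raise ValueError(f"unknown strategy: {strategy!r}")
    for index, layer in enumerate(backbone):
        if strategy == "feature_extraction":
            trainable = False          # freeze the whole backbone
        elif strategy == "fine_tuning":
            trainable = index >= 2     # freeze only the early block
        else:
            trainable = True           # train every layer
        for param in layer.parameters():
            param.requires_grad = trainable


def count_trainable(module: nn.Module) -> int:
    """Number of scalar parameters that will receive gradient updates."""
    return sum(p.numel() for p in module.parameters() if p.requires_grad)


backbone = make_backbone()
head = nn.Linear(16, 2)  # new classification head for a 2-class task
model = nn.Sequential(backbone, head)

# Feature extraction: only the head remains trainable.
apply_strategy(backbone, "feature_extraction")
logits = model(torch.randn(4, 3, 32, 32))
print(tuple(logits.shape), count_trainable(model))
```

Calling `apply_strategy(backbone, "fine_tuning")` or `apply_strategy(backbone, "full_fine_tuning")` afterwards re-enables gradients for the later block or for every layer. With a real backbone that contains batch-norm layers, frozen blocks are usually also switched to `eval()` mode so their running statistics stay fixed.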

Rule of Thumb: Dataset Size × Domain Similarity

               │  Similar domain   │  Different domain
───────────────┼───────────────────┼────────────────────────────
Small dataset  │  Feature extract  │  Fine-tune top layers
Large dataset  │  Fine-tune all    │  Fine-tune or from scratch
───────────────┴───────────────────┴────────────────────────────

Small + similar   → safest case: freeze the backbone, train only the head.
Small + different → unfreeze more layers, but keep the lr small so pre-trained features are not forgotten.
Large + similar   → fine-tune everything, with a 10× smaller lr for early layers.
Large + different → consider training from scratch, or very aggressive fine-tuning.

The further your domain is from the pre-training domain, the more layers need to adapt.
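
The rule of thumb above can be written as a small lookup. This is only a sketch of the table, not a substitute for validation experiments. The function name `suggest_strategy` and the `small_cutoff` parameter are hypothetical, and the default cutoff of 1,000 samples is an assumption borrowed from the feature-extraction guideline, since the table itself does not say where "small" ends.

```python
# The four cells of the dataset-size x domain-similarity table.
RULE_OF_THUMB = {
    ("small", "similar"): "feature extraction: freeze backbone, train only the head",
    ("small", "different"): "fine-tune top layers with a small learning rate",
    ("large", "similar"): "fine-tune all layers, smaller learning rate for early layers",
    ("large", "different"): "fine-tune aggressively or train from scratch",
}


def suggest_strategy(num_samples: int, domain_similar: bool,
                     small_cutoff: int = 1_000) -> str:
    """Return the table cell matching the dataset size and domain similarity.

    small_cutoff is an assumed boundary between "small" and "large";
    adjust it to your own setting.
    """
    if num_samples < 0:
        raise ValueError("num_samples must be non-negative")
    size = "small" if num_samples < small_cutoff else "large"
    domain = "similar" if domain_similar else "different"
    return RULE_OF_THUMB[(size, domain)]


# Example: 800 images from a domain unlike the pre-training data.
print(suggest_strategy(800, domain_similar=False))
```

Treat the result as a starting point: the boundary between cells is gradual in practice, so a dataset near the cutoff may be worth trying with both neighbouring strategies.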

Learning Rate for Fine-tuning

Typical fine-tuning learning rates per layer group:

  Frozen layers:     lr = 0         (no update)
  Unfrozen backbone: lr = 1e-5      (very small, to preserve features)
  New head:          lr = 1e-3      (normal, because it learns from scratch)

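PyTorch: pass one parameter group per layer group to the optimizer:
  from torch.optim import Adam

  # backbone and head are nn.Module instances defined elsewhere
  optimizer = Adam([
    {'params': backbone.parameters(), 'lr': 1e-5},  # pre-trained layers: small steps
    {'params': head.parameters(),     'lr': 1e-3},  # new head: normal steps
  ])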

Using the same lr for all layers is a common fine-tuning mistake: a head-sized lr (e.g., 1e-3) applied to the backbone can overwrite pre-trained representations within the first few batches.

Use differential learning rates: as a starting point, set the backbone lr about 100× smaller than the head lr (1e-5 vs 1e-3 above).
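
To see the three learning-rate groups in action, the following sketch runs a single optimisation step and measures how far each group moved. It is a toy setup under stated assumptions: `frozen_block`, `unfrozen_block`, and `head` are hypothetical stand-ins built from small linear layers with random weights (not a real pre-trained model), and the data is random.

```python
import torch
from torch import nn
from torch.optim import Adam

torch.manual_seed(0)

# Hypothetical stand-ins for the three layer groups.
frozen_block = nn.Linear(8, 8)    # early layers: frozen, never updated
unfrozen_block = nn.Linear(8, 8)  # later backbone layers: small lr
head = nn.Linear(8, 2)            # new head: normal lr

for param in frozen_block.parameters():
    param.requires_grad = False   # equivalent to lr = 0 for this group

# One parameter group per trainable layer group, each with its own lr.
optimizer = Adam([
    {"params": unfrozen_block.parameters(), "lr": 1e-5},
    {"params": head.parameters(), "lr": 1e-3},
])

model = nn.Sequential(frozen_block, nn.ReLU(), unfrozen_block, nn.ReLU(), head)
inputs = torch.randn(16, 8)            # random toy batch
targets = torch.randint(0, 2, (16,))   # random toy labels

# Snapshot the weights so the size of each update can be measured.
frozen_before = frozen_block.weight.detach().clone()
unfrozen_before = unfrozen_block.weight.detach().clone()
head_before = head.weight.detach().clone()

loss = nn.functional.cross_entropy(model(inputs), targets)
optimizer.zero_grad()
loss.backward()
optimizer.step()

frozen_changed = not torch.equal(frozen_before, frozen_block.weight)
unfrozen_step = (unfrozen_block.weight.detach() - unfrozen_before).abs().max().item()
head_step = (head.weight.detach() - head_before).abs().max().item()
group_lrs = [group["lr"] for group in optimizer.param_groups]

print(frozen_changed, unfrozen_step, head_step, group_lrs)
```

The frozen block stays untouched, while the largest head update is much bigger than the largest backbone update, which is the behaviour the differential rates are meant to produce.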

Common Pre-trained Models

  • Image classification backbones: ResNet-50/101 (the ResNet family won ILSVRC 2015), EfficientNet-B0 to B7, ViT (Vision Transformer). All are available in torchvision.models. Recent torchvision versions load pre-trained weights through the weights= argument (e.g., weights=ResNet50_Weights.DEFAULT), which replaces the deprecated pretrained=True.
  • Object detection: Faster R-CNN, YOLO, DETR. These are typically built on ImageNet-pretrained backbones.
  • Medical imaging: models are often fine-tuned from ImageNet despite the domain difference, because the low-level feature representations still transfer to X-rays and histology slides.
  • NLP transfer: BERT (Google, 2018) was pre-trained on 3.3B words; GPT-2/3/4 were pre-trained on web text. Fine-tuning with 1,000–10,000 labelled examples is often enough for classification, NER, or generation tasks.
  • Foundation models (2021 onward): CLIP, DINO, SAM. Pre-trained on large-scale data, with contrastive (CLIP) or self-supervised (DINO) objectives among others. They are remarkably general and often perform well zero-shot, without any fine-tuning.

Practice questions

  1. You have 800 chest X-ray images labelled as normal or abnormal. Which transfer learning strategy is most appropriate?
  2. When fine-tuning a pre-trained CNN, why should the backbone learning rate be much smaller (e.g., 1e-5) than the new head's learning rate (e.g., 1e-3)?
  3. What makes early layers of a CNN (trained on ImageNet) useful for transfer to almost any image task?
  4. You're fine-tuning BERT for sentiment analysis using 50,000 movie reviews. After 3 epochs the training loss is near zero but validation F1 score plateaus at 0.71. What is likely happening?
