Deep Learning - Intermediate - 15 min

Learn CNN — Pooling & Full Architecture


Last updated: 2026-05-13.

A convolutional layer with 'same' padding and stride 1 leaves the spatial size unchanged: applied to a 224×224 image, it still produces a 224×224 feature map per filter. Pooling layers reduce the spatial dimensions by summarising regions. Stack convolution, activation, and pooling into a pipeline and you have the full CNN: a hierarchical feature extractor that goes from raw pixels to class labels.

Max Pooling

For a p×p pooling window with stride s:

  Output[r, c] = max{ X[r·s + i, c·s + j]  for i,j ∈ [0, p) }

Output size formula:
  H_out = ⌊(H − p) / s⌋ + 1

Most common: 2×2 max pool, stride 2:
  H_out = ⌊(H−2)/2⌋+1 = ⌊H/2⌋  →  halves spatial dimensions (an odd H drops its last row/column)
  A 28×28 feature map becomes 14×14
  A 56×56 feature map becomes 28×28

No parameters — max pool has nothing to learn.

Max pooling keeps the strongest activation in each region — 'was this pattern present anywhere here?'
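
The max pooling formula and the output size rule above can be sketched in plain Python. This is a minimal illustration, not an optimised implementation: the function names `pool_output_size` and `max_pool2d` and the 4×4 feature map values are made up for the example, and the feature map is assumed to be a single channel stored as a list of rows.

```python
def pool_output_size(size, p, s):
    """Output length along one axis: floor((size - p) / s) + 1."""
    return (size - p) // s + 1


def max_pool2d(x, p=2, s=2):
    """Max-pool one 2D feature map given as a list of equal-length rows."""
    h_out = pool_output_size(len(x), p, s)
    w_out = pool_output_size(len(x[0]), p, s)
    return [
        [
            # Strongest activation inside the p x p window at (r, c).
            max(x[r * s + i][c * s + j] for i in range(p) for j in range(p))
            for c in range(w_out)
        ]
        for r in range(h_out)
    ]


# Hypothetical 4x4 feature map, chosen only for illustration.
feature_map = [
    [1, 3, 2, 0],
    [4, 6, 1, 1],
    [0, 2, 9, 5],
    [1, 1, 3, 7],
]

pooled = max_pool2d(feature_map)  # [[6, 2], [2, 9]]
```

Each output value is the largest entry of one non-overlapping 2×2 block, so the 4×4 input becomes 2×2, matching `pool_output_size(4, 2, 2) == 2`.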

Average Pooling

  Output[r, c] = (1/p²) × Σᵢ Σⱼ X[r·s + i, c·s + j]

Average pooling takes the mean of the window instead of the max.
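
As a rough sketch of the average pooling formula, the snippet below averages each window of a single-channel feature map. The name `avg_pool2d` and the input values are assumptions made for illustration only.

```python
def avg_pool2d(x, p=2, s=2):
    """Average-pool one 2D feature map given as a list of equal-length rows."""
    h_out = (len(x) - p) // s + 1
    w_out = (len(x[0]) - p) // s + 1
    return [
        [
            # Mean of the p x p window: sum of entries divided by p^2.
            sum(x[r * s + i][c * s + j] for i in range(p) for j in range(p))
            / (p * p)
            for c in range(w_out)
        ]
        for r in range(h_out)
    ]


# Hypothetical 4x4 feature map, chosen only for illustration.
feature_map = [
    [1, 3, 2, 0],
    [4, 6, 1, 1],
    [0, 2, 9, 5],
    [1, 1, 3, 7],
]

averaged = avg_pool2d(feature_map)  # [[3.5, 1.0], [1.0, 6.0]]
```

Compared with taking the max, a single large activation (the 9 in the bottom-right block) is diluted by its neighbours rather than passed through unchanged.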

Global Average Pooling (GAP) — a special case used in modern networks:
  Takes the average of the ENTIRE feature map (one value per channel)
  A 7×7×512 feature map → 1×1×512 → flatten to 512-vector

Modern ResNets, EfficientNets, and MobileNets all use GAP
before the final classifier instead of large FC layers.
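  Example: for a 7×7×512 final map and 1,000 classes (the ResNet-18/34 shape), flatten + FC needs 7 × 7 × 512 × 1000 ≈ 25M weights; GAP + FC needs only 512 × 1000 ≈ 0.5M. ResNet-50 ends in a 7×7×2048 map, so its classifier after GAP has about 2M parameters.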

GAP is especially resistant to overfitting — no extra parameters to memorise training data
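
To make the GAP idea concrete, here is a small sketch that reduces each channel to its mean and then compares classifier sizes with and without GAP. The function names, the tiny two-channel input, and the choice of 1,000 output classes are all assumptions for illustration; the 7×7×512 shape comes from the example above.

```python
def global_average_pool(feature_map):
    """feature_map[channel][row][col] -> one mean value per channel."""
    return [
        sum(sum(row) for row in channel) / (len(channel) * len(channel[0]))
        for channel in feature_map
    ]


def classifier_params(in_features, num_classes):
    """Weights plus biases of a single fully connected layer."""
    return in_features * num_classes + num_classes


# Two hypothetical 2x2 channels.
fmap = [
    [[1.0, 2.0], [3.0, 4.0]],
    [[0.0, 0.0], [8.0, 0.0]],
]
pooled = global_average_pool(fmap)  # [2.5, 2.0]

num_classes = 1000  # assumed class count, not taken from the lesson
flatten_params = classifier_params(7 * 7 * 512, num_classes)  # 25,089,000
gap_params = classifier_params(512, num_classes)              # 513,000
```

Under these assumptions the classifier after GAP is 49 times smaller in its weight matrix, because the 7×7 spatial positions no longer each need their own weights.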

Full CNN Architecture: the Pipeline

Typical CNN pipeline stage:

  [Conv k×k, C filters] → [BatchNorm] → [ReLU] → [MaxPool 2×2]

Full example (LeNet-5, 1998):
  Input 32×32×1
  → Conv 5×5, 6 filters  → 28×28×6    (no padding)
  → Tanh                  → 28×28×6
  → AvgPool 2×2           → 14×14×6    ← spatial ÷2
  → Conv 5×5, 16 filters → 10×10×16
  → Tanh                  → 10×10×16
  → AvgPool 2×2           → 5×5×16     ← spatial ÷2
  → Flatten               → 400
  → FC 120 → FC 84 → Softmax 10

Total parameters: ~60,000 — a complete digit classifier in 60K numbers.

LeNet-5 introduced the Conv→Pool→Conv→Pool→FC pattern still used today
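
The shapes and the roughly 60K parameter total in the LeNet-5 listing can be checked with a short script. This is a simplified sketch: it assumes every filter connects to all input channels and that the pooling layers have no parameters, so the total will not match the original paper exactly. All function names are made up for the example.

```python
def conv_out(size, k, stride=1, padding=0):
    """Spatial output size of a convolution along one axis."""
    return (size + 2 * padding - k) // stride + 1


def pool_out(size, p=2, s=2):
    """Spatial output size of a pooling layer along one axis."""
    return (size - p) // s + 1


def conv_params(c_in, k, c_out):
    return c_in * k * k * c_out + c_out  # weights + biases


def fc_params(n_in, n_out):
    return n_in * n_out + n_out  # weights + biases


size, channels = 32, 1  # 32x32 grayscale input
shapes = []
total = 0

# Two Conv 5x5 (no padding) + 2x2 pooling stages.
for k, filters in [(5, 6), (5, 16)]:
    total += conv_params(channels, k, filters)
    size, channels = conv_out(size, k), filters
    shapes.append((size, size, channels))
    size = pool_out(size)
    shapes.append((size, size, channels))

flat = size * size * channels  # 5 * 5 * 16 = 400

# FC 120 -> FC 84 -> 10 outputs.
n_in = flat
for n_out in (120, 84, 10):
    total += fc_params(n_in, n_out)
    n_in = n_out

# shapes == [(28, 28, 6), (14, 14, 6), (10, 10, 16), (5, 5, 16)]
# total == 61,706
```

Almost all of the parameters sit in the first fully connected layer (400 × 120 + 120 = 48,120), while the two conv layers together account for fewer than 3,000.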

How Spatial Dimensions Change

  • Each Conv layer (no padding) shrinks the spatial size by (k−1) — a 3×3 conv on 28×28 → 26×26.
  • Each Conv layer (same padding) preserves spatial size — most modern CNNs use padding='same'.
  • Each MaxPool 2×2 (stride=2) halves spatial dimensions — 224→112→56→28→14→7.
  • Depth (channel count) grows as we go deeper: 3 → 64 → 128 → 256 → 512 is a typical pattern.
  • The final spatial size × channel count = the length of the flattened vector fed to the classifier (e.g. 7 × 7 × 512 = 25,088).
  • Smaller spatial + more channels = more abstract, richer representations for classification.
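
The rules in the list above can be traced through a stack of layers with a few lines of Python. This is only a sketch: the helper names are hypothetical, and the five-stage layout with channel counts 64, 128, 256, 512, 512 is an assumed VGG-style example rather than a specific network from this lesson.

```python
def conv_out(size, k, stride=1, padding=0):
    """Spatial output size of a convolution along one axis."""
    return (size + 2 * padding - k) // stride + 1


def same_padding(k):
    """Padding that preserves size for an odd kernel at stride 1."""
    return (k - 1) // 2


# No padding: a 3x3 conv shrinks 28 -> 26, i.e. by k - 1.
valid_size = conv_out(28, k=3)

size, channels = 224, 3  # RGB input
trace = [(size, channels)]
for filters in [64, 128, 256, 512, 512]:  # assumed channel schedule
    size = conv_out(size, k=3, padding=same_padding(3))  # size unchanged
    size = (size - 2) // 2 + 1                           # 2x2 max pool, stride 2
    channels = filters
    trace.append((size, channels))

flattened = size * size * channels

# trace == [(224, 3), (112, 64), (56, 128), (28, 256), (14, 512), (7, 512)]
# flattened == 25,088
```

The spatial size falls by a factor of 32 while the channel count rises from 3 to 512, which is the "smaller spatial, more channels" trade described in the list.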

Parameter Count: Conv vs FC

Conv layer: C_in × k × k × C_out + C_out biases
  Example: 64 → 3×3 → 128:
    64 × 3 × 3 × 128 + 128 = 73,856 params

FC layer with same input (after 7×7×64 feature map):
  7 × 7 × 64 × 128 = 401,408 weights (biases excluded), about 5.4× more!
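
The conv and FC counts above are easy to reproduce, and doing so shows one thing the arithmetic alone hides: the conv count does not depend on the feature map size, while the FC count does. The helper names below are made up for this sketch, and the 14×14 case is an extra assumed example.

```python
def conv_params(c_in, k, c_out, bias=True):
    """Parameters of a k x k convolution: independent of image size."""
    return c_in * k * k * c_out + (c_out if bias else 0)


def fc_params(n_in, n_out, bias=True):
    """Parameters of a fully connected layer."""
    return n_in * n_out + (n_out if bias else 0)


conv = conv_params(64, 3, 128)                     # 73,856
fc_7x7 = fc_params(7 * 7 * 64, 128, bias=False)    # 401,408
fc_14x14 = fc_params(14 * 14 * 64, 128, bias=False)  # 1,605,632

ratio = fc_7x7 / conv  # about 5.4
```

Doubling the feature map side from 7 to 14 quadruples the FC weight count, whereas the conv layer stays at 73,856 parameters at any resolution.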

Why convolutions win:
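  The same 3×3 kernel scans the entire 224×224 image.
  Weight sharing cuts parameters by orders of magnitude: a dense layer from a 224×224×3 image (150,528 inputs) to 1,000 units needs about 150.5M weights, roughly 87,000× the 1,728 weights of a 3×3 conv with 64 filters on the same image.
  Fewer parameters = less risk of overfitting, a smaller model, and often faster inference.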

Conv replaces spatial FC connections — each kernel is one tiny shared FC over a 3×3 patch
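
The "tiny shared FC over a 3×3 patch" view can be written out directly: flatten the kernel into 9 weights, then take the dot product of those same weights with every 3×3 patch. The sketch below assumes a single input channel, no padding, stride 1, and the cross-correlation convention used by deep learning frameworks; the function name and the input values are illustrative only.

```python
def conv2d_single(x, kernel, bias=0.0):
    """Single-channel conv (no padding, stride 1) as a shared dot product."""
    k = len(kernel)
    h_out = len(x) - k + 1
    w_out = len(x[0]) - k + 1
    # The k*k weights reused at every spatial position.
    weights = [w for row in kernel for w in row]
    out = []
    for r in range(h_out):
        out_row = []
        for c in range(w_out):
            # Flatten the patch under the kernel, row by row.
            patch = [x[r + i][c + j] for i in range(k) for j in range(k)]
            # A one-unit "FC layer" applied to this patch.
            out_row.append(sum(w * v for w, v in zip(weights, patch)) + bias)
        out.append(out_row)
    return out


# Hypothetical 4x4 input and a vertical-edge style kernel.
image = [
    [1, 2, 3, 0],
    [0, 1, 2, 3],
    [3, 0, 1, 2],
    [2, 3, 0, 1],
]
kernel = [
    [1, 0, -1],
    [1, 0, -1],
    [1, 0, -1],
]

result = conv2d_single(image, kernel)  # [[-2.0, -2.0], [2.0, -2.0]]
```

All four outputs reuse the same 9 weights. A dense layer producing the same 2×2 output from the 16 input pixels would need 4 × 16 = 64 separate weights.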

Practice questions

  1. A 14×14 feature map passes through 2×2 max pooling with stride=2. What is the output size?
  2. Why does max pooling provide a degree of invariance to small translations?
  3. What is Global Average Pooling and why do modern CNNs prefer it over fully connected layers?
  4. A CNN has architecture: 3×3 Conv (64 filters) → 3×3 Conv (128 filters). The first layer has 64 input channels. How many parameters are in the SECOND conv layer?
