Learn CNN: Pooling & Full Architecture

A convolutional layer with 'same' padding and stride 1 produces feature maps as large as its input: applied to a 224×224 image, it still outputs 224×224 maps. Pooling layers reduce the spatial dimensions by summarising regions. Stack convolution, activation, and pooling into a pipeline and you have the full CNN, a hierarchical feature extractor that goes from raw pixels to class labels.

Max Pooling

For a p×p pooling window with stride s:

  Output[r, c] = max{ X[r·s + i, c·s + j]  for i,j ∈ [0, p) }

Output size formula (the width W follows the same rule):
  H_out = ⌊(H − p) / s⌋ + 1

Most common: 2×2 max pool, stride 2:
  H_out = ⌊(H−2)/2⌋+1 = ⌊H/2⌋  →  halves spatial dimensions (exactly, when H is even)
  A 28×28 feature map becomes 14×14
  A 56×56 feature map becomes 28×28
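
The output-size formula is easy to check numerically. The sketch below is a minimal helper (the name pool_output_size is made up for this illustration) that evaluates ⌊(H − p) / s⌋ + 1 for the two examples above plus one odd-sized input.

```python
def pool_output_size(h: int, p: int, s: int) -> int:
    """Output length along one axis: floor((h - p) / s) + 1."""
    if h < p:
        raise ValueError("pooling window is larger than the input")
    return (h - p) // s + 1


# 2x2 window, stride 2: even sizes halve exactly, odd sizes round down.
for h in (28, 56, 7):
    print(h, "->", pool_output_size(h, p=2, s=2))
# 28 -> 14
# 56 -> 28
# 7 -> 3
```

With an odd input such as 7, the last row and column are simply dropped by this formula.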

No parameters — max pool has nothing to learn.

Max pooling keeps the strongest activation in each region — 'was this pattern present anywhere here?'
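
The max-pooling formula above can be sketched directly in plain Python. This is an illustrative single-channel version on nested lists, not how a deep learning library implements it; the function name max_pool2d and the sample feature map are assumptions made for the example.

```python
def max_pool2d(x, p=2, s=2):
    """Max-pool a 2D list of numbers with a p x p window and stride s."""
    h, w = len(x), len(x[0])
    h_out = (h - p) // s + 1
    w_out = (w - p) // s + 1
    return [
        [
            # Output[r, c] = max over the window starting at (r*s, c*s)
            max(x[r * s + i][c * s + j] for i in range(p) for j in range(p))
            for c in range(w_out)
        ]
        for r in range(h_out)
    ]


feature_map = [
    [1, 3, 2, 0],
    [4, 2, 1, 5],
    [0, 1, 7, 2],
    [3, 2, 1, 6],
]
print(max_pool2d(feature_map))  # [[4, 5], [3, 7]]
```

Each output value is the strongest activation in its 2×2 region, and the function holds no weights of any kind.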

Average Pooling

  Output[r, c] = (1/p²) × Σᵢ Σⱼ X[r·s + i, c·s + j]

Average pooling takes the mean of the window instead of the max.
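
For comparison, here is the same kind of sketch for average pooling. It is a plain-Python illustration on a single channel; avg_pool2d and the sample values are hypothetical names and data chosen for this example.

```python
def avg_pool2d(x, p=2, s=2):
    """Average-pool a 2D list of numbers with a p x p window and stride s."""
    h, w = len(x), len(x[0])
    h_out = (h - p) // s + 1
    w_out = (w - p) // s + 1
    return [
        [
            # Output[r, c] = (1 / p^2) * sum over the window at (r*s, c*s)
            sum(x[r * s + i][c * s + j] for i in range(p) for j in range(p)) / (p * p)
            for c in range(w_out)
        ]
        for r in range(h_out)
    ]


feature_map = [
    [1, 3, 2, 0],
    [4, 2, 1, 5],
    [0, 1, 7, 2],
    [3, 2, 1, 6],
]
print(avg_pool2d(feature_map))  # [[2.5, 2.0], [1.5, 4.0]]
```

Where max pooling would report the single peak of each window, the mean blends all four values, so one strong activation is diluted by its neighbours.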

Global Average Pooling (GAP) — a special case used in modern networks:
  Takes the average of the ENTIRE feature map (one value per channel)
  A 7×7×512 feature map → 1×1×512 → flatten to 512-vector

Modern ResNets, EfficientNets, and MobileNets all use GAP
before the final classifier instead of large FC layers.
  ResNet-18/34: flattening the final 7×7×512 map into a 1000-class FC layer would take 7×7×512×1000 ≈ 25M weights; with GAP the classifier needs only 512×1000 ≈ 0.5M.

GAP is especially resistant to overfitting: it adds no parameters that could memorise training data.
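
A rough sketch of GAP and of the parameter saving it buys. The function name global_average_pool, the tiny 2×2×3 feature map, and the 1000-class classifier are assumptions for illustration; the 7×7×512 shape is the one used in the example above.

```python
def global_average_pool(x):
    """x is an H x W x C nested list; returns one mean per channel."""
    h, w, c = len(x), len(x[0]), len(x[0][0])
    return [
        sum(x[r][col][ch] for r in range(h) for col in range(w)) / (h * w)
        for ch in range(c)
    ]


# Tiny 2x2 feature map with 3 channels (values are made up).
fmap = [
    [[1.0, 0.0, 2.0], [3.0, 0.0, 2.0]],
    [[5.0, 4.0, 2.0], [7.0, 0.0, 2.0]],
]
print(global_average_pool(fmap))  # [4.0, 1.0, 2.0]

# Classifier weights after a 7x7x512 feature map, assuming 1000 classes.
H, W, C, NUM_CLASSES = 7, 7, 512, 1000
flatten_weights = H * W * C * NUM_CLASSES  # FC on the flattened map
gap_weights = C * NUM_CLASSES              # FC on the GAP vector
print(f"{flatten_weights:,} vs {gap_weights:,}")  # 25,088,000 vs 512,000
```

GAP itself contributes no weights; the saving comes entirely from the smaller input to the final FC layer.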

Full CNN Architecture: the Pipeline

Typical CNN pipeline stage:

  [Conv k×k, C filters] → [BatchNorm] → [ReLU] → [MaxPool 2×2]

Full example (LeNet-5, 1998):
  Input 32×32×1
  → Conv 5×5, 6 filters  → 28×28×6    (no padding)
  → Tanh                  → 28×28×6
  → AvgPool 2×2           → 14×14×6    ← spatial ÷2
  → Conv 5×5, 16 filters → 10×10×16
  → Tanh                  → 10×10×16
  → AvgPool 2×2           → 5×5×16     ← spatial ÷2
  → Flatten               → 400
  → FC 120 → FC 84 → FC 10 + Softmax

  (The 1998 network pooled by averaging; many modern reimplementations swap in max pooling and a softmax output.)

Total parameters: ~60,000 — a complete digit classifier in 60K numbers.
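
The ~60K figure can be reproduced approximately by counting weights and biases layer by layer. The sketch below assumes the simplified, fully connected variant listed above (every filter of the second conv sees all 6 input channels); the 1998 network differs in a few details, so its exact count is not identical. Helper names are hypothetical.

```python
def conv_params(c_in, k, c_out):
    """Weights plus one bias per output channel."""
    return c_in * k * k * c_out + c_out


def fc_params(n_in, n_out):
    """Weights plus one bias per output unit."""
    return n_in * n_out + n_out


layers = [
    ("conv1 5x5, 1->6", conv_params(1, 5, 6)),
    ("conv2 5x5, 6->16", conv_params(6, 5, 16)),
    ("fc1 400->120", fc_params(400, 120)),
    ("fc2 120->84", fc_params(120, 84)),
    ("out 84->10", fc_params(84, 10)),
]
total = sum(n for _, n in layers)

for name, n in layers:
    print(f"{name:18s} {n:>6,}")
print(f"{'total':18s} {total:>6,}")  # total 61,706
```

Pooling and activation layers contribute nothing here, and the first FC layer alone (48,120) accounts for most of the total.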

LeNet-5 introduced the Conv→Pool→Conv→Pool→FC pattern still used today

How Spatial Dimensions Change

Each Conv layer (no padding, stride 1) shrinks the spatial size by (k−1): a 3×3 conv on 28×28 gives 26×26.
Each Conv layer (same padding) preserves spatial size — most modern CNNs use padding='same'.
Each MaxPool 2×2 (stride=2) halves spatial dimensions — 224→112→56→28→14→7.
Depth (channel count) grows as we go deeper: 3 → 64 → 128 → 256 → 512 is a typical pattern.
The final spatial size × channel count gives the length of the flattened vector fed to the classifier, e.g. 7×7×512 = 25,088.
Smaller spatial + more channels = more abstract, richer representations for classification.
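
These rules can be traced in a few lines. The sketch assumes a VGG-like stack of five stages, each a 3×3 'same' convolution followed by a 2×2 max pool; the final repeat of 512 channels and the helper names are assumptions made for the example.

```python
def conv_out(h, k, pad=0, stride=1):
    """Spatial size after a convolution along one axis."""
    return (h + 2 * pad - k) // stride + 1


def pool_out(h, p=2, s=2):
    """Spatial size after pooling along one axis."""
    return (h - p) // s + 1


print(conv_out(28, k=3))         # 26: no padding shrinks by k - 1
print(conv_out(28, k=3, pad=1))  # 28: 'same' padding for a 3x3 kernel

size, channels = 224, 3
trace = [(size, channels)]
for c_out in (64, 128, 256, 512, 512):
    size = conv_out(size, k=3, pad=1)  # conv keeps the spatial size
    size = pool_out(size)              # pool halves it
    channels = c_out
    trace.append((size, channels))

print(trace)
# [(224, 3), (112, 64), (56, 128), (28, 256), (14, 512), (7, 512)]
print(size * size * channels)  # 25088 values in the flattened vector
```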

Parameter Count: Conv vs FC

Conv layer: C_in × k × k × C_out + C_out biases
  Example: 64 → 3×3 → 128:
    64 × 3 × 3 × 128 + 128 = 73,856 params

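FC layer on the same input (a flattened 7×7×64 feature map) with 128 outputs:
  7 × 7 × 64 × 128 = 401,408 weights (+ 128 biases), about 5.4× more, and it yields only a single 128-vector instead of a full 7×7×128 map.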
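
A short calculation makes the comparison concrete and shows how it scales. The conv count does not depend on the feature-map size, while the FC count grows with it; the 14×14 and 28×28 rows are extra illustrative sizes not taken from the text, and the helper names are hypothetical.

```python
def conv_layer_params(c_in, k, c_out):
    """Conv weights plus biases; independent of the spatial size."""
    return c_in * k * k * c_out + c_out


def fc_layer_weights(h, w, c_in, n_out):
    """FC weights on a flattened h x w x c_in map (biases omitted)."""
    return h * w * c_in * n_out


conv = conv_layer_params(64, 3, 128)
print(f"conv 64 -> 3x3 -> 128: {conv:,}")  # 73,856

for side in (7, 14, 28):
    fc = fc_layer_weights(side, side, 64, 128)
    print(f"FC on {side}x{side}x64 -> 128: {fc:,} ({fc / conv:.1f}x conv)")
# FC on 7x7x64 -> 128: 401,408 (5.4x conv)
# FC on 14x14x64 -> 128: 1,605,632 (21.7x conv)
# FC on 28x28x64 -> 128: 6,422,528 (87.0x conv)
```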

Why convolutions win:
  Same 3×3 kernel scans the entire 224×224 image.
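  Weight sharing: a dense layer mapping 7×7×64 to the same 7×7×128 output would need 3,136 × 6,272 ≈ 19.7M weights, about 266× more than the conv layer, and the gap widens as the feature map grows.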
  Fewer parameters = less overfitting, smaller model, faster inference.

Conv replaces spatial FC connections — each kernel is one tiny shared FC over a 3×3 patch
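
The "tiny shared FC" view can be sketched as follows: flatten the kernel into one weight vector and apply that same vector to every 3×3 patch. This is a single-channel, no-padding, stride-1 illustration (computing cross-correlation, as deep learning libraries do); the function name, image, and kernel values are made up for the example.

```python
def conv2d_valid(x, kernel):
    """Slide one shared k*k weight vector over every k x k patch of x."""
    k = len(kernel)
    weights = [w for row in kernel for w in row]  # the shared "tiny FC"
    h_out = len(x) - k + 1
    w_out = len(x[0]) - k + 1
    out = []
    for r in range(h_out):
        row = []
        for c in range(w_out):
            patch = [x[r + i][c + j] for i in range(k) for j in range(k)]
            # Same weights at every position: a dot product with the patch.
            row.append(sum(w * v for w, v in zip(weights, patch)))
        out.append(row)
    return out


image = [
    [1, 2, 3, 0],
    [0, 1, 2, 3],
    [3, 0, 1, 2],
    [2, 3, 0, 1],
]
kernel = [
    [1, 0, -1],
    [1, 0, -1],
    [1, 0, -1],
]
print(conv2d_valid(image, kernel))  # [[-2, -2], [2, -2]]
```

Only 9 weights are stored no matter how large the image is; a larger input just means the same dot product is evaluated at more positions.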
