A convolutional layer slides a small filter (kernel) across the input, computing a dot product between the kernel and the patch beneath it at each position. The result is a feature map showing where in the image the pattern encoded by that kernel appears. A layer applies many kernels in parallel, and stacking layers lets the network detect edges, then textures, then shapes, with each layer building more abstract features from the previous one's output.
The Convolution Operation
For an input X (H×W) and kernel K (k×k), with stride s and indices i, j running from 0 to k−1:
Output[r, c] = Σᵢ Σⱼ K[i,j] · X[r·s + i, c·s + j]
Output size formula:
H_out = ⌊(H − k) / s⌋ + 1
W_out = ⌊(W − k) / s⌋ + 1
Example: 7×7 input, 3×3 kernel, stride=1:
H_out = ⌊(7−3)/1⌋+1 = 5 → 5×5 feature map
Example: 7×7 input, 3×3 kernel, stride=2:
H_out = ⌊(7−3)/2⌋+1 = 3 → 3×3 feature map
Stride controls how far the kernel jumps at each step; stride=2 roughly halves each spatial dimension (here 7 → 3).
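The formula and both worked examples above can be checked with a minimal NumPy sketch. The function name conv2d_valid, the 7×7 input, and the averaging kernel are illustrative assumptions, not part of any library, and the explicit loops favour clarity over speed.

```python
import numpy as np

def conv2d_valid(x, kernel, stride=1):
    """Naive 2D convolution without padding.

    Implements Output[r, c] = sum over i, j of K[i, j] * X[r*s + i, c*s + j].
    """
    h, w = x.shape
    k = kernel.shape[0]  # assumes a square k x k kernel
    h_out = (h - k) // stride + 1
    w_out = (w - k) // stride + 1
    out = np.zeros((h_out, w_out))
    for r in range(h_out):
        for c in range(w_out):
            # Patch of the input currently covered by the kernel.
            patch = x[r * stride : r * stride + k, c * stride : c * stride + k]
            out[r, c] = np.sum(patch * kernel)
    return out

x = np.arange(49, dtype=float).reshape(7, 7)  # hypothetical 7x7 input
kernel = np.ones((3, 3)) / 9.0                # hypothetical averaging kernel

print(conv2d_valid(x, kernel, stride=1).shape)  # (5, 5)
print(conv2d_valid(x, kernel, stride=2).shape)  # (3, 3)
```

In practice a framework's built-in convolution would replace these loops; the sketch only mirrors the summation and the output size formula.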
What Kernels Learn to Detect
- Early layers (raw pixels → edges): kernels learn simple patterns — horizontal edges, vertical edges, diagonal gradients, colour blobs.
- Middle layers (edges → shapes): kernels combine edge activations into corners, circles, simple textures, local patterns.
- Deep layers (shapes → objects): kernels build representations of eyes, wheels, fur — entire semantic concepts.
This hierarchical feature learning is not hand-coded: the kernels are learned from data by backpropagation.
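To make the idea of an early-layer edge detector concrete, the sketch below applies a vertical-edge kernel to a tiny synthetic image. The Sobel-like kernel values and the 6×6 image are assumptions chosen for illustration; in a trained CNN these weights would be learned from data, not written by hand.

```python
import numpy as np

# Hypothetical 6x6 image: dark left half (0), bright right half (1).
image = np.zeros((6, 6))
image[:, 3:] = 1.0

# Hand-written Sobel-like kernel that responds to left-to-right brightness increases.
vertical_edge = np.array([[-1.0, 0.0, 1.0],
                          [-2.0, 0.0, 2.0],
                          [-1.0, 0.0, 1.0]])

k = vertical_edge.shape[0]
h_out = image.shape[0] - k + 1
w_out = image.shape[1] - k + 1
feature_map = np.zeros((h_out, w_out))
for r in range(h_out):
    for c in range(w_out):
        # Dot product of the kernel with the patch it currently covers.
        feature_map[r, c] = np.sum(image[r:r + k, c:c + k] * vertical_edge)

# First row of the feature map: values 0, 4, 4, 0.
print(feature_map[0])
```

The response is nonzero only where the 3×3 window straddles the dark-to-bright boundary, which is the sense in which a feature map shows where a pattern appears.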
Parameter Sharing — Why CNNs are Efficient
Fully connected layer (224×224×3 RGB image → 1000 neurons):
Parameters = 224 × 224 × 3 × 1000 = 150,528,000 (weights only, biases ignored)
Convolutional layer (same image, 64 kernels of 3×3×3):
Parameters = 3 × 3 × 3 × 64 = 1,728 (weights only, biases ignored)
The same kernel weights are used at every position → parameter sharing.
Result: about 87,000× fewer parameters (150,528,000 / 1,728 ≈ 87,111), while the 64 kernels still respond to their patterns at every image position.
Fewer parameters = less overfitting, less memory, faster training.
Parameter sharing is the core advantage of convolution over fully-connected layers for images.
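The parameter counts above are plain arithmetic and can be reproduced in a few lines. The variable names are illustrative, and the sketch assumes weights only, with no bias terms, matching the figures in the text.

```python
# Fully connected: every input value connects to every output neuron.
height, width, channels = 224, 224, 3
fc_neurons = 1000
fc_params = height * width * channels * fc_neurons

# Convolutional: each of the 64 kernels has 3*3*3 weights, reused at every position.
kernel_size, num_kernels = 3, 64
conv_params = kernel_size * kernel_size * channels * num_kernels

ratio = fc_params / conv_params

print(fc_params)     # 150528000
print(conv_params)   # 1728
print(round(ratio))  # 87111
```

Note that the convolutional count does not depend on the image size at all, which is the direct consequence of reusing the same kernel weights at every position.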
Padding
Without padding, the output shrinks with every layer. With a 3×3 kernel and stride=1:
7×7 → 5×5 → 3×3 → 1×1 (only 3 convolutions possible!)
With same padding (padding = ⌊k/2⌋ for odd k, so 1 for k=3):
Output size = Input size (for stride=1)
7×7 → 7×7 → 7×7 → ... (can go as deep as needed)
Padding adds zeros around the input border.
Most modern CNNs use padding='same' to preserve spatial dimensions.
Padding='same' keeps spatial size constant, which is essential for deep networks.
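The shrinking and size-preserving sequences above can be reproduced with a short sketch. It assumes the standard generalisation of the output size formula to padding p, namely ⌊(n + 2p − k) / s⌋ + 1. The helper name conv_output_size is hypothetical, and np.pad is used only to show what adding a border of zeros does to the input shape.

```python
import numpy as np

def conv_output_size(n, k, stride=1, padding=0):
    """Output length along one dimension: floor((n + 2p - k) / s) + 1."""
    return (n + 2 * padding - k) // stride + 1

# Without padding, a 7-wide input shrinks by 2 with each 3x3 layer.
size = 7
sizes_valid = [size]
while size >= 3:  # a 3x3 kernel no longer fits once the input is smaller than 3
    size = conv_output_size(size, k=3)
    sizes_valid.append(size)
print(sizes_valid)  # [7, 5, 3, 1]

# With padding = k // 2 = 1 and stride 1, the size is preserved.
same_size = conv_output_size(7, k=3, padding=1)
print(same_size)  # 7

# Zero padding itself: a border of zeros one pixel wide on every side.
x = np.ones((7, 7))
x_padded = np.pad(x, pad_width=1, mode="constant", constant_values=0)
print(x_padded.shape)  # (9, 9)
```

A 9×9 padded input convolved with a 3×3 kernel at stride 1 gives 9 − 3 + 1 = 7, which is why the 7×7 size survives each layer.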