Before a model can do anything visual, you need to remember one thing: a digital image is just a grid of numbers. There is no 'pixel' in the model's view — only an integer between 0 and 255 telling it how much red, green, or blue lives at that position. Once you internalise this, every later concept (filters, edges, segmentation, detection) is just clever math on this grid.
The Image Tensor: H × W × C
A color image is a 3D tensor:
shape = [Height, Width, Channels]
For a typical RGB photo:
• Height: rows (e.g., 1080)
• Width: columns (e.g., 1920)
• Channels: 3 (red, green, blue)
Each element image[y][x][c] is a number, traditionally:
• uint8 — integer in [0, 255]
• float — normalised to [0, 1] or [−1, +1] before feeding to a neural net
Special cases:
• Grayscale: 1 channel — just brightness
• RGBA: 4 channels — RGB + alpha (transparency)
• Depth/IR/multi-spectral: arbitrary channel count (drone imagery can have 7-12)
image[y][x][c] → one number · the entire 'photo' is just millions of these
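To make the H × W × C layout concrete, here is a minimal sketch using NumPy (an assumption; the text does not name a library). The array is a synthetic all-black 1080p frame rather than a real photo, and the variable names are illustrative only.

```python
import numpy as np

# A hypothetical 1080p RGB image, all zeros (black), stored as uint8.
height, width, channels = 1080, 1920, 3
image = np.zeros((height, width, channels), dtype=np.uint8)

# Set the pixel at row y=10, column x=20 to pure red.
image[10, 20] = [255, 0, 0]

print(image.shape)       # (1080, 1920, 3)
print(image[10, 20, 0])  # 255, the red value at (y=10, x=20)
print(image.size)        # 6220800 numbers in total

# Grayscale and RGBA variants differ only in the channel count.
gray = np.zeros((height, width, 1), dtype=np.uint8)
rgba = np.zeros((height, width, 4), dtype=np.uint8)
```

Note that the first index is the row (y) and the second is the column (x), matching image[y][x][c] above.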
RGB Color: Three Numbers Per Pixel
- Pure red: [255, 0, 0] — full red, no green, no blue.
- Pure white: [255, 255, 255] — all channels max.
- Pure black: [0, 0, 0] — all channels zero.
- Yellow: [255, 255, 0] — red + green, no blue.
- A soft pink: [240, 180, 200] — high R, medium G, medium-high B.
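The colour triples above can be written directly as arrays. The following is a small sketch, assuming NumPy; the dictionary name `colors` and the one-row "strip" image are hypothetical and only serve the illustration.

```python
import numpy as np

# Each colour is three uint8 numbers in [R, G, B] order.
colors = {
    "red":    np.array([255, 0, 0], dtype=np.uint8),
    "white":  np.array([255, 255, 255], dtype=np.uint8),
    "black":  np.array([0, 0, 0], dtype=np.uint8),
    "yellow": np.array([255, 255, 0], dtype=np.uint8),
    "pink":   np.array([240, 180, 200], dtype=np.uint8),
}

# Lay the five colours side by side as a 1 x 5 image (H=1, W=5, C=3).
strip = np.stack(list(colors.values()))[np.newaxis, :, :]
print(strip.shape)  # (1, 5, 3)

# Yellow is full red plus full green, with no blue.
green = np.array([0, 255, 0], dtype=np.uint8)
print(np.array_equal(colors["red"] + green, colors["yellow"]))  # True
```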
Pre-processing for Neural Networks
Raw images need preparation before feeding to a model:
1. Resize: rescale to a fixed dimension (e.g., 224×224 for ImageNet models).
Done with bilinear/bicubic interpolation.
2. Convert to float: divide by 255 to map [0, 255] → [0, 1].
3. Normalise: standardise each channel using dataset statistics (computed on the [0, 1] values from step 2):
x_normalised = (x − mean) / std
ImageNet means: [0.485, 0.456, 0.406]
ImageNet stds: [0.229, 0.224, 0.225]
This puts each channel near mean 0, variance 1 — what the model expects.
4. Permute: reorder axes if needed (HWC ↔ CHW, depending on the framework).
5. Batch: stack N images into a 4D tensor [N, C, H, W] for parallel GPU processing.
Most ImageNet-pretrained CNNs expect this pre-processing; get it wrong and accuracy can drop sharply.
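Steps 2 to 5 can be sketched in plain NumPy as follows. This is an illustrative outline, not any framework's implementation: the function name `preprocess` is hypothetical, the input images are random stand-ins, and step 1 (resizing) is assumed to have been done already, since it normally needs an image library.

```python
import numpy as np

# ImageNet per-channel statistics, defined on values scaled to [0, 1].
IMAGENET_MEAN = np.array([0.485, 0.456, 0.406], dtype=np.float32)
IMAGENET_STD = np.array([0.229, 0.224, 0.225], dtype=np.float32)

def preprocess(image_hwc_uint8):
    """Turn a uint8 HWC image into a normalised float32 CHW array."""
    x = image_hwc_uint8.astype(np.float32) / 255.0  # step 2: [0, 255] -> [0, 1]
    x = (x - IMAGENET_MEAN) / IMAGENET_STD          # step 3: broadcasts over the channel axis
    return np.transpose(x, (2, 0, 1))               # step 4: HWC -> CHW

# Four random stand-in images, assumed already resized to 224x224 (step 1).
rng = np.random.default_rng(0)
images = [rng.integers(0, 256, size=(224, 224, 3), dtype=np.uint8) for _ in range(4)]

# Step 5: stack into a single [N, C, H, W] batch.
batch = np.stack([preprocess(img) for img in images])
print(batch.shape)  # (4, 3, 224, 224)
```

The mean and std arrays have shape (3,), so they line up with the last axis only while the image is still in HWC order. That is why this sketch normalises before permuting.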
What Each Channel Captures
- Red channel: bright on warm objects (skin, fire, sunset). Skin tones in particular are strongest in the red channel.
- Green channel: bright on plants, grass, signs (eyes evolved to be most sensitive to green).
- Blue channel: bright on sky, water, shadows. Often the noisiest in low light.
- Inspecting a single channel as grayscale reveals what cues live where — useful for debugging.
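Pulling out a single channel is one slice along the last axis. The sketch below assumes NumPy and uses a hypothetical 2×2 test image; with a real photo, each resulting 2D array could be displayed as a grayscale picture.

```python
import numpy as np

# Hypothetical 2x2 RGB image: red and green on top, blue and white below.
image = np.array(
    [[[255, 0, 0], [0, 255, 0]],
     [[0, 0, 255], [255, 255, 255]]],
    dtype=np.uint8,
)

# Each channel is a 2D (H x W) array that can be viewed as grayscale.
red = image[..., 0]
green = image[..., 1]
blue = image[..., 2]

print(red.shape)     # (2, 2)
print(red.tolist())  # [[255, 0], [0, 255]]: bright where the pixel contains red
```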