A neural network is made of neurons organised into layers. In a fully connected (dense) network, every neuron in one layer connects to every neuron in the next. The architecture (how many layers, and how many neurons per layer) largely determines what the network can and cannot learn. Too small: it can't fit the data. Too large: it can memorise noise.
The three kinds of layers
- Input layer: one neuron per feature in your data. For a 28×28 image: 784 neurons. For a house with 5 features: 5 neurons. No computation happens here — it just holds the data.
- Hidden layers: where the network learns. Each hidden neuron computes a weighted sum of all outputs from the previous layer, adds a bias, and applies an activation function. A network with two or more hidden layers is usually called 'deep'; one with a single hidden layer is 'shallow'.
- Output layer: one neuron per class (multi-class classification), a single neuron (binary classification), or one neuron per target value (regression). Applies softmax for multi-class, sigmoid for binary, or a linear (identity) activation for regression.
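The three layer types can be sketched as a forward pass through the 3 → 4 → 2 network used in the next section. This is a minimal illustration, not a training loop: the weights are random and untrained, ReLU is an assumed choice of hidden activation, and the helper names (`init_layer`, `forward`) are made up for this example.

```python
import numpy as np

def relu(z):
    # Hidden-layer activation (an assumed choice for this sketch).
    return np.maximum(0.0, z)

def softmax(z):
    # Subtract the max for numerical stability before exponentiating.
    exp = np.exp(z - np.max(z))
    return exp / exp.sum()

def init_layer(rng, n_in, n_out):
    # One dense layer: a weight matrix of shape (n_in, n_out) and one bias per neuron.
    weights = rng.normal(0.0, 0.1, size=(n_in, n_out))
    biases = np.zeros(n_out)
    return weights, biases

def forward(x, layers):
    # The input layer does no computation: it just holds the data.
    activation = x
    # Hidden layers: weighted sum + bias, then the activation function.
    for weights, biases in layers[:-1]:
        activation = relu(activation @ weights + biases)
    # Output layer: softmax turns the final scores into class probabilities.
    weights, biases = layers[-1]
    return softmax(activation @ weights + biases)

rng = np.random.default_rng(0)
layers = [init_layer(rng, 3, 4), init_layer(rng, 4, 2)]  # 3 -> 4 -> 2
x = np.array([0.5, -1.2, 3.0])                           # one sample, 3 features
probs = forward(x, layers)
print(probs, probs.sum())
```

The output has one probability per class and sums to 1. For binary classification or regression, the last step would use a sigmoid or no activation instead of softmax.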
Parameters: what the network learns
Parameters in one layer:
Weights = (neurons in previous layer) × (neurons in this layer)
Biases = neurons in this layer
Total = prev × curr + curr
Example: 3 → 4 → 2 network:
Layer 1 (3→4): 3×4 + 4 = 16 params
Layer 2 (4→2): 4×2 + 2 = 10 params
Grand total = 26 parameters
Every arrow in the diagram is a weight. More connections = more parameters = more capacity (and more data needed to train).
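The counting rule above can be checked with a few lines of Python. The function name `count_parameters` is hypothetical, and the second architecture (784 → 128 → 64 → 10) is an illustrative assumption, not one taken from the text.

```python
def count_parameters(layer_sizes):
    """Count weights and biases in a fully connected network.

    layer_sizes lists the neurons per layer, input first, e.g. [3, 4, 2].
    """
    total = 0
    # Walk over consecutive (previous, current) layer pairs.
    for prev, curr in zip(layer_sizes, layer_sizes[1:]):
        total += prev * curr + curr  # weights + biases
    return total

print(count_parameters([3, 4, 2]))           # 26, matching the worked example
print(count_parameters([784, 128, 64, 10]))  # a small image classifier
```

The input layer contributes no parameters of its own, which is why the loop starts at the first pair of layers.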
Depth vs width
A wider network (more neurons per layer) learns more features at the same level of abstraction. A deeper network (more layers) can learn hierarchical representations: in an image model, for example, early layers tend to pick up edges, middle layers shapes, and later layers whole objects. In practice, depth is often more valuable than width for complex tasks: for the same parameter count, a network with several hidden layers often outperforms one with a single very wide hidden layer, although very deep, narrow networks can be harder to train.
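To make "the same parameter count" concrete, the sketch below sizes a single-hidden-layer network to match a deep, narrow one. The specific sizes (20 inputs, ten hidden layers of 32 neurons, 1 output) and the function names are assumptions chosen for illustration. The code only compares parameter budgets; it says nothing about which network would train better.

```python
def count_parameters(layer_sizes):
    # Weights plus biases for each consecutive pair of layers.
    return sum(prev * curr + curr
               for prev, curr in zip(layer_sizes, layer_sizes[1:]))

def matching_single_layer_width(n_in, n_out, target_params):
    # A single hidden layer of width w has
    #   n_in*w + w  (hidden layer)  +  w*n_out + n_out  (output layer)
    # = w*(n_in + n_out + 1) + n_out parameters; solve for the largest w that fits.
    return (target_params - n_out) // (n_in + n_out + 1)

n_in, n_out = 20, 1
deep_sizes = [n_in] + [32] * 10 + [n_out]   # 10 narrow hidden layers
deep_params = count_parameters(deep_sizes)

wide_width = matching_single_layer_width(n_in, n_out, deep_params)
wide_params = count_parameters([n_in, wide_width, n_out])

print(deep_params, wide_width, wide_params)  # 10209 464 10209
```

Here ten hidden layers of 32 neurons cost the same as one hidden layer of 464 neurons, so the two designs spend an identical budget in very different ways.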
Rules of thumb
- Start with 2–3 hidden layers for most tasks — more is rarely better without regularisation
- Hidden layer width: 64–512 neurons depending on task complexity
- First hidden layer is usually wider (captures many low-level patterns), then narrows toward the output
- Output layer: 1 neuron (regression), 1 neuron with sigmoid (binary classification), K neurons with softmax (K-class classification)
- If training loss stays high → network too small (underfitting) → add neurons or layers
- If training loss low but validation loss high → network too large (overfitting) → add dropout or other regularisation, or shrink the network
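The output-layer rule and the two loss-based diagnostics above can be written as small helpers. This is only a sketch: the function names are hypothetical, and the thresholds `high_loss` and `gap` are arbitrary placeholders that would need tuning for a real loss function and dataset.

```python
def output_layer_spec(task, n_classes=None):
    # Returns (number of output neurons, activation name) for a task type.
    if task == "regression":
        return 1, "linear"
    if task == "binary":
        return 1, "sigmoid"
    if task == "multiclass":
        if n_classes is None or n_classes < 2:
            raise ValueError("multiclass needs n_classes >= 2")
        return n_classes, "softmax"
    raise ValueError(f"unknown task: {task}")

def diagnose(train_loss, val_loss, high_loss=0.5, gap=0.1):
    # high_loss and gap are assumed thresholds, not values from the text.
    if train_loss > high_loss:
        return "underfitting: add neurons or layers"
    if val_loss - train_loss > gap:
        return "overfitting: add dropout or regularisation"
    return "no obvious capacity problem"

print(output_layer_spec("multiclass", n_classes=10))
print(diagnose(train_loss=0.9, val_loss=0.95))
print(diagnose(train_loss=0.05, val_loss=0.6))
```

Checking the training loss first mirrors the order of the rules: a network that cannot fit its own training data is too small, regardless of how the validation loss looks.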