Simple theory: A support vector machine is a classifier that looks for the cleanest separating boundary between classes. It prefers the boundary with the widest safety margin from the closest examples.
Most classifiers just need a boundary that separates classes — any boundary will do. SVM demands the best boundary: the one with the widest possible gap between the two classes. That gap is the margin. Maximising the margin is not just a visual preference — it is a principled mathematical guarantee of better generalisation.
The SVM objective
The decision function classifies a point x: positive output = class +1, negative output = class −1. The margin is the gap between the two parallel boundary planes. SVM maximises margin by minimising the weight vector norm — smaller ‖w‖ means wider margin.
Decision function: f(x) = w · x + b
class +1 if f(x) ≥ +1 (on or beyond positive margin plane)
class −1 if f(x) ≤ −1 (on or beyond negative margin plane)
Margin width = 2 / ‖w‖
Objective: minimise ½‖w‖² subject to yᵢ(w·xᵢ + b) ≥ 1 ∀iMaximising 2/‖w‖ ≡ minimising ‖w‖². SVM turns margin maximisation into a convex QP problem.
Support vectors and the margin
Support vectors are the training points sitting exactly on the margin planes — the points for which yᵢ(w·xᵢ + b) = 1. They are the only points that define the boundary. Delete every other point and the SVM doesn't move.
Soft margin — the C parameter
Hard margin SVMs fail when data isn't perfectly separable (overlapping classes). The soft margin allows some points to violate the margin, penalised by a slack variable ξᵢ. The parameter C controls this trade-off.
Soft margin objective:
minimise ½‖w‖² + C·Σξᵢ
subject to yᵢ(w·xᵢ + b) ≥ 1 − ξᵢ, ξᵢ ≥ 0
Large C → narrow margin, few violations (risk: overfitting)
Small C → wide margin, more violations allowed (risk: underfitting)C is the most important SVM hyperparameter — tune via cross-validation
The kernel trick
When classes aren't linearly separable, the kernel trick implicitly maps features to a higher-dimensional space where a linear boundary separates them — without ever computing the transformation.
- Linear kernel: K(x,z) = x·z — fast, good for high-dimensional text data
- RBF / Gaussian: K(x,z) = exp(−γ‖x−z‖²) — most versatile, handles circular/irregular boundaries
- Polynomial: K(x,z) = (x·z + c)ᵈ — good for image recognition, d controls curve complexity
- Sigmoid: K(x,z) = tanh(αx·z + c) — similar to neural network hidden layer