Learn Support Vector Machine

A free visual AI and machine learning lesson with an interactive 3D visualization, plain-English theory, and quiz.

Simple theory: A support vector machine is a classifier that looks for the cleanest separating boundary between classes. It prefers the boundary with the widest safety margin from the closest examples.

Most classifiers just need a boundary that separates classes — any boundary will do. SVM demands the best boundary: the one with the widest possible gap between the two classes. That gap is the margin. Maximising the margin is not just a visual preference — it is a principled mathematical guarantee of better generalisation.

The SVM objective

The decision function classifies a point x: positive output = class +1, negative output = class −1. The margin is the gap between the two parallel boundary planes. SVM maximises margin by minimising the weight vector norm — smaller ‖w‖ means wider margin.

Decision function:   f(x) = w · x + b
  class +1  if  f(x) ≥ +1   (on or beyond positive margin plane)
  class −1  if  f(x) ≤ −1   (on or beyond negative margin plane)

Margin width = 2 / ‖w‖

Objective: minimise ½‖w‖²  subject to  yᵢ(w·xᵢ + b) ≥ 1  ∀i

Maximising 2/‖w‖ ≡ minimising ‖w‖². SVM turns margin maximisation into a convex QP problem.

Support vectors and the margin

Support vectors are the training points sitting exactly on the margin planes — the points for which yᵢ(w·xᵢ + b) = 1. They are the only points that define the boundary. Delete every other point and the SVM doesn't move.

Soft margin — the C parameter

Hard margin SVMs fail when data isn't perfectly separable (overlapping classes). The soft margin allows some points to violate the margin, penalised by a slack variable ξᵢ. The parameter C controls this trade-off.

Soft margin objective:
  minimise  ½‖w‖² + C·Σξᵢ
  subject to  yᵢ(w·xᵢ + b) ≥ 1 − ξᵢ,   ξᵢ ≥ 0

Large C → narrow margin, few violations (risk: overfitting)
Small C → wide margin, more violations allowed (risk: underfitting)

C is the most important SVM hyperparameter — tune via cross-validation

The kernel trick

When classes aren't linearly separable, the kernel trick implicitly maps features to a higher-dimensional space where a linear boundary separates them — without ever computing the transformation.

Linear kernel: K(x,z) = x·z — fast, good for high-dimensional text data
RBF / Gaussian: K(x,z) = exp(−γ‖x−z‖²) — most versatile, handles circular/irregular boundaries
Polynomial: K(x,z) = (x·z + c)ᵈ — good for image recognition, d controls curve complexity
Sigmoid: K(x,z) = tanh(αx·z + c) — similar to neural network hidden layer