Classical ML - Intermediate - 15 min

Learn Support Vector Machine

A free visual AI and machine learning lesson with an interactive 3D visualization, plain-English theory, and quiz.

Last updated: 2026-05-13.

Simple theory: A support vector machine is a classifier that looks for the cleanest separating boundary between classes. It prefers the boundary with the widest safety margin from the closest examples.

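Many classifiers accept any boundary that separates the training classes. An SVM looks for one particular boundary: the one with the widest possible gap between the two classes. That gap is the margin. Maximising the margin is more than a visual preference: a wider margin leaves more room for unseen points to land on the correct side, and statistical learning theory ties larger margins to tighter bounds on generalisation error. It is a principled reason to expect better generalisation, not an unconditional guarantee.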

The SVM objective

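The decision function classifies a point x by the sign of its output: positive output = class +1, negative output = class −1. During training, every point must also lie on or beyond the margin plane of its own class, f(x) = +1 or f(x) = −1. The margin is the gap between these two parallel planes. SVM maximises the margin by minimising the weight vector norm: smaller ‖w‖ means wider margin.

Decision function:   f(x) = w · x + b
  predict class +1  if  f(x) > 0
  predict class −1  if  f(x) < 0

Training constraints (hard margin):
  f(xᵢ) ≥ +1  if  yᵢ = +1   (on or beyond positive margin plane)
  f(xᵢ) ≤ −1  if  yᵢ = −1   (on or beyond negative margin plane)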

Margin width = 2 / ‖w‖

Objective: minimise ½‖w‖²  subject to  yᵢ(w·xᵢ + b) ≥ 1  ∀i

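Maximising 2/‖w‖ is equivalent to minimising ½‖w‖², since both are optimised by the same w. The squared form is used because, together with the linear constraints, it turns margin maximisation into a convex quadratic programming (QP) problem with a single global optimum.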
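The objective can be checked by hand on a toy problem. The following is a minimal sketch in plain Python, assuming a hand-picked 2D dataset and hand-picked values of w and b rather than the output of a solver. The helper names (decision, margin_width, satisfies_constraints) are illustrative and do not come from any library.

```python
import math

# Hypothetical toy data: two points per class in 2D, labels in {+1, -1}.
X = [(2.0, 2.0), (3.0, 3.0), (0.0, 0.0), (-1.0, 0.0)]
y = [+1, +1, -1, -1]

# Hand-picked weights and bias (an assumption, not the result of training).
w = (0.5, 0.5)
b = -1.0

def decision(w, b, x):
    """f(x) = w . x + b"""
    return sum(wi * xi for wi, xi in zip(w, x)) + b

def margin_width(w):
    """Distance between the planes f(x) = +1 and f(x) = -1, i.e. 2 / ||w||."""
    return 2.0 / math.sqrt(sum(wi * wi for wi in w))

def satisfies_constraints(w, b, X, y, tol=1e-9):
    """True if y_i * f(x_i) >= 1 holds for every training point."""
    return all(yi * decision(w, b, xi) >= 1.0 - tol for xi, yi in zip(X, y))

for xi, yi in zip(X, y):
    print(xi, yi, decision(w, b, xi))
print("margin width:", margin_width(w))
print("feasible:", satisfies_constraints(w, b, X, y))
```

In this sketch the closest opposite-class points, (2, 2) and (0, 0), are √8 ≈ 2.83 apart, which equals 2/‖w‖ for the chosen w. No feasible (w, b) can give a wider margin on this data, so the hand-picked values happen to be the hard-margin optimum. Doubling w and b keeps every constraint satisfied but halves the margin, which is why the objective penalises large ‖w‖.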

Support vectors and the margin

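Support vectors are the training points that determine the boundary. In the hard-margin case they sit exactly on the margin planes: they are the points for which yᵢ(w·xᵢ + b) = 1. Every other point lies strictly beyond its margin plane and has no influence on the solution. Delete all of those other points, retrain, and the boundary doesn't move. With a soft margin, points that fall inside the margin or on the wrong side of the boundary also count as support vectors.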
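To make the definition concrete, the sketch below picks out the points that lie on a margin plane. It is a minimal illustration in plain Python, assuming a hypothetical toy dataset and a hand-picked hard-margin solution (w, b); the function names are illustrative, and a real solver would report support vectors itself.

```python
# Hypothetical toy data and a hand-picked hard-margin solution (assumptions).
X = [(2.0, 2.0), (3.0, 3.0), (4.0, 1.0), (0.0, 0.0), (-1.0, 0.0), (0.0, -2.0)]
y = [+1, +1, +1, -1, -1, -1]
w, b = (0.5, 0.5), -1.0

def functional_margin(w, b, x, label):
    """y * (w . x + b); equals 1 exactly on a margin plane."""
    return label * (sum(wi * xi for wi, xi in zip(w, x)) + b)

def support_vector_indices(w, b, X, y, tol=1e-9):
    """Indices of the points lying on a margin plane (within tol)."""
    return [i for i, (xi, yi) in enumerate(zip(X, y))
            if abs(functional_margin(w, b, xi, yi) - 1.0) <= tol]

sv = support_vector_indices(w, b, X, y)
print("support vectors:", [X[i] for i in sv])
```

Only (2, 2) and (0, 0) have a functional margin of exactly 1 here. The other four points have values of 1.5 or 2, so they lie strictly beyond the margin planes and could be removed without making the chosen (w, b) any less optimal.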

Soft margin — the C parameter

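Hard-margin SVMs have no solution when the data isn't perfectly separable (overlapping classes). The soft margin allows some points to violate the margin. Each point's violation is measured by a slack variable ξᵢ, and the total slack is penalised in the objective. The parameter C controls the trade-off between a wide margin and few violations.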

Soft margin objective:
  minimise  ½‖w‖² + C·Σξᵢ
  subject to  yᵢ(w·xᵢ + b) ≥ 1 − ξᵢ,   ξᵢ ≥ 0

Large C → narrow margin, few violations (risk: overfitting)
Small C → wide margin, more violations allowed (risk: underfitting)

C is usually the first SVM hyperparameter to tune, typically via cross-validation (jointly with any kernel parameters when a non-linear kernel is used)
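The effect of C can be seen by scoring two candidate boundaries under the soft margin objective. This is a minimal sketch in plain Python, assuming hypothetical 1D data and two hand-picked candidates labelled wide and narrow; neither is claimed to be the exact optimum for a given C, and the helper names are illustrative.

```python
def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))

def slacks(w, b, X, y):
    """Smallest slack satisfying each constraint: max(0, 1 - y_i * (w . x_i + b))."""
    return [max(0.0, 1.0 - yi * (dot(w, xi) + b)) for xi, yi in zip(X, y)]

def soft_margin_objective(w, b, X, y, C):
    """0.5 * ||w||^2 + C * (sum of slacks)"""
    return 0.5 * dot(w, w) + C * sum(slacks(w, b, X, y))

# Hypothetical 1D data: the positive point at 0.5 sits close to the negatives.
X = [(-3.0,), (-2.0,), (0.5,), (2.0,), (3.0,)]
y = [-1, -1, +1, +1, +1]

# Two hand-picked candidates (w, b), not solver output.
wide = ((0.5,), 0.0)    # margin width 2 / 0.5 = 4; the point at 0.5 violates it
narrow = ((0.8,), 0.6)  # margin width 2 / 0.8 = 2.5; no meaningful violations

for C in (0.1, 10.0):
    scores = {name: soft_margin_objective(w, b, X, y, C)
              for name, (w, b) in (("wide", wide), ("narrow", narrow))}
    print("C =", C, scores, "-> prefers", min(scores, key=scores.get))
```

With C = 0.1 the wide candidate scores lower (about 0.2 against 0.32), because its single violation is cheap. With C = 10 the same violation costs 7.5, so the narrow candidate wins. That matches the rule of thumb above: large C buys fewer violations at the price of a narrower margin.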

The kernel trick

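When classes aren't linearly separable, the kernel trick implicitly maps features to a higher-dimensional space where a linear boundary can separate them. The mapping itself is never computed: the kernel K(x,z) returns the inner product of the two mapped points directly from the original features.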
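A small numeric check shows what "implicitly" means. The sketch below, in plain Python, assumes the degree-2 kernel K(x,z) = (x·z)² on 2D inputs and one standard choice of explicit feature map for it; the names phi and poly2_kernel are illustrative.

```python
import math

def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))

def phi(x):
    """Explicit degree-2 feature map for a 2D input: (x1^2, sqrt(2)*x1*x2, x2^2)."""
    x1, x2 = x
    return (x1 * x1, math.sqrt(2.0) * x1 * x2, x2 * x2)

def poly2_kernel(x, z):
    """K(x, z) = (x . z)^2, computed entirely in the original 2D space."""
    return dot(x, z) ** 2

x, z = (1.0, 2.0), (3.0, -1.0)
explicit = dot(phi(x), phi(z))  # map both points to 3D, then take the inner product
implicit = poly2_kernel(x, z)   # same quantity without ever building phi
print(explicit, implicit)
```

Both values agree up to floating-point rounding. For this small kernel the explicit map is cheap, but for kernels such as the RBF the corresponding feature space is infinite-dimensional, so evaluating K directly is the only practical option.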

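  • Linear kernel: K(x,z) = x·z. Fast, good for high-dimensional text data
  • RBF / Gaussian: K(x,z) = exp(−γ‖x−z‖²). Most versatile, handles circular/irregular boundaries; γ controls how far each training point's influence reaches
  • Polynomial: K(x,z) = (x·z + c)ᵈ. Models feature interactions up to degree d and has been used for image recognition; d controls curve complexity
  • Sigmoid: K(x,z) = tanh(αx·z + c). Resembles a neural network hidden unit, but is a valid (positive semi-definite) kernel only for some values of α and c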
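The four kernels above translate almost directly into code. The following is a minimal sketch in plain Python; the default parameter values (gamma, c, d, alpha) are arbitrary assumptions chosen for illustration, and gram_matrix is a hypothetical helper rather than a library function.

```python
import math

def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))

def linear_kernel(x, z):
    return dot(x, z)

def rbf_kernel(x, z, gamma=1.0):
    sq_dist = sum((xi - zi) ** 2 for xi, zi in zip(x, z))
    return math.exp(-gamma * sq_dist)

def polynomial_kernel(x, z, c=1.0, d=3):
    return (dot(x, z) + c) ** d

def sigmoid_kernel(x, z, alpha=1.0, c=0.0):
    return math.tanh(alpha * dot(x, z) + c)

def gram_matrix(kernel, points):
    """K(x_i, x_j) for every pair of points, as a list of rows."""
    return [[kernel(p, q) for q in points] for p in points]

points = [(0.0, 0.0), (1.0, 0.0), (0.0, 2.0)]
for row in gram_matrix(rbf_kernel, points):
    print([round(v, 4) for v in row])
```

The RBF matrix has ones on the diagonal, since every point is at distance zero from itself, and its entries shrink towards zero as points move apart. A kernel SVM only ever needs these pairwise values, never the coordinates in the higher-dimensional space.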

Practice questions

  1. What are 'support vectors' in an SVM?
  2. Why does maximising the margin improve generalisation?
  3. When data is NOT linearly separable, SVMs use:
  4. A soft-margin SVM uses a parameter C. What does a very large C mean?
