Information theory answers a deceptively simple question: how much do you learn when something happens? Surprising events carry more information than predictable ones. If someone tells you 'the sun rose this morning', you learned nothing, because you already knew it. If they say 'it snowed in July', you learned a lot. Entropy quantifies this precisely: it is the average surprise over all the outcomes a source can produce.
The entropy formula
H(X) = −Σ p(x) · log₂ p(x). For each possible outcome x: multiply its probability by the base-2 log of its probability, sum the products, and negate. Because the log is base 2, the result is measured in bits. Outcomes with probability 0 contribute nothing to the sum. For a fair coin: H = −(0.5 · log₂ 0.5) − (0.5 · log₂ 0.5) = −(0.5 · −1) − (0.5 · −1) = 1 bit.
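The formula above can be sketched in a few lines of Python. This is a minimal illustration rather than a library implementation; the function name entropy_bits and the example distributions other than the fair coin are assumptions made for the example.

```python
import math

def entropy_bits(probs):
    """Shannon entropy, in bits, of a discrete distribution.

    `probs` is a list of probabilities that should sum to 1.
    (Hypothetical helper written for this example.)
    """
    if abs(sum(probs) - 1.0) > 1e-9:
        raise ValueError("probabilities must sum to 1")
    # Skip zero-probability outcomes: they contribute nothing,
    # and math.log2(0) would raise an error.
    return -sum(p * math.log2(p) for p in probs if p > 0)

fair_coin = entropy_bits([0.5, 0.5])      # the worked example: 1 bit
biased_coin = entropy_bits([0.9, 0.1])    # more predictable, so less than 1 bit
fair_die = entropy_bits([1 / 6] * 6)      # six equally likely outcomes

print(f"fair coin:   {fair_coin:.3f} bits")
print(f"biased coin: {biased_coin:.3f} bits")
print(f"fair die:    {fair_die:.3f} bits")
```

The biased coin comes out below 1 bit because its result is easier to guess, which matches the intuition that predictable events carry less information.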
Cross-entropy loss in ML
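When a classifier outputs probabilities [0.7, 0.2, 0.1] but the true label is class 2, cross-entropy loss measures how poorly the predicted distribution matches the true one. H(true, predicted) = −Σ true(x) · log predicted(x). ML libraries typically use the natural log here, so the loss is in nats rather than bits. With a one-hot label, every term except the true class vanishes, and the loss reduces to −log of the probability the model assigned to the correct class. Cross-entropy can never fall below the entropy of the true distribution, and it reaches that minimum only when the prediction matches the true distribution exactly; for a one-hot label that minimum is 0. A confident wrong prediction makes the loss very large, because −log(p) grows without bound as the probability assigned to the correct class approaches 0. This heavy penalty on confident mistakes is a large part of why cross-entropy is the standard default loss for most classification models.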
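As a rough sketch of the classifier example, the snippet below computes the loss by hand. Two things here are assumptions, not statements from the text: "class 2" is read as the second entry of the list (1-based counting), and the natural log is used, as is common in ML libraries. The helper name cross_entropy and the two extra predictions are hypothetical.

```python
import math

def cross_entropy(true_dist, predicted, eps=1e-12):
    """Cross-entropy in nats: -sum(true * ln(predicted)).

    Hypothetical helper for illustration. Predictions are clamped to
    at least `eps` so that log() never receives 0.
    """
    return -sum(
        t * math.log(max(q, eps))
        for t, q in zip(true_dist, predicted)
        if t > 0  # terms with true probability 0 contribute nothing
    )

predicted = [0.7, 0.2, 0.1]

# Assumption: "class 2" means the second entry (1-based), so the
# one-hot true distribution puts all its mass on index 1.
true_one_hot = [0.0, 1.0, 0.0]

loss = cross_entropy(true_one_hot, predicted)  # equals -ln(0.2)

# Made-up predictions to show the two extremes.
confident_right = cross_entropy(true_one_hot, [0.01, 0.98, 0.01])
confident_wrong = cross_entropy(true_one_hot, [0.98, 0.01, 0.01])

print(f"example prediction: {loss:.3f} nats")
print(f"confident and right: {confident_right:.3f} nats")
print(f"confident and wrong: {confident_wrong:.3f} nats")
```

If "class 2" is instead meant as a 0-based index, the true class is the third entry and the loss becomes −ln(0.1), which is larger still. Either way, only the probability assigned to the true class affects the result.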