mutalenucom / vol IV
The Mutalenu Review · Vol IV · Issue 04

Vol IV · Issue 04 · A Mutalenu Long Read

The Mathematics of Machine Learning.

A complete tour of the math that machine learning rests on. Linear algebra, vector calculus, probability, optimisation, and then the four canonical machine learning problems built directly on top. Long. Dense. Visual.

Length74 modules · 11 parts Read time~ 12 hours Last revisedMay 2026 PriceFree

Part 0

A note before you start.

By the end of this note you will be able to read a machine learning paper, understand the math in it, and tell the difference between a real result and a wrapped-up triviality. The path runs through six chapters of math foundations and four chapters of canonical ML problems built on those foundations. The math comes first because everything else collapses without it.

How to use this note

Machine learning rests on four pillars: linear algebra, vector calculus, probability, and continuous optimisation. Each one is a substantial subject in its own right. The first six parts of this note cover them in turn. The last five parts apply them to four classical ML problems: linear regression, principal component analysis, Gaussian mixture models, and support vector machines.

You are reading this in the order it was written. The dependencies run forward: nothing in Part V relies on something later. If a module is hard, the prerequisites are above it.

Throughout the note, three things appear alongside the prose:

  • Equations, set with MathJax. Inline math looks like \(y = mx + b\). Display math gets its own line: \[ \mathbf{y} = \mathbf{A}\mathbf{x} + \mathbf{b}. \]
  • Animations (manim videos), embedded inline. Looped, no audio. They show what the algebra is doing geometrically.
  • Interactive widgets, drawn in SVG. You drag points, slide parameters, and watch the math respond. Used wherever interactivity teaches better than animation.

Proofs are mostly omitted. The bias of this note is toward intuition, computation, and the "what is this for" question. When a result is stated without proof, you can take it on faith and look up the proof in any standard reference if you want one.

Each module ends with a "test yourself" prompt. Read it, try to answer aloud or on paper before moving on. The prompts are designed so that a confident answer means the module landed; a fuzzy one means a re-read is in order.

A note on background

You should be comfortable with high-school algebra and ideally a single-variable calculus course. Anything beyond that gets introduced where it shows up. If you have taken any university-level math course recently, you have plenty.

Test yourself

Why does this note teach the math foundations before the machine learning algorithms, and what would go wrong if you tried to learn them in the opposite order?

Part I. Linear algebra

The language everything else is written in.

A vector is the simplest object in machine learning that is more than a number. A matrix is the next step up. Almost every ML algorithm is, when you peel away the surface, a transformation of a vector by a matrix, possibly followed by a non-linear squash. Knowing what those operations mean geometrically is half the work of understanding why an algorithm does what it does.

1. Why linear algebra

Most things you measure end up in a list. A house has a list of features (square feet, bedrooms, year built, asking price). A photograph is a list of pixel intensities. A user's behaviour on a website is a list of click counts. Once you commit to representing the world as lists of numbers, you need a vocabulary for talking about them, and that vocabulary is linear algebra.

Two operations show up in essentially every algorithm.

  1. Adding two vectors. Combine two lists of numbers element-by-element. This is how you average data, how you compute residuals, how a neural network sums its inputs.
  2. Multiplying a vector by a matrix. Apply a linear transformation. This is how you rotate data, project it, change its coordinate system, or pass it through a layer of a neural network.

That is the whole list. Every other operation in this part is built from those two. The reason linear algebra dominates ML is not that the problems are inherently linear. It is that linear operations are tractable: they have closed-form solutions, fast algorithms, and clean geometric interpretations. Non-linear behaviour is then introduced in carefully controlled ways, on top of a linear backbone.

The promise of this part

By the end of Part I you will be able to look at an expression like \(\mathbf{x}^\top \mathbf{A} \mathbf{x}\) and immediately know its shape (it is a scalar), what \(\mathbf{A}\) is doing geometrically (it is acting on \(\mathbf{x}\) and the result is being measured against \(\mathbf{x}\) again), and how the answer changes if \(\mathbf{A}\) does. That kind of fluency is what makes the rest of ML readable.

Test yourself

Name three pieces of real-world data that you would naturally represent as a vector. For each, what would each entry of the vector mean?

2. Vectors

A vector is an ordered list of real numbers. It has a length (how many numbers it has, called its dimension) and you can do two operations to it without ever leaving the world of vectors.

\[ \mathbf{v} = \begin{pmatrix} v_1 \\ v_2 \\ \vdots \\ v_n \end{pmatrix} \in \mathbb{R}^n \]

The notation \(\mathbb{R}^n\) means "the set of all vectors with \(n\) real-number entries." A vector in \(\mathbb{R}^2\) has two entries and you can draw it on a piece of paper. A vector in \(\mathbb{R}^3\) has three and you can draw it in space. A vector in \(\mathbb{R}^{784}\) is a flattened 28×28 grayscale image, one entry per pixel; you cannot draw it but the algebra works exactly the same.

The two operations

You can add two vectors of the same dimension by adding their entries:

\[ \begin{pmatrix} 1 \\ 2 \end{pmatrix} + \begin{pmatrix} 3 \\ 1 \end{pmatrix} = \begin{pmatrix} 4 \\ 3 \end{pmatrix} \]

You can multiply a vector by a scalar (a single number) by multiplying every entry:

\[ 2 \cdot \begin{pmatrix} 1 \\ 2 \end{pmatrix} = \begin{pmatrix} 2 \\ 4 \end{pmatrix} \]

That is it. From these two operations alone, all of linear algebra is built up. A set with these two operations satisfying a few obvious algebraic rules (commutativity, associativity, a zero element, distributivity) is called a vector space. The set \(\mathbb{R}^n\) is the most familiar one; later parts will use other vector spaces (functions, polynomials, matrices) where the same rules apply.

Interactive · 02aDrag the yellow tip a or the green tip b. The dashed green leg is b translated to a's tip; the white arrow from the origin to that point is a + b. Tip-to-tail addition lands at the same place as the straight route.

Geometric picture

It helps to think of a 2D vector as an arrow from the origin to a point. Adding two vectors then means: lay the second arrow's tail at the first arrow's tip. The sum is the arrow from the original origin to the final tip. Scalar multiplication stretches or shrinks (or flips, if the scalar is negative) without changing the direction. The widget above is built on this picture.

Test yourself

If \(\mathbf{a} = (3, 1)^\top\) and \(\mathbf{b} = (-1, 2)^\top\), what is \(\mathbf{a} + \mathbf{b}\)? What is \(2\mathbf{a} - \mathbf{b}\)? Sketch all four arrows.

3. Span and linear independence

Combine the two operations from the last module and you get the most useful single concept in linear algebra. Given a set of vectors \(\mathbf{v}_1, \mathbf{v}_2, \ldots, \mathbf{v}_k\), a linear combination of them is any expression of the form

\[ \alpha_1 \mathbf{v}_1 + \alpha_2 \mathbf{v}_2 + \cdots + \alpha_k \mathbf{v}_k \]

where the \(\alpha_i\) are scalars. As you let the scalars vary over all real numbers, the set of all vectors you can produce is called the span of \(\{\mathbf{v}_1, \ldots, \mathbf{v}_k\}\).

The span has a clean geometric interpretation. The span of one non-zero vector is the line through the origin in that vector's direction. The span of two non-parallel vectors in 3D is the plane through the origin containing them. The span of three vectors that do not lie in one plane is all of \(\mathbb{R}^3\). In general, the span of \(k\) vectors in \(\mathbb{R}^n\) is a flat subspace of \(\mathbb{R}^n\) passing through the origin, of some dimension between 0 and \(\min(k, n)\).

Linear independence

A set of vectors is linearly independent if none of them can be written as a linear combination of the others. Equivalently, the only way to write \(\mathbf{0}\) as a linear combination is with all coefficients zero:

\[ \alpha_1 \mathbf{v}_1 + \cdots + \alpha_k \mathbf{v}_k = \mathbf{0} \quad \implies \quad \alpha_1 = \cdots = \alpha_k = 0. \]

If they are not independent, they are linearly dependent: at least one is redundant, expressible as a combination of the others. Two parallel vectors in 2D are linearly dependent (one is a multiple of the other). Three vectors in 2D are always linearly dependent (they cannot all point in genuinely different directions when only two dimensions are available).

Subspaces

A subspace of \(\mathbb{R}^n\) is a non-empty subset that is closed under both vector addition and scalar multiplication. Adding two members or scaling one stays inside. Every span is a subspace; in fact, every subspace is the span of some set of vectors.

Subspaces are the structural objects you will care about in ML. The set of all images of dogs in a 784-dimensional pixel space is not a subspace (it is some weird curved manifold), but many useful approximations and projections do live in subspaces. PCA finds a low-dimensional subspace that captures most of the variance in a dataset. Linear regression projects the target onto a subspace spanned by the features. Knowing what a subspace is, and how to spot one, is foundational.

Interactive · 07aOne vector — span is a 1D subspace (a line). Drag the yellow tip; it stays on the line. Every point you can reach is a · v for some scalar a.
Interactive · 07bTwo vectors — span is a 2D subspace (a plane). Drag the yellow point; it slides on the plane and cannot leave it. The decomposition a · v + b · w rebuilds tip-to-tail.
Interactive · 07cThree vectors — span is all of R³. Drag P anywhere; nothing is off-limits. The three colored arrows redraw tip-to-tail with the unique coefficients a, b, c.
Test yourself

Are the vectors \((1,2)^\top\), \((2,4)^\top\), and \((1,0)^\top\) linearly independent in \(\mathbb{R}^2\)? What is the dimension of their span?

4. Basis and dimension

A basis of a vector space is a linearly independent set of vectors that spans the whole space. Two conditions, both required: independent (none redundant) and spanning (covers everything).

The most familiar basis for \(\mathbb{R}^n\) is the standard basis: the vectors

\[ \mathbf{e}_1 = \begin{pmatrix} 1 \\ 0 \\ 0 \\ \vdots \end{pmatrix}, \quad \mathbf{e}_2 = \begin{pmatrix} 0 \\ 1 \\ 0 \\ \vdots \end{pmatrix}, \quad \ldots, \quad \mathbf{e}_n = \begin{pmatrix} 0 \\ \vdots \\ 0 \\ 1 \end{pmatrix}. \]

Any vector can be written as a unique linear combination of these. The vector \((3, 5, 2)^\top\) is just \(3\mathbf{e}_1 + 5\mathbf{e}_2 + 2\mathbf{e}_3\); the entries themselves are the coefficients in this basis.

Dimension

The deep theorem of linear algebra (one of the few that you should genuinely keep in mind): every basis of a given vector space has the same number of elements. This number is called the dimension of the space. \(\mathbb{R}^n\) has dimension \(n\). A line through the origin has dimension 1. A plane through the origin has dimension 2.

Different bases give different coordinates for the same vector. A point in the plane has one set of \((x, y)\) coordinates in the standard basis, and a different pair of numbers in any rotated basis. The point itself does not change; only its description does. This idea, that a vector is independent of any particular basis, is the core insight of linear algebra. We will return to it formally in Module 9 when we discuss change of basis.

Test yourself

Are \((1,1)^\top\) and \((1,-1)^\top\) a basis of \(\mathbb{R}^2\)? If so, write the vector \((3, 5)^\top\) as a linear combination of them.

5. Matrices: data and transformation

A matrix is a rectangular array of numbers. An \(m \times n\) matrix has \(m\) rows and \(n\) columns. Matrices in machine learning play two distinct roles, and you should keep them separate in your head.

Role 1: A matrix as a stack of data points

Suppose you have 100 houses, each described by 4 features (square feet, bedrooms, year built, price). You can store them as a \(100 \times 4\) matrix \(\mathbf{X}\), where each row is one house and each column is one feature. This is the most common shape of "data" in ML libraries: rows are observations, columns are variables.

Role 2: A matrix as a function

An \(m \times n\) matrix \(\mathbf{A}\) also acts on a vector \(\mathbf{x} \in \mathbb{R}^n\) and produces a vector \(\mathbf{y} \in \mathbb{R}^m\) by the rule \(\mathbf{y} = \mathbf{A}\mathbf{x}\). When viewed this way, \(\mathbf{A}\) is a function from \(\mathbb{R}^n\) to \(\mathbb{R}^m\), and a special kind of function: a linear map. It satisfies

\[ \mathbf{A}(\alpha\mathbf{x} + \beta\mathbf{y}) = \alpha\mathbf{A}\mathbf{x} + \beta\mathbf{A}\mathbf{y} \]

for any vectors \(\mathbf{x}, \mathbf{y}\) and scalars \(\alpha, \beta\). It distributes over linear combinations. Geometrically, this means the matrix sends straight lines through the origin to straight lines through the origin, and parallel lines to parallel lines. It can stretch, rotate, reflect, project, and shear, but it cannot bend or curve.

Manim · 01The unit square under several linear transformations: rotation, shear, stretch, and projection.

The columns of A are where the basis vectors land

The single most useful fact about matrices: if \(\mathbf{A}\) is an \(m \times n\) matrix and \(\mathbf{e}_j\) is the \(j\)-th standard basis vector, then \(\mathbf{A}\mathbf{e}_j\) is the \(j\)-th column of \(\mathbf{A}\). So the columns of a matrix are exactly the images of the basis vectors under the transformation. This gives you a direct way to read off what a matrix does: look at where it sends \(\mathbf{e}_1, \mathbf{e}_2, \ldots\) and you have completely determined it.

For example, the matrix

\[ \mathbf{A} = \begin{pmatrix} 0 & -1 \\ 1 & 0 \end{pmatrix} \]

sends \(\mathbf{e}_1 = (1,0)^\top\) to \((0,1)^\top\) and \(\mathbf{e}_2 = (0,1)^\top\) to \((-1, 0)^\top\). These two outputs are 90 degrees from where the inputs started. \(\mathbf{A}\) is a 90-degree counterclockwise rotation, and you read that off the columns.

Test yourself

Write down the \(2 \times 2\) matrix that scales x-coordinates by 3 and y-coordinates by 2. What does its action look like on the unit square?

6. Matrix arithmetic

Matrices add and scale element-wise, just like vectors. The interesting operation is multiplication.

Matrix-vector multiplication

If \(\mathbf{A}\) is \(m \times n\) and \(\mathbf{x}\) is in \(\mathbb{R}^n\), then \(\mathbf{A}\mathbf{x}\) is the vector in \(\mathbb{R}^m\) whose \(i\)-th entry is the dot product of the \(i\)-th row of \(\mathbf{A}\) with \(\mathbf{x}\):

\[ (\mathbf{A}\mathbf{x})_i = \sum_{j=1}^n A_{ij} x_j. \]

An equally useful interpretation: \(\mathbf{A}\mathbf{x}\) is a linear combination of the columns of \(\mathbf{A}\), with the entries of \(\mathbf{x}\) as coefficients. So

\[ \mathbf{A}\mathbf{x} = x_1 \mathbf{a}_1 + x_2 \mathbf{a}_2 + \cdots + x_n \mathbf{a}_n, \]

where \(\mathbf{a}_j\) is the \(j\)-th column of \(\mathbf{A}\). Both views are correct and you should be fluent switching between them. The second is more useful when thinking geometrically: it says the matrix mixes its columns according to the recipe in \(\mathbf{x}\).

Matrix-matrix multiplication

If \(\mathbf{A}\) is \(m \times n\) and \(\mathbf{B}\) is \(n \times p\), then \(\mathbf{A}\mathbf{B}\) is the \(m \times p\) matrix whose \(j\)-th column is \(\mathbf{A}\) acting on the \(j\)-th column of \(\mathbf{B}\). Equivalently, \((\mathbf{A}\mathbf{B})_{ij} = \sum_k A_{ik} B_{kj}\). The number of columns of \(\mathbf{A}\) must match the number of rows of \(\mathbf{B}\); otherwise the product is undefined.

The cleanest way to think about \(\mathbf{A}\mathbf{B}\) is as function composition. The matrix \(\mathbf{B}\) is a linear map from \(\mathbb{R}^p\) to \(\mathbb{R}^n\). The matrix \(\mathbf{A}\) is a linear map from \(\mathbb{R}^n\) to \(\mathbb{R}^m\). The product \(\mathbf{A}\mathbf{B}\) is the composed map: first apply \(\mathbf{B}\), then apply \(\mathbf{A}\), which gives a linear map from \(\mathbb{R}^p\) to \(\mathbb{R}^m\). This is why dimensions must agree, and why the result has the dimensions it does.

Matrix multiplication is not commutative

In general \(\mathbf{A}\mathbf{B} \neq \mathbf{B}\mathbf{A}\). Even when both products are defined and have the same shape, the results are usually different. Geometrically: rotating then stretching is not the same as stretching then rotating. The order of operations matters.

It is associative, though: \((\mathbf{A}\mathbf{B})\mathbf{C} = \mathbf{A}(\mathbf{B}\mathbf{C})\), so you can drop parentheses without ambiguity. And it distributes over addition: \(\mathbf{A}(\mathbf{B}+\mathbf{C}) = \mathbf{A}\mathbf{B} + \mathbf{A}\mathbf{C}\).

Test yourself

If \(\mathbf{A}\) is \(3 \times 4\) and \(\mathbf{B}\) is \(4 \times 2\), what shape is \(\mathbf{A}\mathbf{B}\)? Is \(\mathbf{B}\mathbf{A}\) defined?

7. Solving Ax = b

Given a matrix \(\mathbf{A}\) and a vector \(\mathbf{b}\), the equation \(\mathbf{A}\mathbf{x} = \mathbf{b}\) asks: which input \(\mathbf{x}\) does the linear map \(\mathbf{A}\) send to \(\mathbf{b}\)? This is the central problem of linear algebra and you will solve some version of it almost every day in ML, often without realising.

Three cases

Depending on \(\mathbf{A}\) and \(\mathbf{b}\), the equation \(\mathbf{A}\mathbf{x} = \mathbf{b}\) has zero, one, or infinitely many solutions.

  • Unique solution. When \(\mathbf{A}\) is square and "full rank" (we will define this precisely in Module 8), the equation has exactly one \(\mathbf{x}\). This is the case where the map \(\mathbf{A}\) is invertible.
  • No solution. When \(\mathbf{b}\) is not in the span of the columns of \(\mathbf{A}\), no input can possibly land at \(\mathbf{b}\). The system is inconsistent. This is the situation in linear regression, where the data is rarely exactly linear; we will use least squares to find the closest \(\mathbf{x}\) instead.
  • Infinitely many solutions. When the columns of \(\mathbf{A}\) are not independent (some are redundant), the kernel is non-trivial and you can add any kernel element to a solution and get another. This is what regularisation is for, in part: it picks one solution out of the infinite family.

Gaussian elimination

The classical algorithm for solving \(\mathbf{A}\mathbf{x} = \mathbf{b}\) when a unique solution exists. It applies elementary row operations (swap two rows, scale a row by a non-zero constant, add a multiple of one row to another) to the augmented matrix \([\mathbf{A} \mid \mathbf{b}]\) until \(\mathbf{A}\) has been reduced to upper-triangular form (or further, to reduced row echelon form). Then back-substitute.

You will rarely run Gaussian elimination by hand on anything bigger than \(3 \times 3\). Numerical libraries do it for you, with care for stability (pivoting) and efficiency. The reason to know how it works: every linear algebra question is answered by some refinement of this algorithm.

Geometric picture

\(\mathbf{A}\mathbf{x} = \mathbf{b}\) is asking "what combination of the columns of \(\mathbf{A}\) equals \(\mathbf{b}\)?" If \(\mathbf{b}\) is in the column span of \(\mathbf{A}\), the answer is yes there is at least one combination. If the columns are independent, the combination is unique. Both pieces of insight matter, and together they explain the three cases above.

Test yourself

For the system \(x + y = 3\), \(2x + 2y = 7\), explain in one sentence why there is no solution. What about \(x + y = 3\), \(2x + 2y = 6\)?

8. Inverse, transpose, rank

Three properties of a matrix you should be able to compute and reason about.

The inverse

If \(\mathbf{A}\) is square and there exists a matrix \(\mathbf{A}^{-1}\) such that \(\mathbf{A}\mathbf{A}^{-1} = \mathbf{A}^{-1}\mathbf{A} = \mathbf{I}\) (the identity matrix), then \(\mathbf{A}\) is invertible. The inverse is the matrix that "undoes" \(\mathbf{A}\): if \(\mathbf{A}\) is a 90-degree rotation, \(\mathbf{A}^{-1}\) is a 90-degree rotation in the opposite direction.

Not every square matrix has an inverse. A matrix is invertible exactly when its columns are linearly independent (equivalently, its determinant is non-zero, or its kernel contains only \(\mathbf{0}\)). When an inverse exists, the unique solution to \(\mathbf{A}\mathbf{x} = \mathbf{b}\) is \(\mathbf{x} = \mathbf{A}^{-1}\mathbf{b}\). In practice you almost never form \(\mathbf{A}^{-1}\) explicitly; you solve \(\mathbf{A}\mathbf{x} = \mathbf{b}\) directly because that is faster and more numerically stable.

The transpose

The transpose of an \(m \times n\) matrix \(\mathbf{A}\), written \(\mathbf{A}^\top\), is the \(n \times m\) matrix you get by swapping rows and columns: \((\mathbf{A}^\top)_{ij} = A_{ji}\). The transpose has a few essential properties you will use constantly:

\[ (\mathbf{A}\mathbf{B})^\top = \mathbf{B}^\top \mathbf{A}^\top, \quad (\mathbf{A}^\top)^\top = \mathbf{A}, \quad (\mathbf{A}^{-1})^\top = (\mathbf{A}^\top)^{-1}. \]

A matrix that equals its own transpose is called symmetric. Symmetric matrices are everywhere in ML: covariance matrices are symmetric, the Hessian of a smooth function is symmetric, the Gram matrix \(\mathbf{X}^\top \mathbf{X}\) is symmetric. They have especially nice spectral properties (Module 18).

Rank

The rank of a matrix is the dimension of the span of its columns (equivalently, the dimension of the span of its rows). It tells you how much "real information" is in the matrix. A \(5 \times 3\) matrix can have rank at most 3. If its rank is 3, all three columns are linearly independent and the matrix is "full column rank." If the rank is 2, two columns are independent and the third is some combination of them.

Rank shows up in three guises that you should learn to recognise:

  1. The number of independent rows (or columns) of \(\mathbf{A}\).
  2. The dimension of the image of the map \(\mathbf{A}\): how many dimensions of \(\mathbb{R}^m\) are actually reached by \(\mathbf{A}\mathbf{x}\) as \(\mathbf{x}\) ranges over \(\mathbb{R}^n\).
  3. The number of non-zero singular values of \(\mathbf{A}\) (we get to singular values in Module 21).

All three definitions agree.

Test yourself

What is the rank of the matrix \(\begin{pmatrix} 1 & 2 \\ 2 & 4 \\ 3 & 6 \end{pmatrix}\)? Is it invertible? What is the dimension of its column space?

9. Linear maps and change of basis

A vector exists independently of any particular description. The pair \((3, 5)\) only makes sense once you have agreed on a basis. With the standard basis it means "3 to the right and 5 up." With a rotated basis it would mean something else. The vector itself is the same arrow in space; the coordinates depend on the choice of basis.

This means that a single linear map can be represented by many different matrices, depending on which bases you use for the input and output spaces. The map is the abstract object; the matrix is its description in coordinates.

The change-of-basis formula

Let \(T : \mathbb{R}^n \to \mathbb{R}^n\) be a linear map represented by a matrix \(\mathbf{A}\) in the standard basis. Suppose you choose a different basis whose vectors form the columns of an invertible matrix \(\mathbf{P}\). The matrix that represents \(T\) in this new basis is

\[ \mathbf{A}' = \mathbf{P}^{-1} \mathbf{A} \mathbf{P}. \]

The recipe: change coordinates from new-basis to standard via \(\mathbf{P}\), apply the map \(\mathbf{A}\) in standard coordinates, then change back to new-basis coordinates via \(\mathbf{P}^{-1}\). The composition is one matrix that does the whole thing in new-basis coordinates.

This formula is responsible for some of the most useful constructions in ML. Diagonalisation, which we cover in Module 19, is the operation of finding a special basis in which a matrix becomes diagonal. PCA changes basis to one in which the data has uncorrelated coordinates. Eigendecomposition, SVD, and Jordan form are all change-of-basis statements at heart.

Matrices with respect to two bases

For a map between different spaces (say \(T : \mathbb{R}^n \to \mathbb{R}^m\)), you choose a basis for each. If \(\mathbf{P}\) is the matrix whose columns are the new basis of \(\mathbb{R}^n\) (in standard coordinates), and \(\mathbf{Q}\) is the matrix whose columns are the new basis of \(\mathbb{R}^m\), then a map represented by \(\mathbf{A}\) in standard bases is represented by \(\mathbf{Q}^{-1}\mathbf{A}\mathbf{P}\) in the new ones. Same logic, two changes of basis.

Manim · 08Same vector, two coordinate systems. The arrow stays put. Only its description changes.
Test yourself

If \(\mathbf{A}\) is the matrix of \(T\) in the standard basis, and \(\mathbf{P}\) is the matrix whose columns are the new basis vectors, what does \(\mathbf{P}^{-1}\mathbf{A}\mathbf{P}\) compute? What does each of the three operations represent geometrically?

10. Image, kernel, rank-nullity

Two subspaces are associated with every linear map, and they explain almost everything about how the map behaves.

Image (column space)

The image of \(\mathbf{A}\), written \(\text{Im}(\mathbf{A})\) or \(\text{col}(\mathbf{A})\), is the set of all outputs:

\[ \text{Im}(\mathbf{A}) = \{\mathbf{A}\mathbf{x} : \mathbf{x} \in \mathbb{R}^n\}. \]

It is a subspace of \(\mathbb{R}^m\) (the codomain). It is exactly the span of the columns of \(\mathbf{A}\), which is why "image" and "column space" are the same thing. Its dimension equals the rank of \(\mathbf{A}\).

Kernel (null space)

The kernel of \(\mathbf{A}\), written \(\ker(\mathbf{A})\) or \(\text{null}(\mathbf{A})\), is the set of inputs that get mapped to zero:

\[ \ker(\mathbf{A}) = \{\mathbf{x} \in \mathbb{R}^n : \mathbf{A}\mathbf{x} = \mathbf{0}\}. \]

It is a subspace of \(\mathbb{R}^n\) (the domain). Its dimension is called the nullity. The kernel measures how much the map "collapses": if \(\ker(\mathbf{A}) = \{\mathbf{0}\}\) then \(\mathbf{A}\) is injective and you can recover \(\mathbf{x}\) from \(\mathbf{A}\mathbf{x}\). If the kernel is non-trivial, then many different inputs share the same output.

The rank-nullity theorem

The bookkeeping rule that ties them together. For any \(m \times n\) matrix:

\[ \text{rank}(\mathbf{A}) + \text{nullity}(\mathbf{A}) = n. \]

The total dimension of the input space splits cleanly into "the dimensions that survive" (the rank, image dimension) and "the dimensions that get squashed to zero" (the nullity). This is one of the few theorems in this part you should keep on hand. It explains why a tall skinny matrix (\(m > n\) and full column rank) has trivial kernel, why a wide short matrix has non-trivial kernel, and why the dimensions of the various subspaces always add up correctly.

Why this matters in ML

Linear regression on a dataset of \(N\) points with \(D\) features uses an \(N \times D\) feature matrix. If \(N > D\) and the features are independent, the kernel is trivial and a unique least-squares solution exists. If \(D > N\) (more features than data points, the high-dimensional regime), the kernel is non-trivial, the system is underdetermined, and you need regularisation to pick one solution. PCA finds a low-dimensional image; the rest of the original space gets sent to a particular orthogonal complement that you can ignore. These all hinge on the image-kernel split.

That closes Part I. Linear algebra is the deepest tool in this note and we will use it constantly. The next part adds geometry on top: lengths, angles, projections, rotations.

Test yourself

For a \(4 \times 6\) matrix of rank 3, what is the dimension of the kernel? What is the dimension of the image?

Part II. Analytic geometry

Lengths, angles, and projections.

Linear algebra gives you addition and scaling. Geometry adds three things: how long is a vector, what is the angle between two vectors, and how do you project one vector onto another. The whole of analytic geometry is built from one extra ingredient called the inner product.

11. Norms and the geometry of vectors

A norm is a function that assigns a non-negative length to every vector. It must satisfy three properties: it is zero only for the zero vector, it scales linearly with scaling (\(\|c\mathbf{v}\| = |c|\|\mathbf{v}\|\)), and it satisfies the triangle inequality (\(\|\mathbf{u} + \mathbf{v}\| \le \|\mathbf{u}\| + \|\mathbf{v}\|\)).

The norm you are most familiar with is the Euclidean norm (also called the \(\ell_2\) norm):

\[ \|\mathbf{v}\|_2 = \sqrt{v_1^2 + v_2^2 + \cdots + v_n^2}. \]

It is the straight-line distance from the origin to the point \(\mathbf{v}\). Pythagoras's theorem in \(n\) dimensions.

Other useful norms

The \(\ell_1\) norm sums absolute values: \(\|\mathbf{v}\|_1 = |v_1| + |v_2| + \cdots + |v_n|\). It is the "taxicab" distance: how far you would walk if you could only move along the axes. It plays a starring role in regularisation (LASSO regression) because it encourages sparse solutions: minimising it tends to set entries to exactly zero rather than to small non-zero values.

The \(\ell_\infty\) norm takes the largest absolute value: \(\|\mathbf{v}\|_\infty = \max_i |v_i|\). It is the worst-case coordinate.

Each norm carves out a different "unit ball" (the set of vectors with norm at most 1). The Euclidean unit ball is a disc. The \(\ell_1\) ball is a diamond. The \(\ell_\infty\) ball is a square. The shape of the unit ball is what gives each norm its character.

Manim · 09The L2, L1, and L∞ unit balls. Same set of vectors, three different ways of measuring.

Why norms matter

Norms measure error. When you fit a model and ask "how far is the prediction from the truth," you are computing a norm of the residual. Different norms give different fits. Squared Euclidean error gives least-squares regression, which is sensitive to outliers because it weights large errors quadratically. \(\ell_1\) error gives least absolute deviations regression, which is robust to outliers but harder to compute. The choice of norm encodes the loss function.

Test yourself

For \(\mathbf{v} = (3, 4)^\top\), compute \(\|\mathbf{v}\|_1\), \(\|\mathbf{v}\|_2\), and \(\|\mathbf{v}\|_\infty\). Which norm gives the largest value? Which gives the smallest?

12. Inner products: angles, distances

Norms tell you "how big." The inner product tells you "how aligned." The standard inner product on \(\mathbb{R}^n\) is the dot product:

\[ \langle \mathbf{u}, \mathbf{v} \rangle = \mathbf{u}^\top \mathbf{v} = u_1 v_1 + u_2 v_2 + \cdots + u_n v_n. \]

It produces a single number from two vectors. The geometric interpretation: \(\langle \mathbf{u}, \mathbf{v} \rangle = \|\mathbf{u}\| \|\mathbf{v}\| \cos\theta\), where \(\theta\) is the angle between the two vectors. From this you can immediately read three facts:

  • If \(\langle \mathbf{u}, \mathbf{v} \rangle = 0\) and neither vector is zero, the vectors are orthogonal (perpendicular). The dot product is the cleanest test for perpendicularity.
  • If \(\langle \mathbf{u}, \mathbf{v} \rangle > 0\), they point in roughly the same direction (acute angle). If negative, they point in opposite directions (obtuse angle).
  • The Euclidean norm comes from the inner product: \(\|\mathbf{v}\|_2 = \sqrt{\langle \mathbf{v}, \mathbf{v} \rangle}\). Squared length equals self-inner-product.

The Cauchy-Schwarz inequality

One of the most-used inequalities in mathematics: for any inner product,

\[ |\langle \mathbf{u}, \mathbf{v} \rangle| \le \|\mathbf{u}\| \cdot \|\mathbf{v}\|, \]

with equality exactly when one vector is a scalar multiple of the other. This is a consequence of \(|\cos\theta| \le 1\), but the inequality holds for any inner product, not just the dot product. It is the engine behind the triangle inequality and shows up in proofs of convergence everywhere.

Distance, computed via inner product

The squared distance between two vectors is:

\[ \|\mathbf{u} - \mathbf{v}\|^2 = \langle \mathbf{u} - \mathbf{v}, \mathbf{u} - \mathbf{v} \rangle = \|\mathbf{u}\|^2 - 2\langle \mathbf{u}, \mathbf{v} \rangle + \|\mathbf{v}\|^2. \]

Expanding this way is something you will do many times. It is also why squared distances are often more convenient than distances: they are quadratic in the inputs, which means they are differentiable, which means you can optimise them with calculus.

General inner products

The dot product is one inner product on \(\mathbb{R}^n\), but it is not the only one. A general inner product is any function \(\langle \cdot, \cdot \rangle\) that is bilinear, symmetric, and positive definite (\(\langle \mathbf{v}, \mathbf{v} \rangle \ge 0\) with equality only at \(\mathbf{v} = \mathbf{0}\)). Any symmetric positive-definite matrix \(\mathbf{M}\) defines an inner product by \(\langle \mathbf{u}, \mathbf{v} \rangle_\mathbf{M} = \mathbf{u}^\top \mathbf{M} \mathbf{v}\). When you encounter a "Mahalanobis distance" or a "weighted inner product," it is one of these.

Test yourself

The vectors \(\mathbf{u} = (1, 2)^\top\) and \(\mathbf{v} = (3, -1)^\top\). Compute their inner product. Are they orthogonal? What is the cosine of the angle between them?

13. Orthonormal bases and Gram-Schmidt

Bases come in two flavours that are technically equivalent but practically very different. Any linearly independent spanning set works as a basis. But an orthonormal basis is one where the basis vectors are pairwise orthogonal and each has length 1. The standard basis \(\{\mathbf{e}_1, \ldots, \mathbf{e}_n\}\) is orthonormal. Most other bases are not.

Why orthonormal bases are special: in an orthonormal basis, the coordinates of a vector are just its inner products with the basis vectors. If \(\{\mathbf{q}_1, \ldots, \mathbf{q}_n\}\) is orthonormal, then for any \(\mathbf{v}\):

\[ \mathbf{v} = \sum_{i=1}^n \langle \mathbf{v}, \mathbf{q}_i \rangle \, \mathbf{q}_i. \]

No system of equations to solve, no matrix to invert. Just a dot product per coordinate. Compare this to a general basis, where you have to solve \(\mathbf{B}\mathbf{x} = \mathbf{v}\) to find the coordinates. Orthonormal bases give you coordinates for free.

The Gram-Schmidt procedure

Given any linearly independent set, you can produce an orthonormal basis for the same span using Gram-Schmidt. The algorithm:

  1. Take the first vector and normalise it: \(\mathbf{q}_1 = \mathbf{v}_1 / \|\mathbf{v}_1\|\).
  2. For each subsequent \(\mathbf{v}_k\), subtract off its projections onto the already-found \(\mathbf{q}_i\)'s, then normalise: \(\mathbf{w}_k = \mathbf{v}_k - \sum_{i<k} \langle \mathbf{v}_k, \mathbf{q}_i \rangle \mathbf{q}_i\), then \(\mathbf{q}_k = \mathbf{w}_k / \|\mathbf{w}_k\|\).

The result is a sequence \(\mathbf{q}_1, \mathbf{q}_2, \ldots\) that is orthonormal and spans the same subspace as the original sequence. In numerical practice, the classical Gram-Schmidt is unstable, and the modified Gram-Schmidt or Householder reflections are used instead. Both produce the same theoretical result.

Gram-Schmidt is the engine behind QR decomposition, which factors any matrix \(\mathbf{A} = \mathbf{Q}\mathbf{R}\) where \(\mathbf{Q}\) has orthonormal columns and \(\mathbf{R}\) is upper triangular. QR is the workhorse for solving least-squares problems.

Test yourself

Apply Gram-Schmidt to the vectors \(\mathbf{v}_1 = (1, 1)^\top\) and \(\mathbf{v}_2 = (1, 0)^\top\). What orthonormal basis do you get?

14. Orthogonal projections

An orthogonal projection sends a vector to its closest point in a subspace. Given a subspace \(U \subset \mathbb{R}^n\) and a vector \(\mathbf{v}\), the projection \(\pi_U(\mathbf{v})\) is the unique vector in \(U\) such that the difference \(\mathbf{v} - \pi_U(\mathbf{v})\) is orthogonal to every vector in \(U\).

Projection onto a line

The simplest case: project \(\mathbf{v}\) onto the line spanned by a unit vector \(\mathbf{u}\). The answer is

\[ \pi_{\text{line}}(\mathbf{v}) = \langle \mathbf{v}, \mathbf{u} \rangle \, \mathbf{u}. \]

The inner product picks out the length of the shadow; the direction is the unit vector. This formula is everywhere.

Projection onto a subspace

If \(U\) has an orthonormal basis \(\{\mathbf{q}_1, \ldots, \mathbf{q}_k\}\), the projection sums the line projections:

\[ \pi_U(\mathbf{v}) = \sum_{i=1}^k \langle \mathbf{v}, \mathbf{q}_i \rangle \, \mathbf{q}_i. \]

If you stack the basis vectors as columns of a matrix \(\mathbf{Q}\), this becomes \(\pi_U(\mathbf{v}) = \mathbf{Q}\mathbf{Q}^\top \mathbf{v}\). The matrix \(\mathbf{P} = \mathbf{Q}\mathbf{Q}^\top\) is the projection matrix onto \(U\). It satisfies \(\mathbf{P}^2 = \mathbf{P}\) (projecting twice is the same as projecting once) and \(\mathbf{P} = \mathbf{P}^\top\) (it is symmetric for orthogonal projections).

Projection onto the column space of a general matrix

If \(\mathbf{A}\) is an \(n \times k\) matrix with full column rank (but its columns are not necessarily orthonormal), the projection onto its column space is

\[ \mathbf{P} = \mathbf{A}(\mathbf{A}^\top \mathbf{A})^{-1}\mathbf{A}^\top. \]

This formula will become familiar when we reach linear regression in Part VIII. The least-squares solution is exactly the orthogonal projection of the target onto the column space of the feature matrix.

Why projections matter

Projections are how you take a point and find its closest representative in a constrained set. PCA is a sequence of projections. Linear regression is a projection. The Kalman filter is a projection. Anywhere you have a "closest point in a subspace" problem, the answer is a projection, and the geometry is what you saw above.

Manim · 10Several vectors v projected onto a fixed line. The residual is always perpendicular to the line.
Test yourself

Project the vector \((3, 4)^\top\) onto the line spanned by \((1, 0)^\top\). What is the residual (the part of the vector that is orthogonal to the line)?

15. Rotations

A rotation is a linear map that preserves both lengths and angles. In 2D, every rotation is given by a matrix of the form

\[ \mathbf{R}(\theta) = \begin{pmatrix} \cos\theta & -\sin\theta \\ \sin\theta & \cos\theta \end{pmatrix}, \]

which rotates counterclockwise by angle \(\theta\). The columns of \(\mathbf{R}(\theta)\) are exactly the images of \(\mathbf{e}_1\) and \(\mathbf{e}_2\) after rotation: \((\cos\theta, \sin\theta)^\top\) and \((-\sin\theta, \cos\theta)^\top\). The minus sign in the second column comes from the geometry: rotating \(\mathbf{e}_2\) counterclockwise by \(\theta\) takes it into the second quadrant.

Properties

Rotation matrices are orthogonal: they satisfy \(\mathbf{R}^\top \mathbf{R} = \mathbf{I}\), or equivalently \(\mathbf{R}^{-1} = \mathbf{R}^\top\). The inverse of a rotation is just its transpose, and the transpose is the rotation by \(-\theta\). The determinant of a rotation matrix is exactly 1. (Orthogonal matrices with determinant \(-1\) are reflections.)

Composing rotations adds the angles: \(\mathbf{R}(\alpha) \mathbf{R}(\beta) = \mathbf{R}(\alpha + \beta)\). This is one of the few cases where matrix multiplication is commutative, because rotation by \(\alpha\) and then \(\beta\) is the same as rotating by \(\beta\) then \(\alpha\) when both are around the same axis.

Higher dimensions

In 3D, a rotation is a matrix that rotates around some axis. There are several common parameterisations: Euler angles, axis-angle, rotation matrices directly, or unit quaternions. Each has trade-offs (Euler angles have the gimbal lock problem; quaternions have a sign ambiguity). For ML purposes, the rotation matrix form is enough: an orthogonal matrix with determinant 1.

In \(n\) dimensions, the set of all rotation matrices forms a group called \(SO(n)\) (the "special orthogonal group"). It comes up whenever your data lives on a sphere or you need to rotate a high-dimensional embedding without distortion.

Test yourself

Apply the 90-degree rotation matrix \(\mathbf{R}(\pi/2)\) to the vector \((1, 0)^\top\). What do you get? Does the answer match your geometric expectation?

16. Inner products of functions

Everything in this part has been about \(\mathbb{R}^n\), but the linear-algebra framework extends, almost without modification, to function spaces. The vector space here is the set of functions on some interval, and the inner product is an integral instead of a sum:

\[ \langle f, g \rangle = \int_a^b f(x) g(x) \, dx. \]

This satisfies all the inner product axioms: bilinear, symmetric, and positive definite (assuming we restrict to functions for which the integral exists and is finite). The corresponding norm is

\[ \|f\|_2 = \sqrt{\int_a^b f(x)^2 \, dx}, \]

called the \(L^2\) norm. The angle between two functions, the orthogonality of two functions, and the projection of one function onto another all transfer directly from the vector case. Two functions are orthogonal if their inner product (the integral of their pointwise product) is zero.

Why this matters

Many tools in ML and signal processing rely on functions being treated as vectors in an infinite-dimensional space. Fourier analysis writes a function as a sum of orthogonal sines and cosines: it is a basis expansion in a function space. Wavelets do the same with a different orthonormal basis. Reproducing kernel Hilbert spaces (the formal foundation of kernel methods, including the kernel SVM in Part XI) are inner-product function spaces with an extra structural property. All of these inherit the geometric intuition from \(\mathbb{R}^n\).

You will rarely need to compute these integrals by hand for ML, but you should know that the framework exists and that the same projection-and-norm machinery applies. When you read about a "function space" or an "RKHS" or a "Hilbert space," it is the same set of ideas as Part II, scaled up.

Test yourself

Are the functions \(\sin(x)\) and \(\cos(x)\) orthogonal on the interval \([0, 2\pi]\)? What about \(\sin(x)\) and \(\sin(2x)\)?

Part III. Matrix decompositions

Factoring matrices into useful pieces.

A square matrix is hard to understand at a glance, but every interesting matrix can be factored into a product of matrices that are individually easy. Each factorisation reveals a different structural truth. Eigendecomposition reveals which directions are stretched and which are unchanged. SVD reveals the same idea for any matrix, square or not. The decompositions are how you actually see what a matrix is doing.

17. Determinant and trace

Two single numbers you can compute from a square matrix that summarise it. Both will appear constantly.

The determinant

The determinant of a square matrix \(\mathbf{A}\), written \(\det(\mathbf{A})\) or \(|\mathbf{A}|\), is the signed volume scaling factor of the linear map \(\mathbf{A}\). If you take the unit cube and apply \(\mathbf{A}\) to it, the result is a parallelepiped whose volume is \(|\det(\mathbf{A})|\). The sign is positive if \(\mathbf{A}\) preserves orientation and negative if it flips it.

Three consequences of this definition:

  • \(\det(\mathbf{A}) = 0\) means the unit cube collapses to a lower-dimensional shape. Equivalently: the columns of \(\mathbf{A}\) are linearly dependent, the kernel is non-trivial, and \(\mathbf{A}\) is not invertible. Determinant zero is the signature of a non-invertible matrix.
  • \(\det(\mathbf{A}\mathbf{B}) = \det(\mathbf{A}) \det(\mathbf{B})\). Composition of maps multiplies their volume scalings.
  • \(\det(\mathbf{A}^{-1}) = 1/\det(\mathbf{A})\) when the inverse exists. Undoing a stretch is the reciprocal stretch.

For small matrices the formulas are explicit. The 2x2 determinant is \(\det\begin{pmatrix} a & b \\ c & d \end{pmatrix} = ad - bc\). The 3x3 case has six terms (the Sarrus rule). For general \(n\), the determinant is a sum of \(n!\) terms, which is computationally infeasible past about \(n=4\). In practice you compute \(\det(\mathbf{A})\) via LU decomposition, where it falls out as the product of the diagonal of the upper triangular factor.

The trace

The trace of a square matrix is the sum of its diagonal entries: \(\text{tr}(\mathbf{A}) = A_{11} + A_{22} + \cdots + A_{nn}\). Trace has fewer geometric properties than the determinant, but it has one near-magical algebraic property:

\[ \text{tr}(\mathbf{A}\mathbf{B}) = \text{tr}(\mathbf{B}\mathbf{A}). \]

Even when \(\mathbf{A}\mathbf{B}\) and \(\mathbf{B}\mathbf{A}\) are different matrices (or different shapes), their traces are equal. This trick (called the cyclic property of the trace, since you can also rotate triple products) is used constantly in matrix calculus and in deriving update rules for ML algorithms.

Trace also equals the sum of the eigenvalues (Module 18), and determinant equals their product. Two ways to summarise spectral information without diagonalising.

Test yourself

Compute the determinant and trace of \(\mathbf{A} = \begin{pmatrix} 3 & 1 \\ 2 & 4 \end{pmatrix}\). Is it invertible?

18. Eigenvalues and eigenvectors

An eigenvector of a square matrix \(\mathbf{A}\) is a non-zero vector \(\mathbf{v}\) that does not change direction when \(\mathbf{A}\) acts on it. It might get longer, shorter, or flipped, but it stays on its own line:

\[ \mathbf{A}\mathbf{v} = \lambda \mathbf{v}. \]

The scalar \(\lambda\) is the corresponding eigenvalue: it tells you how much \(\mathbf{v}\) gets stretched. Eigenvalues can be positive (stretches), negative (flips and stretches), zero (the eigenvector lies in the kernel), or even complex (when \(\mathbf{A}\) acts as a rotation in some plane).

Finding them

Rearranging \(\mathbf{A}\mathbf{v} = \lambda\mathbf{v}\) gives \((\mathbf{A} - \lambda \mathbf{I})\mathbf{v} = \mathbf{0}\). For non-zero \(\mathbf{v}\) to exist, \(\mathbf{A} - \lambda\mathbf{I}\) must be singular, which means its determinant is zero:

\[ \det(\mathbf{A} - \lambda \mathbf{I}) = 0. \]

This is the characteristic equation. It is a polynomial in \(\lambda\) of degree \(n\) (for an \(n \times n\) matrix). Its roots are the eigenvalues, counted with multiplicity. The fundamental theorem of algebra says it has exactly \(n\) roots over the complex numbers, though some may be repeated and some may be complex even when the matrix is real.

For each eigenvalue \(\lambda_i\), the corresponding eigenvectors form the kernel of \(\mathbf{A} - \lambda_i \mathbf{I}\). You find them by solving the linear system.

Manim · 11A grid of unit vectors after passing through a 2x2 matrix. Most rotate; two special directions (yellow) stay on their own line. Those are the eigenvectors.

The geometric picture

For a 2x2 matrix in the real plane, the eigenvectors are the directions that the matrix leaves invariant. If you draw a circle of vectors and apply \(\mathbf{A}\) to each, most will rotate to a new direction. The eigenvectors are the rare ones that stay on their own line. The eigenvalues say by how much they got scaled.

Special cases worth remembering

  • Symmetric matrices have real eigenvalues and orthogonal eigenvectors. This is one of the most-used facts in ML.
  • Positive-definite matrices (symmetric with all positive eigenvalues) define inner products, generate ellipses (not just any quadric), and are the matrices you can take a Cholesky decomposition of.
  • Orthogonal matrices (rotations and reflections) have eigenvalues with absolute value 1. Rotations have complex-conjugate pairs of eigenvalues on the unit circle. Reflections have a real eigenvalue of \(-1\) (the flipped direction) and one of \(+1\) (the axis of reflection).
Test yourself

The matrix \(\mathbf{A} = \begin{pmatrix} 2 & 0 \\ 0 & 3 \end{pmatrix}\) has obvious eigenvectors and eigenvalues. What are they? Why does the structure of this matrix make it easy?

19. Diagonalisation

If a square matrix \(\mathbf{A}\) has \(n\) linearly independent eigenvectors, you can write

\[ \mathbf{A} = \mathbf{P}\mathbf{D}\mathbf{P}^{-1}, \]

where \(\mathbf{P}\) is the matrix whose columns are the eigenvectors and \(\mathbf{D}\) is the diagonal matrix of corresponding eigenvalues. This is called diagonalisation, and you should read the formula geometrically.

\(\mathbf{P}^{-1}\): change basis from standard coordinates to the eigenvector basis. \(\mathbf{D}\): in the eigenvector basis, the action of \(\mathbf{A}\) is just stretching each axis independently. \(\mathbf{P}\): change basis back. The complicated map \(\mathbf{A}\) becomes the simple map \(\mathbf{D}\) once you choose the right coordinate system.

Manim · 02The action of a matrix as rotate-stretch-rotate-back. Eigenvectors are the directions where the rotation steps cancel.

When can you diagonalise?

Not every matrix is diagonalisable. The condition is that the matrix has \(n\) linearly independent eigenvectors. Some matrices have repeated eigenvalues and not enough eigenvectors to fill out a full basis (defective matrices). Examples include the shear matrix \(\begin{pmatrix} 1 & 1 \\ 0 & 1 \end{pmatrix}\), which has eigenvalue 1 with only a 1D eigenspace.

A guaranteed-diagonalisable case: symmetric matrices are always diagonalisable, and you can pick the eigenvectors to be orthonormal. The result is the spectral theorem:

\[ \mathbf{A} = \mathbf{Q}\mathbf{\Lambda}\mathbf{Q}^\top, \]

where \(\mathbf{Q}\) is orthogonal (\(\mathbf{Q}^\top = \mathbf{Q}^{-1}\)). Symmetric matrices are everywhere in ML (covariance, Gram matrix, Hessian), and the spectral theorem is what lets us always factor them this way.

Computing matrix powers

One immediate payoff of diagonalisation: \(\mathbf{A}^k = \mathbf{P}\mathbf{D}^k \mathbf{P}^{-1}\), and \(\mathbf{D}^k\) is just the diagonal matrix of \(\lambda_i^k\). Powers of a diagonalisable matrix are computed in \(O(n^3)\) plus \(n\) scalar exponentiations, which is dramatically faster than naively multiplying \(k\) times. This shows up in PageRank, Markov chains, and any iterated linear dynamic.

Test yourself

Why is the matrix \(\begin{pmatrix} 1 & 1 \\ 0 & 1 \end{pmatrix}\) not diagonalisable? Find its eigenvalues and eigenvectors and explain the issue.

20. Cholesky decomposition

For symmetric positive-definite matrices, there is an even sharper factorisation: the Cholesky decomposition.

\[ \mathbf{A} = \mathbf{L}\mathbf{L}^\top, \]

where \(\mathbf{L}\) is lower triangular with positive diagonal. Cholesky exists if and only if \(\mathbf{A}\) is symmetric positive-definite, and when it does, it is unique. The algorithm to compute it is essentially Gaussian elimination tailored to the symmetric case, and it is twice as fast as a general LU decomposition.

Where it shows up

  • Solving \(\mathbf{A}\mathbf{x} = \mathbf{b}\) for symmetric positive-definite \(\mathbf{A}\). Find \(\mathbf{L}\), then solve \(\mathbf{L}\mathbf{y} = \mathbf{b}\) by forward substitution and \(\mathbf{L}^\top \mathbf{x} = \mathbf{y}\) by back substitution. Both triangular solves are \(O(n^2)\) and the factorisation is \(O(n^3 / 3)\).
  • Sampling from a multivariate Gaussian \(\mathcal{N}(\boldsymbol{\mu}, \boldsymbol{\Sigma})\). If \(\boldsymbol{\Sigma} = \mathbf{L}\mathbf{L}^\top\), then \(\boldsymbol{\mu} + \mathbf{L}\mathbf{z}\) where \(\mathbf{z}\) is standard normal has the right covariance. We will revisit this in Part V.
  • Computing log-determinants. \(\log \det \mathbf{A} = 2 \sum_i \log L_{ii}\). This is needed in Gaussian process and Bayesian methods, where you compute log-likelihoods.

Cholesky is the standard tool for "I have a positive-definite matrix and I need to do something with it numerically." If your matrix is barely positive-definite (close to singular), Cholesky gives you the cleanest signal: the algorithm fails (tries to take the square root of a negative number) at the first eigenvalue near zero.

Test yourself

The matrix \(\begin{pmatrix} 4 & 2 \\ 2 & 5 \end{pmatrix}\) is symmetric positive-definite. Find its Cholesky decomposition by hand.

21. Singular value decomposition

Eigendecomposition only works for square matrices, and even then only when there are enough eigenvectors. The singular value decomposition (SVD) generalises the same idea to any matrix.

Every \(m \times n\) matrix \(\mathbf{A}\) (real or complex, square or rectangular, full rank or not) can be factored as

\[ \mathbf{A} = \mathbf{U} \mathbf{\Sigma} \mathbf{V}^\top, \]

where \(\mathbf{U}\) is an \(m \times m\) orthogonal matrix, \(\mathbf{V}\) is an \(n \times n\) orthogonal matrix, and \(\mathbf{\Sigma}\) is an \(m \times n\) "diagonal" matrix (zero except on the main diagonal) whose diagonal entries are non-negative. These diagonal entries are the singular values of \(\mathbf{A}\), conventionally ordered \(\sigma_1 \ge \sigma_2 \ge \cdots \ge 0\).

Geometric picture

Read \(\mathbf{U}\mathbf{\Sigma}\mathbf{V}^\top\) right to left as the action on a vector: \(\mathbf{V}^\top\) rotates the input into a special basis (the right singular vectors). \(\mathbf{\Sigma}\) stretches each axis by the corresponding singular value (and zeros out axes that have no contribution to the output, if \(m > n\) or \(\mathbf{A}\) is not full rank). \(\mathbf{U}\) rotates the result into the output space.

Equivalently: any linear map is "rotate, stretch each axis independently, rotate again." That is the entire content of SVD. Eigendecomposition is the special case where the input rotation and the output rotation are the same.

Manim · 03SVD applied to a random 2x2 matrix: rotate, stretch, rotate, with the singular values shown explicitly.

What the singular values tell you

The number of non-zero singular values is the rank of \(\mathbf{A}\). The largest singular value \(\sigma_1\) is the operator norm: it is the largest factor by which \(\mathbf{A}\) can stretch any unit vector. The ratio \(\sigma_1 / \sigma_n\) (when \(\sigma_n > 0\)) is the condition number, which measures how sensitive \(\mathbf{A}\mathbf{x} = \mathbf{b}\) is to perturbations: a high condition number means small errors in \(\mathbf{b}\) cause large errors in \(\mathbf{x}\).

Relationship to eigendecomposition

The columns of \(\mathbf{V}\) are the eigenvectors of \(\mathbf{A}^\top \mathbf{A}\). The columns of \(\mathbf{U}\) are the eigenvectors of \(\mathbf{A}\mathbf{A}^\top\). The singular values squared are the eigenvalues of either of these symmetric matrices. So if you know how to compute symmetric eigendecomposition, you know how to compute SVD.

Test yourself

If \(\mathbf{A}\) is a \(5 \times 3\) matrix of rank 2, how many singular values does it have? How many of those are zero?

22. Low-rank approximation

The single most important practical use of SVD: compressing matrices.

Write \(\mathbf{A} = \mathbf{U}\mathbf{\Sigma}\mathbf{V}^\top\). Each column of \(\mathbf{U}\) and \(\mathbf{V}\) pairs up with one singular value to give a rank-one matrix \(\sigma_i \mathbf{u}_i \mathbf{v}_i^\top\). The full SVD is the sum of these:

\[ \mathbf{A} = \sum_{i=1}^r \sigma_i \mathbf{u}_i \mathbf{v}_i^\top, \]

where \(r\) is the rank. The truncated SVD keeps only the first \(k\) terms (the \(k\) largest singular values):

\[ \mathbf{A}_k = \sum_{i=1}^k \sigma_i \mathbf{u}_i \mathbf{v}_i^\top. \]

The Eckart-Young theorem says \(\mathbf{A}_k\) is the best rank-\(k\) approximation to \(\mathbf{A}\) in both Frobenius norm and operator norm. There is no other rank-\(k\) matrix closer to \(\mathbf{A}\).

Why this matters

Most real-world matrices are approximately low-rank. An image matrix has highly correlated rows and columns; most of its singular values are tiny. A document-term matrix is dominated by a small number of topics. A user-movie ratings matrix is well-approximated by a few latent factors. In each case, replacing \(\mathbf{A}\) with \(\mathbf{A}_k\) for a modest \(k\) loses very little information but compresses the matrix from \(mn\) numbers down to \(k(m+n+1)\) numbers (the singular values, the columns of \(\mathbf{U}_k\), and the columns of \(\mathbf{V}_k\)).

This is the engine behind PCA (Part IX), latent semantic analysis, matrix factorisation for collaborative filtering, image compression, and a long list of other techniques. They are all variations on "take SVD, keep the top \(k\)."

The "energy" perspective

The squared Frobenius norm of \(\mathbf{A}\) equals the sum of squared singular values: \(\|\mathbf{A}\|_F^2 = \sum_i \sigma_i^2\). The fraction of "energy" captured by the rank-\(k\) approximation is \(\sum_{i \le k} \sigma_i^2 / \sum_i \sigma_i^2\). When this fraction is close to 1 for small \(k\), the matrix is effectively low-rank and compression works well. Plotting singular values in descending order is a quick visual diagnostic for this.

Test yourself

If a \(1000 \times 1000\) matrix has singular values \(100, 95, 80, 5, 4, 3, \ldots\) (sharp drop after the first three), what rank \(k\) would give a good approximation?

23. The phylogeny of matrices

Different matrices admit different decompositions. Knowing which one to reach for is half the work of efficient numerical linear algebra.

Matrix propertyBest decompositionCost
Any matrixSVD: \(\mathbf{A} = \mathbf{U}\mathbf{\Sigma}\mathbf{V}^\top\)\(O(mn \min(m,n))\)
Square, diagonalisableEigendecomp: \(\mathbf{A} = \mathbf{P}\mathbf{D}\mathbf{P}^{-1}\)\(O(n^3)\)
SymmetricSpectral: \(\mathbf{A} = \mathbf{Q}\mathbf{\Lambda}\mathbf{Q}^\top\)\(O(n^3)\)
Symmetric positive-definiteCholesky: \(\mathbf{A} = \mathbf{L}\mathbf{L}^\top\)\(O(n^3 / 3)\)
Tall and full rankQR: \(\mathbf{A} = \mathbf{Q}\mathbf{R}\)\(O(mn^2)\)
Square, genericLU: \(\mathbf{A} = \mathbf{L}\mathbf{U}\)\(O(2n^3 / 3)\)

What each is good for

  • SVD. Universally applicable. Use it when you need rank, when you need to compress, when you need the pseudoinverse, when nothing else fits.
  • Eigendecomposition. Square matrices. Use it for matrix powers, exponentials, principal directions of action.
  • Spectral (symmetric eigendecomp). The special case for symmetric matrices. Use it for covariance matrices, Hessians, kernel matrices.
  • Cholesky. Symmetric positive-definite. Use it for Gaussian sampling, fast solving of linear systems, computing log-determinants.
  • QR. Use it for least-squares problems and Gram-Schmidt.
  • LU. The default for "solve any square system." It is what your numerical library does behind the scenes when you call solve(A, b).

That closes Part III. Decompositions are how you really see what a matrix does, and they are the engine inside every machine learning algorithm that involves a matrix problem (which is most of them). The next part adds calculus on top: how matrix-valued functions change with respect to their inputs.

Test yourself

Given a covariance matrix and asked to sample from the corresponding Gaussian, which decomposition would you reach for? Why?

Part IV. Vector calculus

How a function of many variables changes.

Single-variable calculus tells you how a function of one variable changes. Vector calculus extends that to functions of many variables, and to functions whose outputs are themselves vectors or matrices. Almost every machine learning algorithm trains by computing a gradient and stepping in the direction it points. To understand training, you need to be fluent in the gradient.

24. Differentiation, refresher

The derivative of a single-variable function \(f : \mathbb{R} \to \mathbb{R}\) at a point \(x\) is the limit

\[ f'(x) = \lim_{h \to 0} \frac{f(x + h) - f(x)}{h}. \]

Geometrically: the slope of the tangent line at \(x\). Algorithmically: the local rate of change. The function increases at rate \(f'(x)\) per unit increase in \(x\) near that point.

The derivatives you should be able to compute in your sleep:

  • Power rule. \(\frac{d}{dx} x^n = n x^{n-1}\).
  • Sum rule. \((f + g)' = f' + g'\).
  • Product rule. \((fg)' = f'g + fg'\).
  • Quotient rule. \((f/g)' = (f'g - fg')/g^2\).
  • Chain rule. \(\frac{d}{dx} f(g(x)) = f'(g(x)) \cdot g'(x)\). The most important rule. We will see it again in higher dimensions.

The functions you should know derivatives of by sight: \(e^x \to e^x\), \(\log x \to 1/x\), \(\sin x \to \cos x\), \(\cos x \to -\sin x\), \(\tan x \to \sec^2 x\). Plus the inverse functions: \(\arcsin' x = 1/\sqrt{1-x^2}\), and similar.

Numerical reality

In practice, almost no derivative in machine learning is computed by hand. Frameworks do automatic differentiation: they record the computation graph as you build it, then traverse it backwards to compute gradients. We will look at how this works in Module 29. The hand computation is still useful because it gives you an intuition for which functions are easy to differentiate and which are pathological. Things like absolute values and sharp corners are pathological; smooth functions of standard pieces are easy.

Test yourself

What is the derivative of \(f(x) = \log(1 + e^x)\)? (This is the "softplus" function used in neural networks. Its derivative has a famous form.)

25. Partial derivatives and the gradient

For a function of multiple variables \(f : \mathbb{R}^n \to \mathbb{R}\), there is no single "derivative." Instead, there is one partial derivative for each input variable, holding the others fixed:

\[ \frac{\partial f}{\partial x_i} = \lim_{h \to 0} \frac{f(\ldots, x_i + h, \ldots) - f(\ldots, x_i, \ldots)}{h}. \]

It is the rate of change of \(f\) when you move in the \(i\)-th coordinate direction only.

Stack all the partial derivatives together and you get the gradient:

\[ \nabla f(\mathbf{x}) = \begin{pmatrix} \partial f / \partial x_1 \\ \partial f / \partial x_2 \\ \vdots \\ \partial f / \partial x_n \end{pmatrix}. \]

The gradient is a vector. It lives in the same space as the input \(\mathbf{x}\), not in the output space. The two non-negotiable facts about it:

  1. It points in the direction of steepest ascent. Among all directions you could step in from \(\mathbf{x}\), the gradient direction is the one along which \(f\) increases fastest.
  2. Its length is the rate of that increase. The function increases by \(\|\nabla f\| \cdot \epsilon\) (to first order) per step of size \(\epsilon\) in the gradient direction.

Both facts follow from the directional derivative: the rate of change of \(f\) in direction \(\mathbf{u}\) (a unit vector) is \(\nabla f \cdot \mathbf{u}\). By Cauchy-Schwarz (Module 12), this is maximised when \(\mathbf{u}\) is parallel to \(\nabla f\).

Critical points

A point where \(\nabla f = \mathbf{0}\) is a critical point. At a critical point, the function is locally flat: any direction you move, the function does not change to first order. Critical points come in three flavours: local minima, local maxima, and saddle points. To distinguish them you need second-order information (the Hessian, Module 30).

Optimisation, in the simplest framing, is "find a critical point that is also a minimum." We will spend most of Part VI on the algorithms for doing this.

Test yourself

For \(f(x, y) = x^2 + 3xy + y^2\), compute \(\nabla f\). What is its value at the origin? What does that tell you?

26. The Jacobian

For a function \(\mathbf{f} : \mathbb{R}^n \to \mathbb{R}^m\) (vector input, vector output), the gradient generalises to the Jacobian: an \(m \times n\) matrix of all the partial derivatives.

\[ \mathbf{J}_{\mathbf{f}}(\mathbf{x}) = \begin{pmatrix} \partial f_1 / \partial x_1 & \cdots & \partial f_1 / \partial x_n \\ \vdots & \ddots & \vdots \\ \partial f_m / \partial x_1 & \cdots & \partial f_m / \partial x_n \end{pmatrix}. \]

The Jacobian is the best linear approximation of \(\mathbf{f}\) at the point \(\mathbf{x}\). Near \(\mathbf{x}\), the function behaves approximately as

\[ \mathbf{f}(\mathbf{x} + \mathbf{h}) \approx \mathbf{f}(\mathbf{x}) + \mathbf{J}_{\mathbf{f}}(\mathbf{x}) \mathbf{h}. \]

This is the multidimensional version of the tangent line approximation. The Jacobian matrix \(\mathbf{J}_{\mathbf{f}}(\mathbf{x})\) is the linear map that captures, locally, how the function deforms space.

Special cases worth naming

  • If \(\mathbf{f}\) outputs a scalar (\(m = 1\)), the Jacobian is a \(1 \times n\) row vector. By convention, its transpose is the gradient.
  • If \(\mathbf{f}\) takes a scalar input (\(n = 1\)), the Jacobian is an \(m \times 1\) column vector: just the derivative of each output component.
  • If \(\mathbf{f}\) is itself linear (i.e., \(\mathbf{f}(\mathbf{x}) = \mathbf{A}\mathbf{x}\)), the Jacobian is constantly \(\mathbf{A}\) at every point. Linear functions are their own best linear approximation.

Determinant of the Jacobian

For a square Jacobian (\(m = n\)), the determinant tells you the local volume scaling of the map. This is the change-of-variables formula in multidimensional integration: when you change variables from \(\mathbf{x}\) to \(\mathbf{u} = \mathbf{f}(\mathbf{x})\), the volume element transforms as \(d\mathbf{u} = |\det \mathbf{J}_{\mathbf{f}}| \, d\mathbf{x}\). It is the multidimensional analogue of \(du = f'(x) \, dx\) in single-variable substitution.

This shows up in probability when you push a random variable through a function and want to track how its density changes. We will use it explicitly in Module 41.

Test yourself

For \(\mathbf{f}(x, y) = (x^2 + y, xy)\), compute the Jacobian. What is its determinant?

27. Matrix calculus rules

Matrix calculus is what you actually do in ML: take a derivative of a scalar with respect to a vector or matrix, or a vector with respect to a vector. The rules are the same as ordinary calculus, but the bookkeeping is fiddly because shapes have to align.

Two conventions

For \(f : \mathbb{R}^n \to \mathbb{R}\), there are two ways to organise \(\partial f / \partial \mathbf{x}\). The "numerator layout" makes it a row vector. The "denominator layout" makes it a column vector. Modern ML mostly uses the denominator layout, where the gradient is a column vector matching the shape of \(\mathbf{x}\). This note uses denominator layout throughout.

Useful identities

For vector \(\mathbf{x}\) and matrix \(\mathbf{A}\):

\[ \frac{\partial (\mathbf{a}^\top \mathbf{x})}{\partial \mathbf{x}} = \mathbf{a}, \quad \frac{\partial (\mathbf{x}^\top \mathbf{x})}{\partial \mathbf{x}} = 2\mathbf{x}, \quad \frac{\partial (\mathbf{x}^\top \mathbf{A} \mathbf{x})}{\partial \mathbf{x}} = (\mathbf{A} + \mathbf{A}^\top) \mathbf{x}. \]

The third one collapses to \(2\mathbf{A}\mathbf{x}\) when \(\mathbf{A}\) is symmetric, which it is in essentially every place this expression shows up in ML (least squares, quadratic forms, Hessians).

For matrix \(\mathbf{X}\):

\[ \frac{\partial \, \text{tr}(\mathbf{A}\mathbf{X})}{\partial \mathbf{X}} = \mathbf{A}^\top, \quad \frac{\partial \, \text{tr}(\mathbf{X}^\top \mathbf{A} \mathbf{X})}{\partial \mathbf{X}} = (\mathbf{A} + \mathbf{A}^\top) \mathbf{X}. \]

Where it gets messy

The minute you write a derivative of a matrix-valued function with respect to a matrix, you have a four-index tensor on your hands. You can vectorise to keep things in matrix form, or you can use Einstein summation conventions, or you can just trust your autodiff library. In practice, hand-derivations stop at the level of "scalar with respect to a vector or matrix"; anything more goes through automatic differentiation (Module 29).

The single best resource for matrix calculus identities is Petersen and Pedersen's "The Matrix Cookbook," which is freely available. Bookmark it. The first two pages contain about 80% of what you need.

Test yourself

For the squared-error loss \(L(\mathbf{w}) = \|\mathbf{y} - \mathbf{X}\mathbf{w}\|^2\), where \(\mathbf{y}\) and \(\mathbf{X}\) are fixed, compute \(\partial L / \partial \mathbf{w}\).

28. The chain rule, in full

The single-variable chain rule says \(\frac{d}{dx}f(g(x)) = f'(g(x)) g'(x)\). The multivariate version is the same idea: derivatives compose by multiplication. The bookkeeping just gets fancier.

The general statement

For composed functions \(\mathbf{f} : \mathbb{R}^n \to \mathbb{R}^m\) and \(\mathbf{g} : \mathbb{R}^p \to \mathbb{R}^n\), the Jacobian of \(\mathbf{f} \circ \mathbf{g}\) at a point \(\mathbf{x}\) is the product of Jacobians:

\[ \mathbf{J}_{\mathbf{f} \circ \mathbf{g}}(\mathbf{x}) = \mathbf{J}_{\mathbf{f}}(\mathbf{g}(\mathbf{x})) \cdot \mathbf{J}_{\mathbf{g}}(\mathbf{x}). \]

Read it left to right: first apply the outer function's Jacobian (evaluated at the inner function's output), then the inner function's Jacobian (evaluated at the original input). The dimensions line up: \((m \times n)\) times \((n \times p)\) gives \((m \times p)\), which is the correct shape for the Jacobian of a function from \(\mathbb{R}^p\) to \(\mathbb{R}^m\).

The scalar-output special case

If the final output is a scalar (the loss, in ML), the chain rule reduces to a series of matrix-vector multiplications. For \(L = f(\mathbf{a})\) where \(\mathbf{a} = \mathbf{g}(\mathbf{x})\):

\[ \frac{\partial L}{\partial \mathbf{x}} = \mathbf{J}_{\mathbf{g}}(\mathbf{x})^\top \cdot \frac{\partial L}{\partial \mathbf{a}}. \]

The transpose of the Jacobian acts on the upstream gradient to produce the downstream gradient. This single equation is the engine of backpropagation.

Why this is the entire story of ML training

A neural network is a deeply composed function: \(L = \ell(f_K(f_{K-1}(\cdots f_1(\mathbf{x}))))\), where each \(f_i\) is one layer (a linear map followed by a non-linearity). To compute \(\partial L / \partial \theta_i\) for some parameter in layer \(i\), you apply the chain rule \(K - i + 1\) times. Conceptually, that is the whole gradient computation.

The clever part is the order of evaluation, which we cover next.

Test yourself

For \(L = (\sigma(\mathbf{w}^\top \mathbf{x}) - y)^2\) (squared error of a single neuron), apply the chain rule to compute \(\partial L / \partial \mathbf{w}\).

29. Backpropagation

Suppose your computation graph is a long chain: \(\mathbf{x}_0 \to \mathbf{x}_1 \to \mathbf{x}_2 \to \cdots \to \mathbf{x}_K = L\). The chain rule says \(\partial L / \partial \mathbf{x}_0\) is a product of \(K\) Jacobians. The question is in what order you multiply them.

Two natural orders:

  • Forward mode. Start with the identity at the input, multiply by the first Jacobian, multiply by the second, and so on. After step \(k\) you have \(\partial \mathbf{x}_k / \partial \mathbf{x}_0\). At the end you have the full gradient.
  • Reverse mode. Start with the identity at the output, multiply by the last Jacobian (transposed), then the second-to-last, and so on. After step \(k\) you have \(\partial L / \partial \mathbf{x}_{K-k}\). At the end you have what you want.

Mathematically, both produce the same answer. Computationally, they differ enormously. The cost of multiplying a chain of matrices \(\mathbf{A}_1 \mathbf{A}_2 \cdots \mathbf{A}_K\) depends on the shapes. Forward mode is cheap when the input is low-dimensional and the output is high-dimensional. Reverse mode is cheap when the input is high-dimensional and the output is low-dimensional.

In ML, the loss is a single scalar (output dimension 1) and the parameters are very high-dimensional (millions, sometimes billions). Reverse mode is dramatically faster. Backpropagation is reverse-mode automatic differentiation applied to neural network training. The algorithm is generic and works on any computation graph.

The two-pass structure

To compute the gradient of a loss with respect to all parameters, autodiff frameworks do two passes through the computation graph:

  1. Forward pass. Evaluate the function at the input, recording the value of every intermediate quantity. Output: the loss, plus a saved tape of intermediate values.
  2. Backward pass. Walk the tape in reverse, accumulating gradients. At each node you know the upstream gradient (\(\partial L / \partial \mathbf{x}_k\)) and you compute the local Jacobian and use the chain rule to get the gradient at the next node back.

The total cost is roughly twice the cost of the forward pass, regardless of the number of parameters. This is the deep algorithmic fact that makes deep learning practical.

Test yourself

Why is reverse-mode autodiff (backprop) faster than forward-mode for neural network training? Why is the opposite true for, say, computing all sensitivities of a single-output simulator with respect to its many inputs? (Trick question: it isn't.)

30. The Hessian and second-order info

The gradient is a first-order object: it tells you the slope. To know whether you are at a minimum, a maximum, or a saddle point, you need second derivatives.

The Hessian of a scalar function \(f : \mathbb{R}^n \to \mathbb{R}\) is the \(n \times n\) matrix of second partial derivatives:

\[ \mathbf{H}_f(\mathbf{x})_{ij} = \frac{\partial^2 f}{\partial x_i \, \partial x_j}. \]

Symmetric, by the equality of mixed partials (assuming \(f\) is twice continuously differentiable, which it is for everything you will see in this note).

What the Hessian tells you

At a critical point (\(\nabla f = \mathbf{0}\)), the Hessian classifies it:

  • If \(\mathbf{H}\) is positive definite (all eigenvalues positive), the critical point is a local minimum. The function curves upward in every direction.
  • If negative definite (all eigenvalues negative), local maximum.
  • If the eigenvalues are mixed (some positive, some negative), it is a saddle point. The function curves up in some directions and down in others.
  • If some eigenvalues are zero, the second derivative test is inconclusive; you need higher-order information.

Newton's method

Gradient descent uses only first-order information and takes steps proportional to the gradient. Newton's method uses both: it solves the local quadratic approximation exactly. The update rule is

\[ \mathbf{x}_{k+1} = \mathbf{x}_k - \mathbf{H}_f(\mathbf{x}_k)^{-1} \nabla f(\mathbf{x}_k). \]

It converges quadratically near a minimum (the number of correct digits doubles each step), but each step requires solving an \(n \times n\) linear system, which is \(O(n^3)\). For ML problems with millions of parameters, full Newton is infeasible. Quasi-Newton methods like L-BFGS approximate the Hessian inverse with a few low-rank updates and are widely used for medium-scale problems.

For deep learning, plain stochastic gradient descent and its momentum-based variants are usually preferred over Newton, because computing or even approximating the Hessian is too expensive. We come back to this in Part VI.

Test yourself

For \(f(x, y) = x^2 + 4xy + y^2\), compute the Hessian. Is the origin a minimum, maximum, or saddle point?

31. Multivariate Taylor series

The single-variable Taylor series approximates a function near a point as a polynomial:

\[ f(x + h) = f(x) + f'(x) h + \tfrac{1}{2} f''(x) h^2 + \cdots. \]

The multivariate version uses gradient and Hessian:

\[ f(\mathbf{x} + \mathbf{h}) \approx f(\mathbf{x}) + \nabla f(\mathbf{x})^\top \mathbf{h} + \tfrac{1}{2} \mathbf{h}^\top \mathbf{H}_f(\mathbf{x}) \mathbf{h}. \]

Truncated at first order, it gives the tangent plane (the multidimensional analogue of the tangent line). Truncated at second order, it gives the local quadratic bowl that captures both slope and curvature. Higher-order terms involve tensors of partial derivatives and are rarely worth the trouble.

Where it shows up

  • Newton's method minimises this second-order approximation exactly, then re-approximates at the new point.
  • Local convergence analysis of optimisation algorithms uses Taylor expansions to argue about behaviour near a critical point.
  • Trust region methods use a quadratic Taylor model within a region they "trust" to be accurate, and shrink the region if the actual function disagrees.
  • Linearisation in physics and signal processing is the first-order Taylor approximation: replace the non-linear function by its tangent and analyse a much simpler linear problem.

That closes Part IV. You can now compute gradients, Jacobians, and Hessians, and you understand backpropagation as reverse-mode chain rule. The next part is a parallel chapter on probability, which gives you the language to talk about uncertain data.

Test yourself

Write the second-order Taylor expansion of \(f(x, y) = e^{xy}\) around the origin.

Part V. Probability

The language of uncertain data.

Real data is noisy. Real predictions are uncertain. Real models are wrong in interesting ways. Probability is the formal language for talking about all of that. Once you can write a model as a joint distribution and update it with data, you have the framework that nearly every machine learning algorithm is a special case of.

32. What probability is

A probability is a number between 0 and 1 assigned to an event. The interpretation has been argued for three centuries and there are two camps that you should know about.

Frequentist

"The probability of an event is the long-run fraction of trials in which it occurs." Flip a fair coin many times; the fraction of heads converges to 1/2. Probability is a property of the world, measured by repeating an experiment.

Bayesian

"The probability of an event is your degree of belief in it." There may be no repetition; the event might be a one-off (will it rain tomorrow? Did the defendant commit the crime?). Probability is a property of the observer's knowledge, updated by data via Bayes' rule.

The two interpretations agree in cases where both make sense, and modern ML uses both fluidly without much philosophical anxiety. We will use whichever makes a particular argument easier. Both views obey the same mathematical rules, which is what we care about.

The axioms

Whatever a probability "is," it satisfies three rules. For events on a sample space \(\Omega\):

  1. \(P(A) \ge 0\) for any event \(A\).
  2. \(P(\Omega) = 1\). Something must happen.
  3. If \(A_1, A_2, \ldots\) are pairwise disjoint events, \(P(A_1 \cup A_2 \cup \cdots) = P(A_1) + P(A_2) + \cdots\). The probability of "any of these" is the sum of probabilities, when no two can co-occur.

From these three you derive everything else: \(P(\emptyset) = 0\), \(P(A^c) = 1 - P(A)\), \(P(A \cup B) = P(A) + P(B) - P(A \cap B)\), and so on.

Test yourself

If \(P(A) = 0.6\), \(P(B) = 0.5\), and \(P(A \cap B) = 0.3\), what is \(P(A \cup B)\)? What is \(P(A^c \cap B^c)\)?

33. Discrete random variables

A random variable is a function from the sample space to the real numbers. Roll a die; the random variable might be "the value rolled" (1 through 6) or "1 if even, 0 if odd." Practically, you can ignore the formalism and just think of a random variable as "a number whose value is uncertain."

A discrete random variable takes values in a countable set (often a finite set, like \(\{1, 2, 3, 4, 5, 6\}\) or \(\{0, 1\}\)). It is described by its probability mass function (PMF), which assigns a probability to each possible value:

\[ p(x) = P(X = x), \quad \sum_x p(x) = 1. \]

Common discrete distributions

  • Bernoulli\((\theta)\). Single coin flip with bias \(\theta\). Takes value 1 with probability \(\theta\), 0 with probability \(1 - \theta\). The simplest non-trivial distribution.
  • Binomial\((n, \theta)\). Sum of \(n\) independent Bernoulli\((\theta)\). Counts how many heads in \(n\) flips. PMF: \(p(k) = \binom{n}{k} \theta^k (1-\theta)^{n-k}\).
  • Categorical\((\boldsymbol{\theta})\). One draw from a finite set with probabilities \(\theta_1, \theta_2, \ldots, \theta_K\) (which sum to 1). Bernoulli is the \(K=2\) case.
  • Poisson\((\lambda)\). Counts of rare events in a fixed window: emails per minute, defects per chip, soldiers in 1898 Prussia kicked by horses. PMF: \(p(k) = \lambda^k e^{-\lambda} / k!\) for \(k = 0, 1, 2, \ldots\).
  • Geometric\((\theta)\). Number of failures before the first success in independent Bernoulli\((\theta)\) trials.

Each one is a model of a specific data-generating process. Picking the right distribution for your problem is half the work of probabilistic modelling.

Test yourself

You flip a coin 10 times and want to know the probability of exactly 3 heads. Which distribution describes this, and what is the answer in terms of its PMF?

34. Continuous random variables

A continuous random variable can take values in an uncountable set, typically an interval of real numbers. The probability of any single exact value is zero (the temperature being exactly 23.000000...°C has probability zero). Instead, you describe a continuous random variable by a probability density function (PDF), \(p(x)\), which has the property that

\[ P(a \le X \le b) = \int_a^b p(x) \, dx. \]

The integral over the whole real line equals 1. Densities are not probabilities, they have units (probability per unit \(x\)), and a density value can be larger than 1, but they can be integrated to give probabilities of intervals.

The cumulative distribution function

Easier to interpret than the density: the CDF is

\[ F(x) = P(X \le x) = \int_{-\infty}^x p(t) \, dt. \]

Always between 0 and 1. Always non-decreasing. Approaches 0 at \(-\infty\) and 1 at \(+\infty\). The CDF is well-defined for any real-valued random variable, discrete or continuous, which makes it the most universal description.

Common continuous distributions

  • Uniform\([a, b]\). Constant density on \([a, b]\), zero elsewhere. The simplest continuous distribution.
  • Exponential\((\lambda)\). Density \(p(x) = \lambda e^{-\lambda x}\) for \(x \ge 0\). Models waiting times: how long until the next event in a Poisson process.
  • Gaussian (Normal) \(\mathcal{N}(\mu, \sigma^2)\). The bell curve, density \(p(x) = \frac{1}{\sigma\sqrt{2\pi}} e^{-(x-\mu)^2/(2\sigma^2)}\). Module 39 is dedicated to it; you will see it everywhere.
  • Beta\((\alpha, \beta)\). Distribution on \([0, 1]\), used to model probabilities themselves. Conjugate prior to the Bernoulli.
  • Gamma\((\alpha, \beta)\). Generalisation of the exponential. Used to model positive-valued quantities like waiting times and rates.
Test yourself

For an Exponential\((\lambda)\) random variable, what is the probability that \(X > c\)? Compute it from the density.

35. Joint, marginal, conditional

When you have multiple random variables, three questions arise: what is the joint behaviour, what does each look like by itself, and what does one look like given a value of another?

Joint distribution

The joint distribution of two random variables \(X\) and \(Y\) describes the probability of every pair of values:

\[ p(x, y) = P(X = x, Y = y) \quad \text{(discrete)}, \quad \int\!\int p(x, y) \, dx \, dy = 1 \quad \text{(continuous)}. \]

Marginal distribution

To get the distribution of \(X\) alone (forgetting about \(Y\)), sum or integrate out the other variable:

\[ p(x) = \sum_y p(x, y) \quad \text{or} \quad p(x) = \int p(x, y) \, dy. \]

This is called marginalisation. It is the workhorse operation of probabilistic inference: whenever you have a joint distribution involving variables you do not care about, you marginalise them away.

Conditional distribution

The distribution of \(X\) given that you know \(Y = y\) is

\[ p(x \mid y) = \frac{p(x, y)}{p(y)}. \]

Read it as "the joint divided by the marginal of the variable you are conditioning on." The conditional is itself a distribution: it sums (or integrates) to 1 over \(x\) for any fixed \(y\) where \(p(y) > 0\).

The product rule

Rearranging the conditional formula gives

\[ p(x, y) = p(x \mid y) \, p(y) = p(y \mid x) \, p(x). \]

The joint factors into a marginal and a conditional, in either order. This is the product rule of probability and it is the algebraic engine of Bayes' theorem.

Test yourself

You have a joint distribution \(p(x, y)\) given as a table. How do you read off the marginal \(p(x)\) and the conditional \(p(y \mid x)\)?

36. Bayes' theorem

Combine the product rule with itself: \(p(x, y) = p(x \mid y) p(y) = p(y \mid x) p(x)\). Solve for \(p(x \mid y)\):

\[ p(x \mid y) = \frac{p(y \mid x) \, p(x)}{p(y)}. \]

That is Bayes' theorem. Three lines of algebra; an entire philosophy of inference.

The four pieces

In ML the formula is usually written with \(\theta\) for parameters and \(\mathcal{D}\) for data:

\[ \underbrace{p(\theta \mid \mathcal{D})}_{\text{posterior}} = \frac{\overbrace{p(\mathcal{D} \mid \theta)}^{\text{likelihood}} \, \overbrace{p(\theta)}^{\text{prior}}}{\underbrace{p(\mathcal{D})}_{\text{evidence}}}. \]

The posterior is what you want: the updated belief about parameters after seeing data. The likelihood is your model: how likely is this data, given particular parameters. The prior is your belief before seeing the data. The evidence is the marginal probability of the data; it is the normalising constant that makes the posterior a proper distribution.

The evidence is often the hardest piece to compute. For most non-trivial models it requires an integral over all possible parameter values, which is intractable in closed form. This is why so much of probabilistic ML is about approximating the evidence (variational inference, MCMC, etc.).

The classical example: medical testing

A test for a rare disease has 99% sensitivity (true positive rate) and 99% specificity (true negative rate). The disease has prevalence 0.1% in the population. You test positive. What is the probability you have the disease?

Apply Bayes:

\[ P(D \mid +) = \frac{P(+ \mid D) P(D)}{P(+)} = \frac{0.99 \times 0.001}{0.99 \times 0.001 + 0.01 \times 0.999} \approx 0.09. \]

Nine percent. Even with a 99%-accurate test, the rarity of the disease means most positives are false positives. This counterintuitive answer is the canonical example of why prior matters and why Bayesian reasoning is non-negotiable in medical and forensic settings.

Manim · 12A Beta prior on a coin's bias updating to a posterior as flips arrive one at a time. The distribution sharpens around the truth.
Test yourself

Re-do the medical test computation if the disease has 10% prevalence instead of 0.1%. What is \(P(D \mid +)\) now?

37. Expectation, variance, covariance

Three numerical summaries of a random variable that you should be able to compute in your sleep.

Expectation

The expectation (or mean) of a random variable \(X\) is its weighted average:

\[ \mathbb{E}[X] = \sum_x x \, p(x) \quad \text{or} \quad \mathbb{E}[X] = \int x \, p(x) \, dx. \]

It is the centre of mass of the distribution. It is linear: \(\mathbb{E}[aX + bY] = a \mathbb{E}[X] + b \mathbb{E}[Y]\) for any random variables \(X\) and \(Y\) and any scalars \(a, b\). Linearity of expectation does not require independence; it is one of the cleanest tools in all of probability.

Variance

The variance measures spread:

\[ \text{Var}(X) = \mathbb{E}[(X - \mathbb{E}[X])^2] = \mathbb{E}[X^2] - (\mathbb{E}[X])^2. \]

The square root is the standard deviation, which has the same units as \(X\). Variance is non-linear: \(\text{Var}(aX) = a^2 \text{Var}(X)\), and \(\text{Var}(X + Y) = \text{Var}(X) + \text{Var}(Y) + 2 \text{Cov}(X, Y)\).

Covariance

The covariance of two random variables is

\[ \text{Cov}(X, Y) = \mathbb{E}[(X - \mathbb{E}[X])(Y - \mathbb{E}[Y])] = \mathbb{E}[XY] - \mathbb{E}[X]\mathbb{E}[Y]. \]

It measures how much they move together. Positive covariance means \(X\) and \(Y\) tend to be both above their means together (and both below). Negative covariance means they tend to move oppositely. Zero covariance means they are uncorrelated, but not necessarily independent.

The covariance matrix

For a random vector \(\mathbf{X} = (X_1, \ldots, X_n)^\top\), the covariance matrix \(\boldsymbol{\Sigma}\) has entries \(\Sigma_{ij} = \text{Cov}(X_i, X_j)\). It is symmetric (covariance is symmetric in its arguments) and positive semidefinite (any quadratic form \(\mathbf{a}^\top \boldsymbol{\Sigma} \mathbf{a} = \text{Var}(\mathbf{a}^\top \mathbf{X}) \ge 0\)). The diagonal entries are the variances of each component; the off-diagonal entries are the covariances between pairs.

Covariance matrices are the central object of multivariate statistics. PCA finds their eigendecomposition. Gaussian processes parameterise their covariance with a kernel function. The Mahalanobis distance is a quadratic form using the inverse covariance.

Test yourself

Roll a fair die. What is the expectation? What is the variance?

38. Independence

Two random variables \(X\) and \(Y\) are independent if knowing one tells you nothing about the other. Formally:

\[ p(x, y) = p(x) \, p(y) \quad \text{for all } x, y. \]

Equivalently \(p(x \mid y) = p(x)\), the conditional equals the marginal. Several properties follow:

  • \(\mathbb{E}[XY] = \mathbb{E}[X]\mathbb{E}[Y]\). Expectations of products factor for independent random variables.
  • \(\text{Cov}(X, Y) = 0\). Independent implies uncorrelated.
  • \(\text{Var}(X + Y) = \text{Var}(X) + \text{Var}(Y)\). Variances add.

The reverse implications are not all true. Uncorrelated does not imply independent in general. They are equivalent only for jointly Gaussian random variables, which is one of the things that makes Gaussians so pleasant.

Conditional independence

\(X\) and \(Y\) are conditionally independent given \(Z\) if

\[ p(x, y \mid z) = p(x \mid z) \, p(y \mid z). \]

Knowing \(Z\) makes \(X\) and \(Y\) independent of each other. This is weaker than full independence: \(X\) and \(Y\) might be correlated marginally, but the correlation is "explained" by \(Z\). Conditional independence assumptions are the structural backbone of graphical models (Bayesian networks and Markov random fields), where missing edges between nodes encode conditional independencies.

Why independence matters

Most ML training assumes the data points are iid: independent and identically distributed. This assumption simplifies the math (the likelihood factors into a product over points) and is approximately true for many real datasets. When it fails (time series, sequences with structure), you have to use models that explicitly account for the dependence (HMMs, RNNs, transformers).

Test yourself

Two coins are flipped independently. Let \(X\) be the result of the first, \(Y\) the result of the second, and \(Z = X \oplus Y\) (their XOR). Are \(X\) and \(Y\) independent? Are \(X\) and \(Z\)? Are \(X\) and \(Z\) given \(Y\)?

39. The Gaussian distribution

The most important continuous distribution in machine learning. Two reasons. First, it shows up naturally as the limit of sums (the central limit theorem). Second, it has unusually pleasant algebraic properties: it is closed under linear operations, conditioning, and marginalisation.

Manim · 13The central limit theorem in action. Sums of N uniform random variables, standardised, march toward the bell curve as N grows.

The univariate Gaussian

\[ \mathcal{N}(x \mid \mu, \sigma^2) = \frac{1}{\sigma\sqrt{2\pi}} \exp\!\left( -\frac{(x - \mu)^2}{2\sigma^2} \right). \]

Two parameters: the mean \(\mu\) and the variance \(\sigma^2\). The density is symmetric about \(\mu\), peaks at \(\mu\), and decays exponentially in the squared distance. About 68% of the mass is within one standard deviation, 95% within two, 99.7% within three.

The multivariate Gaussian

For a random vector \(\mathbf{X} \in \mathbb{R}^n\):

\[ \mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}, \boldsymbol{\Sigma}) = \frac{1}{(2\pi)^{n/2} |\boldsymbol{\Sigma}|^{1/2}} \exp\!\left( -\tfrac{1}{2} (\mathbf{x} - \boldsymbol{\mu})^\top \boldsymbol{\Sigma}^{-1} (\mathbf{x} - \boldsymbol{\mu}) \right). \]

Two parameters: a mean vector \(\boldsymbol{\mu} \in \mathbb{R}^n\) and a positive-definite covariance matrix \(\boldsymbol{\Sigma}\). The density's level sets are ellipsoids centered at \(\boldsymbol{\mu}\), with shape and orientation determined by the eigenvectors and eigenvalues of \(\boldsymbol{\Sigma}\).

Manim · 14The level sets of a 2D Gaussian as the covariance changes from isotropic (circles) to anisotropic (axis-aligned ellipses) to correlated (rotated ellipses).

Properties that make Gaussians special

  • Linear transformations preserve Gaussianity. If \(\mathbf{X} \sim \mathcal{N}(\boldsymbol{\mu}, \boldsymbol{\Sigma})\) and \(\mathbf{Y} = \mathbf{A}\mathbf{X} + \mathbf{b}\), then \(\mathbf{Y} \sim \mathcal{N}(\mathbf{A}\boldsymbol{\mu} + \mathbf{b}, \mathbf{A}\boldsymbol{\Sigma}\mathbf{A}^\top)\).
  • Marginals are Gaussian. If you marginalise out some components of a multivariate Gaussian, the result is still Gaussian on the remaining components.
  • Conditionals are Gaussian. Conditioning on a subset of components also gives a Gaussian, and there are explicit formulas for the conditional mean and covariance.
  • Sums are Gaussian. If \(\mathbf{X}\) and \(\mathbf{Y}\) are independent Gaussians, \(\mathbf{X} + \mathbf{Y}\) is Gaussian.

No other distribution has this combination. It is why Gaussians dominate Kalman filters, Gaussian processes, mixture models, Bayesian linear regression, and most of probabilistic ML.

Sampling from a Gaussian

To sample \(\mathbf{X} \sim \mathcal{N}(\boldsymbol{\mu}, \boldsymbol{\Sigma})\): sample \(\mathbf{Z} \sim \mathcal{N}(\mathbf{0}, \mathbf{I})\) (each component an independent standard normal), then set \(\mathbf{X} = \boldsymbol{\mu} + \mathbf{L}\mathbf{Z}\) where \(\mathbf{L}\) is the Cholesky factor of \(\boldsymbol{\Sigma}\) (Module 20). The transformation has the right mean and covariance, and a linear function of a Gaussian is Gaussian.

Test yourself

For a multivariate Gaussian with diagonal covariance \(\boldsymbol{\Sigma} = \text{diag}(\sigma_1^2, \ldots, \sigma_n^2)\), what does the level-set ellipsoid look like? What are the principal axes?

40. Conjugate priors and the exponential family

Bayesian inference asks you to compute the posterior \(p(\theta \mid \mathcal{D}) \propto p(\mathcal{D} \mid \theta) p(\theta)\). For arbitrary priors and likelihoods, this product gives a posterior in some unfamiliar functional form, and computing the normalising constant is hard. Conjugate priors are special prior choices that make the posterior have the same functional form as the prior.

Bernoulli-Beta

The classical example. If your data is Bernoulli with parameter \(\theta\), and you put a Beta\((\alpha, \beta)\) prior on \(\theta\), the posterior is Beta\((\alpha + s, \beta + n - s)\), where \(n\) is the number of trials and \(s\) is the number of successes. The posterior is just a Beta with updated parameters; you do not have to compute any integrals.

Gaussian-Gaussian

If your data is Gaussian with known variance and unknown mean, and you put a Gaussian prior on the mean, the posterior is Gaussian. The Gaussian is conjugate to itself (with respect to the mean parameter). This is what makes Bayesian linear regression analytically tractable (Module 56).

The exponential family

These pleasant pairings are not coincidences. They are special cases of a deeper structure. A distribution is in the exponential family if its density can be written as

\[ p(\mathbf{x} \mid \boldsymbol{\eta}) = h(\mathbf{x}) \exp\!\left( \boldsymbol{\eta}^\top \mathbf{T}(\mathbf{x}) - A(\boldsymbol{\eta}) \right), \]

where \(\boldsymbol{\eta}\) are the natural parameters, \(\mathbf{T}(\mathbf{x})\) are the sufficient statistics, \(h(\mathbf{x})\) is a base measure, and \(A(\boldsymbol{\eta})\) is the log-normaliser. Bernoulli, Binomial, Poisson, Gaussian, Beta, Gamma, Exponential, Categorical, almost every named distribution is in this family.

Every exponential family has a conjugate prior, with parameters that look like "pseudo-counts" of pseudo-observations. The posterior then has the same form, with the parameters incremented by the actual data's sufficient statistics. This unification is the deep reason the conjugate-pair magic works.

You do not have to memorise the formal machinery to use it. What you need to know: when designing a probabilistic model, picking priors that are conjugate to your likelihood gives you closed-form updates and makes everything orders of magnitude easier. When the conjugate prior is unsuitable, you abandon analytic tractability and reach for numerical methods (variational inference, MCMC).

Test yourself

You observe 7 heads in 10 coin flips. With a Beta\((1, 1)\) prior on the bias \(\theta\), what is the posterior?

41. Change of variables

You have a random variable \(X\) with density \(p_X(x)\). You define a new random variable \(Y = f(X)\) for some smooth invertible function \(f\). What is the density of \(Y\)?

For a single variable, the answer is

\[ p_Y(y) = p_X(f^{-1}(y)) \left| \frac{d f^{-1}}{dy}(y) \right|. \]

The density at \(y\) equals the density at the corresponding \(x\) value, multiplied by a Jacobian factor that accounts for how much the transformation stretches or squishes space. If \(f\) compresses a region of \(x\) into a smaller region of \(y\), the density goes up to compensate.

Multidimensional version

For \(\mathbf{Y} = \mathbf{f}(\mathbf{X})\) with \(\mathbf{f}\) invertible, the same formula holds with the absolute determinant of the Jacobian:

\[ p_\mathbf{Y}(\mathbf{y}) = p_\mathbf{X}(\mathbf{f}^{-1}(\mathbf{y})) \left| \det \mathbf{J}_{\mathbf{f}^{-1}}(\mathbf{y}) \right|. \]

Equivalently, using the forward Jacobian: \(p_\mathbf{Y}(\mathbf{y}) = p_\mathbf{X}(\mathbf{x}) / |\det \mathbf{J}_{\mathbf{f}}(\mathbf{x})|\), where \(\mathbf{x} = \mathbf{f}^{-1}(\mathbf{y})\).

Why this matters

  • Reparameterisation tricks in variational inference and normalising flows are all change-of-variable formulas applied to neural-network-shaped \(\mathbf{f}\).
  • Inverse transform sampling: to sample from any distribution, sample uniformly from \([0, 1]\) and apply the inverse CDF. The change-of-variable formula proves that this works.
  • Jacobian determinants in deep learning appear when training generative models that learn invertible transformations.

That closes Part V. You can now talk about random variables, joint distributions, conditional distributions, expectations, and the Gaussian. Part VI puts the calculus from Part IV and the geometry from Parts I-III to work in the form of optimisation algorithms.

Test yourself

If \(X \sim \text{Uniform}[0, 1]\) and \(Y = -\log(X)\), what is the density of \(Y\)?

Part VI. Continuous optimisation

Finding the lowest point on a landscape.

Training a machine learning model is almost always an optimisation problem: find the parameters that minimise a loss. The loss is a function from a parameter space to a real number, and you want the input that produces the smallest output. This part covers how to do that systematically.

42. The optimisation problem

The standard form of a continuous optimisation problem is

\[ \min_{\mathbf{x} \in \mathbb{R}^n} f(\mathbf{x}) \quad \text{subject to} \quad g_i(\mathbf{x}) \le 0, \; h_j(\mathbf{x}) = 0. \]

You want to minimise an objective function \(f\) over some feasible set defined by inequality constraints \(g_i\) and equality constraints \(h_j\). When there are no constraints, the problem is unconstrained, which is the simplest case and the focus of the next two modules.

Minima vs maxima

Maximising \(f\) is the same as minimising \(-f\). All conventions in this part assume minimisation; you flip a sign for the maximisation case.

Local vs global

A local minimum is a point \(\mathbf{x}^*\) where \(f(\mathbf{x}^*) \le f(\mathbf{x})\) for all \(\mathbf{x}\) in some neighbourhood. A global minimum is one where this holds over the entire feasible set. For convex problems (Module 46), every local minimum is global. For non-convex problems (most of deep learning), local minima can be very different from the global minimum, and finding the latter is in general impossible.

In practice, "good local minimum" is the realistic goal for non-convex training. There is empirical and theoretical evidence that for over-parameterised neural networks, most local minima are roughly equally good, which is one reason gradient descent works as well as it does.

First-order optimality

For a smooth unconstrained \(f\), a necessary condition for \(\mathbf{x}^*\) to be a local minimum is

\[ \nabla f(\mathbf{x}^*) = \mathbf{0}. \]

The gradient must vanish. This is necessary but not sufficient, saddle points and maxima also satisfy it. To distinguish, you need second-order information (the Hessian, Module 30).

Sometimes you can solve \(\nabla f = \mathbf{0}\) in closed form. For least-squares regression, you can. For most ML problems, you cannot, and you reach for iterative methods.

Test yourself

For \(f(x) = x^4 - 2x^2\), find all critical points by solving \(f'(x) = 0\). Which are minima, maxima, or saddles?

43. Gradient descent

The single most important optimisation algorithm in ML. The idea is direct: at any point, the gradient points in the direction of steepest ascent. To go down, take a step in the opposite direction.

\[ \mathbf{x}_{k+1} = \mathbf{x}_k - \alpha \nabla f(\mathbf{x}_k). \]

The scalar \(\alpha > 0\) is the learning rate (or step size). Choosing it well is half the work.

Manim · 04aGradient descent on a 2D contour plot. The path follows the negative gradient direction at each step.
Manim · 04bThe same idea in 3D. The ball walks down a smooth bowl; the green trail on the floor is the descent direction in input space, the yellow curve on the surface is the trajectory of the loss.

What makes a good learning rate

Too small: the algorithm crawls; you waste compute and may not reach a minimum in any reasonable time. Too large: the algorithm overshoots and diverges, with iterates spiralling outward forever. Just right: you make steady progress without overshooting.

For convex quadratic functions, the optimal fixed learning rate is \(\alpha = 2 / (\lambda_{\max} + \lambda_{\min})\) where \(\lambda\)'s are eigenvalues of the Hessian. Convergence is at a rate determined by the condition number \(\kappa = \lambda_{\max} / \lambda_{\min}\), high condition number means slow convergence. This is the source of the famous "narrow valley" problem: gradient descent zigzags inefficiently in long narrow minima.

Stochastic gradient descent

For ML problems where the loss is a sum over many data points (which is essentially all of them), computing the full gradient at each step is prohibitively expensive when there are millions of points. Stochastic gradient descent (SGD) replaces the full gradient with a gradient computed on a single random data point (or a small batch):

\[ \mathbf{x}_{k+1} = \mathbf{x}_k - \alpha \nabla f_{i_k}(\mathbf{x}_k), \quad i_k \sim \text{Uniform}(\{1, \ldots, N\}). \]

Each step is much cheaper. The trade-off: each step is noisier, so you take more steps to reach the same accuracy. In practice, SGD with mini-batches (\(B\) points instead of 1) hits the sweet spot for hardware: enough work to amortise overhead, small enough to keep variance manageable.

Why SGD often beats GD

Counterintuitively, the noise in SGD can help. It acts as a form of regularisation, preventing the algorithm from settling into sharp local minima that generalise poorly. SGD also gets unstuck from saddle points (which are abundant in high-dimensional non-convex losses) more easily than full-batch GD. Both effects are why SGD remains the dominant training algorithm for neural networks despite seeming "worse" on paper.

Test yourself

What happens to gradient descent on a 1D function \(f(x) = x^2\) if you set \(\alpha = 1.5\)? Trace a few iterates by hand.

44. Step-size schedules and momentum

Plain SGD with a fixed learning rate is rarely the best you can do. Two refinements show up in nearly every modern optimiser.

Decaying step size

You start with a large learning rate to make progress fast, then shrink it over time so the algorithm settles. Common schedules:

  • Step decay. Multiply \(\alpha\) by 0.1 every \(K\) epochs.
  • Polynomial decay. \(\alpha_k = \alpha_0 (1 + k/T)^{-p}\).
  • Cosine annealing. \(\alpha\) follows a cosine curve from \(\alpha_0\) to 0 over training. Surprisingly effective.
  • Warm restart. Periodically reset \(\alpha\) to its initial value and decay again. Helps escape local minima.

Momentum

Plain gradient descent zigzags in narrow valleys because the gradient direction can flip from step to step. Momentum averages over recent gradients to smooth this out:

\[ \mathbf{v}_{k+1} = \beta \mathbf{v}_k + \nabla f(\mathbf{x}_k), \quad \mathbf{x}_{k+1} = \mathbf{x}_k - \alpha \mathbf{v}_{k+1}. \]

The "velocity" \(\mathbf{v}\) accumulates an exponentially weighted average of past gradients. The hyperparameter \(\beta \in [0, 1)\) controls how much memory it has; \(\beta = 0.9\) is typical. Physically, you can think of the parameter as a heavy ball rolling down the loss surface: it builds up speed in the average gradient direction and is less perturbed by individual noisy gradients.

Adam

The current default optimiser for almost everything: Adam combines momentum with a per-parameter scaling that adapts based on the magnitude of recent gradients. The update is

\[ \mathbf{m}_k = \beta_1 \mathbf{m}_{k-1} + (1 - \beta_1) \nabla f, \quad \mathbf{v}_k = \beta_2 \mathbf{v}_{k-1} + (1 - \beta_2) (\nabla f)^2, \]
\[ \mathbf{x}_{k+1} = \mathbf{x}_k - \alpha \frac{\mathbf{m}_k}{\sqrt{\mathbf{v}_k} + \epsilon}. \]

(Plus a bias correction in the early iterations.) The first moment \(\mathbf{m}\) is momentum. The second moment \(\mathbf{v}\) tracks the variance of recent gradients per parameter. Dividing by \(\sqrt{\mathbf{v}}\) gives parameters with consistently large gradients smaller updates, and parameters with small gradients larger updates. The end effect: Adam works well with relatively little tuning across a wide range of problems, which is why it dominates.

Variants worth knowing exist (RMSProp, AdamW, AdaGrad, NAdam) but the core ideas are momentum + per-parameter scaling.

Test yourself

Why does momentum help in narrow valleys? Sketch the trajectory of plain GD vs GD-with-momentum on an elongated quadratic.

45. Constrained optimisation

What if you have constraints? You want to minimise \(f(\mathbf{x})\) subject to \(g_i(\mathbf{x}) \le 0\) and \(h_j(\mathbf{x}) = 0\). Naive gradient descent can violate the constraints.

Equality constraints and Lagrange multipliers

For pure equality constraints, the classical method is Lagrange multipliers. Form the Lagrangian:

\[ \mathcal{L}(\mathbf{x}, \boldsymbol{\lambda}) = f(\mathbf{x}) + \sum_j \lambda_j h_j(\mathbf{x}). \]

The optimum lies at a stationary point of \(\mathcal{L}\) where \(\nabla_\mathbf{x} \mathcal{L} = \mathbf{0}\) and \(\nabla_{\boldsymbol{\lambda}} \mathcal{L} = \mathbf{0}\) (the latter just restates \(h_j = 0\)). Geometrically: at an optimum, the gradient of \(f\) is a linear combination of the gradients of the active constraints. The constraints "pull" the optimum off the unconstrained gradient direction.

Inequality constraints and KKT

For inequality constraints, you build a Lagrangian with a multiplier per constraint, but the conditions are richer. They are the Karush-Kuhn-Tucker (KKT) conditions, covered in Module 47.

Practical approaches

For modern ML, three workhorses:

  • Penalties. Add a penalty term to the objective that grows when constraints are violated. \(f(\mathbf{x}) + \mu \sum_j h_j(\mathbf{x})^2\). Drive \(\mu \to \infty\) and you recover the constrained problem. Easy to implement; can be ill-conditioned at high \(\mu\).
  • Projected gradient descent. After each gradient step, project the result back onto the feasible set. Works when the projection is cheap (e.g., box constraints, simplex constraints).
  • Augmented Lagrangian / interior point methods. More sophisticated; the basis of solvers in convex optimisation libraries.

For ML specifically, most problems are unconstrained or have box constraints (e.g., a probability simplex), and these tools are sufficient.

Manim · 15Two functions, two fates. On the convex one (yellow) any descent reaches the global minimum. On the non-convex one (blue) the same algorithm gets stuck in a local minimum.
Test yourself

Minimise \(f(x, y) = x^2 + y^2\) subject to \(x + y = 1\) using Lagrange multipliers.

46. Convex sets and convex functions

Convexity is the property that distinguishes "easy" optimisation problems from "hard" ones.

Convex sets

A set \(C\) is convex if for every two points \(\mathbf{x}, \mathbf{y} \in C\) and every \(\theta \in [0, 1]\), the point \(\theta \mathbf{x} + (1-\theta) \mathbf{y}\) is also in \(C\). The line segment between any two points stays inside.

Examples: \(\mathbb{R}^n\) itself. Half-spaces \(\{\mathbf{x} : \mathbf{a}^\top \mathbf{x} \le b\}\). Balls. Ellipsoids. Polyhedra. The intersection of convex sets is convex (this is how complex feasible sets get built up). The union of two convex sets is generally not convex.

Convex functions

A function \(f\) is convex if its domain is convex and

\[ f(\theta \mathbf{x} + (1-\theta)\mathbf{y}) \le \theta f(\mathbf{x}) + (1-\theta) f(\mathbf{y}) \]

for all \(\mathbf{x}, \mathbf{y}\) in the domain and \(\theta \in [0, 1]\). Geometrically: the chord between any two points on the graph lies above the graph itself. The function curves upward (or stays linear).

Equivalent characterisations for twice-differentiable \(f\):

  • The Hessian is positive semidefinite everywhere.
  • The graph lies above every tangent plane: \(f(\mathbf{y}) \ge f(\mathbf{x}) + \nabla f(\mathbf{x})^\top (\mathbf{y} - \mathbf{x})\) for all \(\mathbf{x}, \mathbf{y}\).

Why convexity matters

For a convex function over a convex set, two beautiful facts hold:

  1. Every local minimum is a global minimum. No need to worry about getting stuck.
  2. The set of global minima is itself convex. If there are multiple global minima, the line segment between any two of them also consists of global minima.

Algorithms have provable convergence guarantees on convex problems. Linear programming, quadratic programming, convex SVMs, logistic regression, all are convex and all admit efficient, reliable algorithms.

Most deep learning losses are not convex, and that is why theory there is much harder. But many sub-problems within ML pipelines are convex (e.g., per-layer least-squares solves), and those benefit from the full power of convex optimisation theory.

Test yourself

Is \(f(x) = e^x\) convex? Is \(f(x) = -\log(x)\) on \(x > 0\) convex? Is \(f(x) = x^4 - x^2\) convex?

47. KKT conditions

For a constrained optimisation problem with both inequality constraints \(g_i(\mathbf{x}) \le 0\) and equality constraints \(h_j(\mathbf{x}) = 0\), form the Lagrangian

\[ \mathcal{L}(\mathbf{x}, \boldsymbol{\mu}, \boldsymbol{\lambda}) = f(\mathbf{x}) + \sum_i \mu_i g_i(\mathbf{x}) + \sum_j \lambda_j h_j(\mathbf{x}). \]

The KKT conditions are necessary conditions for \(\mathbf{x}^*\) to be a local minimum (under mild constraint qualifications). They are:

  1. Stationarity. \(\nabla_\mathbf{x} \mathcal{L}(\mathbf{x}^*, \boldsymbol{\mu}^*, \boldsymbol{\lambda}^*) = \mathbf{0}\).
  2. Primal feasibility. \(g_i(\mathbf{x}^*) \le 0\) and \(h_j(\mathbf{x}^*) = 0\) for all \(i, j\).
  3. Dual feasibility. \(\mu_i^* \ge 0\) for all \(i\). (Equality multipliers \(\lambda_j\) can be any sign.)
  4. Complementary slackness. \(\mu_i^* g_i(\mathbf{x}^*) = 0\) for all \(i\). Either the constraint is tight (\(g_i = 0\)) or the multiplier is zero.

Complementary slackness is the key insight. It says that an inactive constraint (one that is not tight at the optimum) contributes nothing to the optimum. The only constraints that "matter" are the active ones, and their multipliers measure how much they are pulling.

Why KKT shows up in SVMs

In support vector machines (Part XI), the optimum is characterised by KKT conditions. Most data points have zero multiplier (they are away from the margin and their constraints are inactive). Only the points right on the margin, the support vectors, have non-zero multipliers and contribute to the decision boundary. This is the entire reason the algorithm is called "support vector" machine, and the structural insight comes from KKT.

Sufficient conditions

For convex problems, KKT is also sufficient: any point satisfying KKT is a global minimum. For non-convex problems, KKT is only necessary; you also need to check second-order conditions to confirm a minimum.

That closes Part VI. You now have the optimisation toolbox that almost every ML training algorithm sits on top of. The next part connects the dots: how do you go from "I have data and a question" to "I have a model and an optimisation problem to solve."

Test yourself

For the constrained problem "minimise \(x^2 + y^2\) subject to \(x + y \ge 1\)," write down the KKT conditions and solve them.

Part VII. When models meet data

From assumption to answer.

A machine learning algorithm is a procedure that takes a dataset and produces a model. The bridge between the two is a probabilistic story about how the data was generated, plus a principle for picking the parameters that best explain it. Maximum likelihood, maximum a posteriori, and full Bayesian inference are three increasingly cautious answers to the same question.

48. Models, data, parameters

A model is a parametric family of probability distributions. Each setting of the parameters \(\boldsymbol{\theta}\) gives one specific distribution \(p(\mathbf{x} \mid \boldsymbol{\theta})\). The job of training is to find the \(\boldsymbol{\theta}\) that best fits the data.

Examples to keep in mind:

  • A Gaussian model with unknown mean and variance. Parameters: \(\boldsymbol{\theta} = (\mu, \sigma^2)\). Each setting gives a different bell curve.
  • A linear regression model. Parameters: weights \(\mathbf{w}\) and noise variance \(\sigma^2\). Each setting gives a different line plus Gaussian noise.
  • A neural network. Parameters: all the weights and biases of all the layers, often millions of numbers. Each setting gives a different function from inputs to outputs.

Once you have a parametric family and a dataset \(\mathcal{D} = \{\mathbf{x}_1, \ldots, \mathbf{x}_N\}\), three principles give you three different ways to pick the parameters.

iid assumption

Almost universally, you assume the data points are iid: drawn independently from the same distribution. Under this assumption, the joint likelihood factors:

\[ p(\mathcal{D} \mid \boldsymbol{\theta}) = \prod_{n=1}^N p(\mathbf{x}_n \mid \boldsymbol{\theta}). \]

Taking a log turns the product into a sum, which is much easier to differentiate and optimise:

\[ \log p(\mathcal{D} \mid \boldsymbol{\theta}) = \sum_{n=1}^N \log p(\mathbf{x}_n \mid \boldsymbol{\theta}). \]

This is the log-likelihood and it is the function you will spend most of ML maximising or minimising the negative of.

Test yourself

Why is it almost always more convenient to work with the log of the likelihood than the likelihood itself?

49. Maximum likelihood

The simplest principle for choosing parameters: pick the \(\boldsymbol{\theta}\) that makes the observed data most likely.

\[ \boldsymbol{\theta}_{\text{MLE}} = \arg\max_{\boldsymbol{\theta}} \, p(\mathcal{D} \mid \boldsymbol{\theta}) = \arg\max_{\boldsymbol{\theta}} \, \sum_{n} \log p(\mathbf{x}_n \mid \boldsymbol{\theta}). \]

This is maximum likelihood estimation (MLE). Pick the parameters under which the data is least surprising. It is what almost every ML training procedure does, even when not framed this way explicitly.

Examples

MLE for a coin's bias. Observe \(s\) heads in \(n\) flips. Likelihood: \(\theta^s (1-\theta)^{n-s}\). Take log, differentiate, set to zero, solve. Answer: \(\theta_{\text{MLE}} = s/n\). The empirical fraction. The MLE is the obvious thing.

MLE for the mean and variance of a Gaussian. Likelihood: product of \(\mathcal{N}(x_n \mid \mu, \sigma^2)\). Maximise. Answers: \(\mu_{\text{MLE}}\) is the sample mean, \(\sigma^2_{\text{MLE}}\) is the sample variance (using \(N\) in the denominator, not \(N-1\)).

MLE for linear regression with Gaussian noise. Maximising the likelihood is exactly minimising the squared-error loss. The MLE solution is the least-squares solution. We will work this out fully in Part VIII.

Why MLE is the default

Three properties make MLE attractive:

  • Consistency. Under standard conditions, as \(N \to \infty\), \(\boldsymbol{\theta}_{\text{MLE}}\) converges to the true parameter (assuming the model is correctly specified).
  • Asymptotic efficiency. Among all consistent estimators, MLE has the smallest variance for large \(N\). It is "as good as it gets" in the limit.
  • Composability. The MLE for a complex model often decomposes into MLEs for sub-models, which makes it tractable to derive.

Where MLE fails

For small datasets, MLE can severely overfit. With one coin flip showing heads, the MLE bias is 1.0: the coin is certain to come up heads forever. This is obviously absurd, and the next module fixes it by adding a prior.

Test yourself

Derive the MLE for the rate parameter \(\lambda\) of a Poisson distribution given counts \(x_1, \ldots, x_N\).

50. Maximum a posteriori

Bayes' theorem says you can combine prior beliefs with the likelihood to get a posterior. The simplest Bayesian-flavoured estimator picks the mode of the posterior:

\[ \boldsymbol{\theta}_{\text{MAP}} = \arg\max_{\boldsymbol{\theta}} \, p(\boldsymbol{\theta} \mid \mathcal{D}) = \arg\max_{\boldsymbol{\theta}} \, p(\mathcal{D} \mid \boldsymbol{\theta}) \, p(\boldsymbol{\theta}). \]

The evidence \(p(\mathcal{D})\) is constant in \(\boldsymbol{\theta}\), so it does not affect the argmax. Maximising the posterior is maximising likelihood times prior, or equivalently maximising log-likelihood plus log-prior.

MAP as regularised MLE

The deep insight: MAP is just MLE with a regularisation term. The log of the prior contributes a penalty on extreme parameter values, pulling the estimate toward whatever the prior considers plausible.

For example: if you put a Gaussian prior \(\boldsymbol{\theta} \sim \mathcal{N}(\mathbf{0}, \tau^2 \mathbf{I})\) on the parameters of any model, the log-prior is \(-\|\boldsymbol{\theta}\|^2 / (2\tau^2)\) plus a constant. MAP becomes

\[ \boldsymbol{\theta}_{\text{MAP}} = \arg\max_{\boldsymbol{\theta}} \, \log p(\mathcal{D} \mid \boldsymbol{\theta}) - \tfrac{1}{2\tau^2} \|\boldsymbol{\theta}\|^2. \]

That is exactly L2 regularisation. Ridge regression is MAP linear regression with a Gaussian prior. Weight decay in neural networks is MAP under a Gaussian prior. The "regularisation strength" is set by the prior variance \(\tau^2\): tighter prior means stronger regularisation.

Similarly, an L1 (Laplace) prior on the parameters yields LASSO regression and sparsity. Different priors correspond to different regularisers, and the regulariser can always be read off as the negative log-prior.

Limitations

MAP gives a single point estimate. It tells you the most likely parameter, but not how uncertain you are. For a multi-modal or skewed posterior, the MAP point may not be representative of where the mass is. For full uncertainty quantification you need the next module.

Test yourself

For the coin-flipping problem with a Beta\((\alpha, \beta)\) prior, derive the MAP estimate of \(\theta\). How does it compare to the MLE \(s/n\)?

51. Bayesian inference, the right way

MLE gives a single point. MAP gives a single point with a prior. Full Bayesian inference gives the entire posterior distribution \(p(\boldsymbol{\theta} \mid \mathcal{D})\) and uses it for everything.

Predictions

For a new input \(\mathbf{x}^*\), the predictive distribution is

\[ p(y^* \mid \mathbf{x}^*, \mathcal{D}) = \int p(y^* \mid \mathbf{x}^*, \boldsymbol{\theta}) \, p(\boldsymbol{\theta} \mid \mathcal{D}) \, d\boldsymbol{\theta}. \]

You weight each parameter setting by its posterior probability and average the predictions. This is conceptually clean: the answer accounts for parameter uncertainty automatically. A model that is unsure about its parameters will have wider predictive intervals.

Why we do not always do this

The integral above is intractable for almost every interesting model. Two routes:

  • Conjugate priors. When you can choose a prior that is conjugate to the likelihood (Module 40), the posterior is in the same family as the prior, and the predictive integral often has a closed form. Bayesian linear regression (Module 56) is the textbook example.
  • Approximate inference. When closed forms are unavailable, you approximate. Two main families: Markov chain Monte Carlo (MCMC) draws samples from the posterior; variational inference fits a tractable distribution to approximate the posterior. Both are full subjects in their own right.

The point estimates revisited

MLE is what you get if you take the limit of MAP as the prior becomes uninformative (very wide). MAP is the mode of the posterior. The full Bayesian predictive is the average over the posterior. As the dataset grows, all three converge to the same answer (under standard regularity conditions). For small data, they can differ substantially, and the full Bayesian answer is generally the right one if you can afford to compute it.

That closes Part VII. The framework is now in place: data, model, principle. The next four parts each apply this framework to one specific problem: regression, dimensionality reduction, density estimation, and classification.

Test yourself

Why does the Bayesian predictive distribution typically have wider tails than the MAP-based plug-in prediction? When is this a feature, not a bug?

Part VIII. Linear regression

The simplest predictor.

Linear regression is the model where every other ML algorithm should be compared. It is fast, has a closed-form solution, has a clean Bayesian extension, and fits real problems surprisingly often. Understanding it deeply is more useful than learning twenty fancier models superficially.

52. Setting up the problem

You have \(N\) data points, each consisting of an input vector \(\mathbf{x}_n \in \mathbb{R}^D\) and a real-valued target \(y_n\). You want a function \(f(\mathbf{x})\) that predicts \(y\) from \(\mathbf{x}\). Linear regression chooses \(f\) to be a linear function of the inputs:

\[ f(\mathbf{x}) = \mathbf{w}^\top \mathbf{x} + b. \]

The weight vector \(\mathbf{w} \in \mathbb{R}^D\) and the bias \(b \in \mathbb{R}\) are the parameters. To absorb the bias into the weight vector, you append a constant 1 to every input: \(\tilde{\mathbf{x}} = (\mathbf{x}^\top, 1)^\top\). Then \(f(\mathbf{x}) = \mathbf{w}^\top \tilde{\mathbf{x}}\) with the bias as the last entry of \(\mathbf{w}\). For the rest of this part, we will work with this form and drop the tilde.

Stacking the data

Pack the inputs into the rows of a design matrix \(\mathbf{X} \in \mathbb{R}^{N \times D}\) and the targets into a vector \(\mathbf{y} \in \mathbb{R}^N\). The vector of predicted outputs is

\[ \hat{\mathbf{y}} = \mathbf{X}\mathbf{w}. \]

The residuals are \(\mathbf{r} = \mathbf{y} - \hat{\mathbf{y}}\). The squared error loss is

\[ L(\mathbf{w}) = \|\mathbf{y} - \mathbf{X}\mathbf{w}\|^2 = \sum_{n=1}^N (y_n - \mathbf{w}^\top \mathbf{x}_n)^2. \]

Linear regression in its simplest form is: pick \(\mathbf{w}\) to minimise this loss. The minimum is called the least squares estimate.

Why linear?

Linear models have three things going for them. First, the optimisation has a closed-form solution (no iteration needed). Second, they admit a clean probabilistic interpretation as Gaussian likelihood. Third, they are the building block of nearly every more sophisticated model: a neural network is essentially many small linear regressions stacked with non-linearities between them.

Test yourself

If you have 100 data points each with 5 features, what is the shape of the design matrix \(\mathbf{X}\)? What about the weight vector \(\mathbf{w}\) (with the bias absorbed)?

53. Least squares as MLE

The squared error loss is not chosen arbitrarily. It comes from a probabilistic story.

Assume the targets are generated by

\[ y_n = \mathbf{w}^\top \mathbf{x}_n + \epsilon_n, \quad \epsilon_n \sim \mathcal{N}(0, \sigma^2) \text{ iid}. \]

The targets are linear in the inputs plus Gaussian noise. The likelihood of one data point is

\[ p(y_n \mid \mathbf{x}_n, \mathbf{w}, \sigma^2) = \mathcal{N}(y_n \mid \mathbf{w}^\top \mathbf{x}_n, \sigma^2). \]

Take the log of the iid joint likelihood:

\[ \log p(\mathbf{y} \mid \mathbf{X}, \mathbf{w}, \sigma^2) = -\frac{N}{2} \log(2\pi\sigma^2) - \frac{1}{2\sigma^2} \sum_{n=1}^N (y_n - \mathbf{w}^\top \mathbf{x}_n)^2. \]

Maximising this with respect to \(\mathbf{w}\) is equivalent to minimising the squared error sum, since everything else is constant in \(\mathbf{w}\). So least squares is maximum likelihood under Gaussian noise. The squared error loss is not a free parameter; it is the consequence of the Gaussian assumption.

This connects Part VII directly to Part VIII. If your noise model is something other than Gaussian, the appropriate loss is something other than squared error. Laplace noise gives \(\ell_1\) loss (least absolute deviations). Heavy-tailed noise (e.g., Student's t) gives a robust loss that down-weights outliers. The loss is not arbitrary; it encodes a noise assumption.

Test yourself

If you assume the noise is Laplace distributed (heavy-tailed) instead of Gaussian, what loss function do you minimise? Why does it correspond to a different fit when the data has outliers?

54. The normal equations

Minimise \(L(\mathbf{w}) = \|\mathbf{y} - \mathbf{X}\mathbf{w}\|^2\) by setting the gradient to zero.

\[ \nabla L(\mathbf{w}) = -2 \mathbf{X}^\top (\mathbf{y} - \mathbf{X}\mathbf{w}) = \mathbf{0}. \]

Rearrange to get

\[ \mathbf{X}^\top \mathbf{X} \, \mathbf{w} = \mathbf{X}^\top \mathbf{y}. \]

These are the normal equations. When \(\mathbf{X}^\top \mathbf{X}\) is invertible (which happens when \(\mathbf{X}\) has full column rank, equivalently when the features are not perfectly collinear and \(N \ge D\)), the solution is

\[ \mathbf{w}^* = (\mathbf{X}^\top \mathbf{X})^{-1} \mathbf{X}^\top \mathbf{y}. \]

Closed form. No iteration, no learning rate. One linear-algebra computation gives you the optimum.

Numerical caution

You almost never compute \((\mathbf{X}^\top \mathbf{X})^{-1}\) explicitly. The matrix \(\mathbf{X}^\top \mathbf{X}\) has condition number equal to the squared condition number of \(\mathbf{X}\), which can be very large and lose precision. The numerically stable approach is to solve the normal equations using QR decomposition of \(\mathbf{X}\) or, even better, SVD.

If \(\mathbf{X} = \mathbf{Q}\mathbf{R}\) (QR decomposition), the solution is \(\mathbf{w}^* = \mathbf{R}^{-1} \mathbf{Q}^\top \mathbf{y}\), computed by triangular back-substitution. If \(\mathbf{X} = \mathbf{U}\mathbf{\Sigma}\mathbf{V}^\top\) (SVD), the solution is \(\mathbf{w}^* = \mathbf{V}\mathbf{\Sigma}^+ \mathbf{U}^\top \mathbf{y}\), where \(\mathbf{\Sigma}^+\) is the pseudoinverse (transpose with non-zero singular values inverted). The SVD form also handles the rank-deficient case, returning the minimum-norm solution.

When N is much smaller than D

If you have more features than data points (\(D > N\)), the matrix \(\mathbf{X}^\top \mathbf{X}\) is rank-deficient and the normal equations have infinitely many solutions. This is the under-determined regime, which is where modern ML lives (over-parameterised neural networks). The cure is regularisation, which is the next module.

Test yourself

The cost of solving the normal equations exactly is \(O(D^2 N + D^3)\). For very large \(D\), this is impractical. Suggest an alternative algorithm.

55. Regularisation

For ill-posed problems (more features than data, or near-collinear features), the unregularised solution is unstable. Tiny changes in the data produce huge changes in \(\mathbf{w}\). The fix: add a penalty that discourages large weights.

Ridge regression (L2)

Add a squared-norm penalty:

\[ L(\mathbf{w}) = \|\mathbf{y} - \mathbf{X}\mathbf{w}\|^2 + \lambda \|\mathbf{w}\|^2. \]

The hyperparameter \(\lambda > 0\) controls the strength. The solution is

\[ \mathbf{w}^*_{\text{ridge}} = (\mathbf{X}^\top \mathbf{X} + \lambda \mathbf{I})^{-1} \mathbf{X}^\top \mathbf{y}. \]

Adding \(\lambda \mathbf{I}\) to \(\mathbf{X}^\top \mathbf{X}\) makes the matrix invertible even when the original is rank-deficient. It also shrinks the weights toward zero, which reduces variance at the cost of introducing some bias. As \(\lambda \to 0\) you recover ordinary least squares; as \(\lambda \to \infty\) the weights go to zero.

Recall from Module 50: ridge regression is exactly MAP linear regression with a zero-mean Gaussian prior on the weights. The prior variance \(\tau^2\) and the regularisation strength relate as \(\lambda = \sigma^2 / \tau^2\).

LASSO (L1)

Replace the squared norm with an L1 norm:

\[ L(\mathbf{w}) = \|\mathbf{y} - \mathbf{X}\mathbf{w}\|^2 + \lambda \|\mathbf{w}\|_1. \]

The L1 penalty has a famous property: it produces sparse solutions. Many weights are pushed to exactly zero, which acts as automatic feature selection. The solution has no closed form but is found by efficient iterative algorithms (coordinate descent, FISTA).

The geometric reason for sparsity: the L1 ball has corners at the axes. Optimising a smooth loss subject to an L1 budget tends to land at one of those corners, where many coordinates are exactly zero.

Picking the regularisation strength

\(\lambda\) is a hyperparameter. You set it by cross-validation: try several values, evaluate held-out error, pick the one with the lowest.

Test yourself

If your dataset has 1000 features but only 100 data points, why is some kind of regularisation unavoidable for linear regression?

56. Bayesian linear regression

Linear regression with a Gaussian likelihood and a Gaussian prior on the weights is the canonical example of a fully analytically tractable Bayesian model.

The setup

Likelihood: \(p(y_n \mid \mathbf{x}_n, \mathbf{w}) = \mathcal{N}(y_n \mid \mathbf{w}^\top \mathbf{x}_n, \sigma^2)\). Prior: \(p(\mathbf{w}) = \mathcal{N}(\mathbf{w} \mid \mathbf{0}, \tau^2 \mathbf{I})\). Both Gaussian.

The posterior

By Bayes' theorem and Gaussian conjugacy, the posterior is also Gaussian:

\[ p(\mathbf{w} \mid \mathcal{D}) = \mathcal{N}(\mathbf{w} \mid \boldsymbol{\mu}_N, \boldsymbol{\Sigma}_N), \]

with

\[ \boldsymbol{\Sigma}_N = \left( \tfrac{1}{\tau^2} \mathbf{I} + \tfrac{1}{\sigma^2} \mathbf{X}^\top \mathbf{X} \right)^{-1}, \quad \boldsymbol{\mu}_N = \tfrac{1}{\sigma^2} \boldsymbol{\Sigma}_N \mathbf{X}^\top \mathbf{y}. \]

The posterior mean \(\boldsymbol{\mu}_N\) is the ridge solution. The posterior covariance \(\boldsymbol{\Sigma}_N\) tells you how uncertain you are about the weights, which the point estimate cannot give you.

The predictive distribution

For a new input \(\mathbf{x}^*\), the predictive distribution is

\[ p(y^* \mid \mathbf{x}^*, \mathcal{D}) = \mathcal{N}(y^* \mid \boldsymbol{\mu}_N^\top \mathbf{x}^*, \, \mathbf{x}^{*\top} \boldsymbol{\Sigma}_N \mathbf{x}^* + \sigma^2). \]

Two terms in the predictive variance: the second term \(\sigma^2\) is the irreducible noise. The first term \(\mathbf{x}^{*\top} \boldsymbol{\Sigma}_N \mathbf{x}^*\) is the additional uncertainty from not knowing the weights exactly. It is small near where the training data is dense and large in regions far from the data, exactly as intuition would suggest.

This is what makes Bayesian linear regression special. You get not just a prediction but a calibrated uncertainty estimate, and the math drops out cleanly because of Gaussian conjugacy.

Connection to Gaussian processes

Lift Bayesian linear regression to an infinite-dimensional feature space (using a kernel) and you get a Gaussian process: a Bayesian non-parametric regression model that is exact, has uncertainty estimates, and is the basis of much of probabilistic machine learning. The GP is essentially "Bayesian linear regression but in a feature space defined implicitly by a kernel."

Test yourself

For Bayesian linear regression, what happens to the posterior variance as you collect more data? Why?

57. Geometry of least squares

One last view of linear regression, the most elegant. The least-squares solution is exactly the orthogonal projection of \(\mathbf{y}\) onto the column space of \(\mathbf{X}\).

Recall from Module 14: the projection of a vector \(\mathbf{y}\) onto the column space of a matrix \(\mathbf{X}\) (assumed full column rank) is

\[ \pi(\mathbf{y}) = \mathbf{X}(\mathbf{X}^\top \mathbf{X})^{-1} \mathbf{X}^\top \mathbf{y}. \]

Notice this is exactly \(\mathbf{X} \mathbf{w}^*\), where \(\mathbf{w}^* = (\mathbf{X}^\top \mathbf{X})^{-1} \mathbf{X}^\top \mathbf{y}\) is the least-squares solution. So the predicted vector \(\hat{\mathbf{y}} = \mathbf{X}\mathbf{w}^*\) is the projection of the targets onto the span of the features. The residual \(\mathbf{y} - \hat{\mathbf{y}}\) is orthogonal to that span.

Manim · 16The target y, its projection onto the column space of X, and the residual perpendicular to the column space. Least squares is exactly this picture.

Why this view matters

Once you see least squares as a projection, several facts become obvious:

  • The residuals are orthogonal to the features. Each feature column of \(\mathbf{X}\) is in the column space, and the residual is orthogonal to the column space, so the inner product is zero. Equivalently \(\mathbf{X}^\top (\mathbf{y} - \hat{\mathbf{y}}) = \mathbf{0}\), which is just the normal equations again.
  • Adding a feature can only decrease the training error. Adding a column to \(\mathbf{X}\) makes the column space larger, and projecting onto a larger space gets you closer (or equal) to \(\mathbf{y}\). This is one reason regularisation matters: the training error is monotone in features but the test error is not.
  • The projection matrix \(\mathbf{H} = \mathbf{X}(\mathbf{X}^\top \mathbf{X})^{-1} \mathbf{X}^\top\) is symmetric and idempotent (\(\mathbf{H}^2 = \mathbf{H}\)). It is called the "hat matrix" because it puts the hat on \(\mathbf{y}\): \(\hat{\mathbf{y}} = \mathbf{H}\mathbf{y}\).

That closes Part VIII. Linear regression is the simplest predictor. The next part takes the same techniques to a different problem: not predicting a target, but finding low-dimensional structure in the data itself.

Test yourself

If you fit ordinary least squares with one feature column \(\mathbf{x}\), the residuals satisfy \(\mathbf{x}^\top \mathbf{r} = 0\). Why does this make geometric sense?

Part IX. Dimensionality reduction with PCA

Finding the directions that matter.

High-dimensional data is hard to look at, hard to store, hard to fit models to, and full of noise. Principal Component Analysis is the simplest tool for reducing dimensionality without losing too much. It is a one-line algorithm with two equivalent derivations and a beautiful connection to the SVD.

58. The dimensionality problem

You have \(N\) data points in \(\mathbb{R}^D\), where \(D\) is large (hundreds, thousands, sometimes millions). Three things go wrong as \(D\) grows:

  1. Computational cost. Most algorithms scale at least linearly in \(D\), and many scale superlinearly. Operating on raw 1024×1024 images (\(D = 10^6\)) is impractical without reducing them somehow.
  2. Statistical cost. The "curse of dimensionality", in high dimensions, every point is far from every other, distances become uninformative, and you need exponentially more data to fill the space.
  3. Visualisation. You cannot see in 100 dimensions. Two or three is the limit.

The intuition behind PCA: real high-dimensional data usually lives near a much lower-dimensional subspace. A photograph of a face has 1024×1024 = 1M pixels, but the space of "actually plausible face images" is much smaller, perhaps a few hundred dimensions. Find that subspace and you compress the data without losing what matters.

What "matters" can mean

Two reasonable definitions of "the most important dimensions":

  • Variance. Keep the directions along which the data varies most. Discarding low-variance directions throws away components that look like constants, which by definition carry little information.
  • Reconstruction error. Project the data onto a low-dimensional subspace and ask: how close is the projection to the original? Pick the subspace that minimises this distance.

Surprisingly, both definitions give the same answer. We will derive it from each angle in the next two modules.

Test yourself

If your data has 1000 features but most are constant or near-constant, how many "real" dimensions does it have? What would PCA do with such data?

59. PCA as variance maximisation

Center the data so it has zero mean. Then ask: what direction \(\mathbf{u}_1\) (with \(\|\mathbf{u}_1\| = 1\)) maximises the variance of the projected data?

The projection of a centered data point \(\mathbf{x}\) onto \(\mathbf{u}_1\) is the scalar \(\mathbf{u}_1^\top \mathbf{x}\). The variance of this scalar across the dataset is

\[ \text{Var}(\mathbf{u}_1^\top \mathbf{x}) = \mathbf{u}_1^\top \mathbf{S} \mathbf{u}_1, \]

where \(\mathbf{S} = \frac{1}{N} \mathbf{X}^\top \mathbf{X}\) is the sample covariance matrix (with the data centered). Maximise this subject to \(\|\mathbf{u}_1\| = 1\) using Lagrange multipliers (Module 45):

\[ \mathcal{L}(\mathbf{u}_1, \lambda) = \mathbf{u}_1^\top \mathbf{S} \mathbf{u}_1 - \lambda(\mathbf{u}_1^\top \mathbf{u}_1 - 1). \]

Setting \(\partial \mathcal{L} / \partial \mathbf{u}_1 = 0\) gives \(\mathbf{S}\mathbf{u}_1 = \lambda \mathbf{u}_1\). The optimal \(\mathbf{u}_1\) is an eigenvector of \(\mathbf{S}\), and the variance it captures is the corresponding eigenvalue \(\lambda\). To maximise variance, pick the eigenvector with the largest eigenvalue.

The next directions

To find the direction that captures the second-most variance, you constrain it to be orthogonal to \(\mathbf{u}_1\). Repeating the Lagrangian argument, you find that \(\mathbf{u}_2\) is the eigenvector of \(\mathbf{S}\) with the second-largest eigenvalue. By induction, the \(k\)-th principal direction is the \(k\)-th eigenvector.

The algorithm

PCA in three lines:

  1. Center the data.
  2. Compute the covariance matrix \(\mathbf{S} = \frac{1}{N} \mathbf{X}^\top \mathbf{X}\).
  3. Take its eigendecomposition. The eigenvectors are the principal components, ordered by decreasing eigenvalue.

To project to \(K\) dimensions, stack the top \(K\) eigenvectors as columns of a matrix \(\mathbf{U}_K\) and compute \(\mathbf{Z} = \mathbf{X}\mathbf{U}_K\). The reduced representation is \(\mathbf{Z} \in \mathbb{R}^{N \times K}\).

Manim · 052D data with anisotropic variance. The first principal direction aligns with the longest axis of the data ellipse; the second is orthogonal to it.
Test yourself

If your data has covariance matrix \(\mathbf{S} = \begin{pmatrix} 4 & 0 \\ 0 & 1 \end{pmatrix}\), what are the principal components? What variance does each capture?

60. PCA as reconstruction error

The other derivation: pick a \(K\)-dimensional subspace such that projecting data onto it loses as little as possible.

Given an orthonormal basis \(\mathbf{u}_1, \ldots, \mathbf{u}_K\) of the candidate subspace, the projection of \(\mathbf{x}\) is

\[ \tilde{\mathbf{x}} = \sum_{k=1}^K (\mathbf{u}_k^\top \mathbf{x}) \mathbf{u}_k. \]

The squared reconstruction error of one point is \(\|\mathbf{x} - \tilde{\mathbf{x}}\|^2\). Summed over the dataset and minimised over the choice of basis, you get the same answer as variance maximisation.

Why the two views agree

Consider the identity

\[ \|\mathbf{x}\|^2 = \|\tilde{\mathbf{x}}\|^2 + \|\mathbf{x} - \tilde{\mathbf{x}}\|^2 \]

(the Pythagorean theorem, since \(\tilde{\mathbf{x}}\) and \(\mathbf{x} - \tilde{\mathbf{x}}\) are orthogonal). Sum over the dataset:

\[ \sum_n \|\mathbf{x}_n\|^2 = \sum_n \|\tilde{\mathbf{x}}_n\|^2 + \sum_n \|\mathbf{x}_n - \tilde{\mathbf{x}}_n\|^2. \]

The left side is fixed. So minimising the reconstruction error is equivalent to maximising the variance of the projections. Same problem, two framings.

The reconstruction

To reconstruct in the original space, multiply the reduced coordinates by the basis matrix:

\[ \tilde{\mathbf{X}} = \mathbf{Z} \mathbf{U}_K^\top = \mathbf{X} \mathbf{U}_K \mathbf{U}_K^\top. \]

The map \(\mathbf{X} \to \tilde{\mathbf{X}}\) is the orthogonal projection onto the top-\(K\) principal subspace. \(\tilde{\mathbf{X}}\) is the rank-\(K\) approximation of \(\mathbf{X}\) (Module 22), and the singular values squared are the eigenvalues of the covariance.

Test yourself

If you keep only the first 2 principal components of 100-dimensional data, what fraction of the total variance do you preserve, in terms of the eigenvalues?

61. Computing PCA

Two ways to compute PCA in practice. Both give the same answer; one is faster in different regimes.

Eigendecomposition of the covariance matrix

The textbook approach. Cost: \(O(ND^2)\) to form \(\mathbf{S}\), then \(O(D^3)\) to eigendecompose. Good when \(D\) is small and \(N\) is large.

SVD of the data matrix

Take the SVD of the centered data: \(\mathbf{X} = \mathbf{U}\mathbf{\Sigma}\mathbf{V}^\top\). The principal components are the columns of \(\mathbf{V}\), ordered by singular value. The eigenvalues of \(\mathbf{S} = \mathbf{X}^\top \mathbf{X} / N\) are \(\sigma_i^2 / N\). Cost: \(O(\min(N D^2, D N^2))\). Better numerical stability since you never form \(\mathbf{X}^\top \mathbf{X}\), which squares the condition number.

For very large \(N\) and \(D\), neither full method is feasible. Iterative methods (power iteration, Lanczos) compute just the top few eigenvectors without ever forming the full decomposition. Most numerical libraries default to one of these for large-scale PCA.

The high-dimensional small-sample case

If \(D \gg N\) (e.g., 10000 features, 100 samples), the covariance matrix \(\mathbf{S}\) is \(D \times D\) and rank-deficient. There is a clever trick: compute the eigendecomposition of \(\mathbf{X}\mathbf{X}^\top\) (which is \(N \times N\)) and recover the eigenvectors of \(\mathbf{S}\) from those. This is the Gram-trick form of PCA, sometimes called the "dual" form, and it is what kernel PCA generalises.

Test yourself

For a 1000-dimensional dataset with 50 samples, would you eigendecompose the \(1000 \times 1000\) covariance or the \(50 \times 50\) Gram matrix? Why?

62. PCA via SVD

The SVD relationship is so useful it deserves its own module.

Centred data matrix \(\mathbf{X} \in \mathbb{R}^{N \times D}\). SVD: \(\mathbf{X} = \mathbf{U}\mathbf{\Sigma}\mathbf{V}^\top\). The columns of \(\mathbf{V}\) are the principal directions. The diagonal of \(\mathbf{\Sigma}\) is the singular values \(\sigma_i\), and the variance captured by the \(i\)-th principal component is \(\sigma_i^2 / N\). The reduced representation is \(\mathbf{Z} = \mathbf{X}\mathbf{V}_K = \mathbf{U}_K \mathbf{\Sigma}_K\), where \(\mathbf{U}_K\) and \(\mathbf{\Sigma}_K\) are truncations to the top \(K\).

Why the SVD form is preferred

  • Numerically more stable. Forming \(\mathbf{X}^\top \mathbf{X}\) and then eigendecomposing squares the condition number; SVD on \(\mathbf{X}\) directly does not.
  • Handles non-square data and rank-deficient cases without special casing.
  • Connects directly to low-rank approximation (Module 22): the top-\(K\) PCA reconstruction is the best rank-\(K\) approximation in Frobenius norm.

Whitening

Once you have the principal directions, you can whiten the data by additionally rescaling each component to unit variance:

\[ \mathbf{Z}_{\text{white}} = \mathbf{X}\mathbf{V}_K \mathbf{\Sigma}_K^{-1} \sqrt{N}. \]

Whitened data has identity covariance: each component has unit variance and the components are uncorrelated. Useful as a preprocessing step before models that benefit from feature standardisation.

Test yourself

If you whiten data using PCA and then apply a rotation, do you still have whitened data? Why or why not?

63. Choosing the number of components

The unanswered question of PCA: how many components to keep? There is no single right answer, but several useful heuristics.

Variance-explained threshold

Pick the smallest \(K\) such that the first \(K\) components capture at least, say, 90% or 95% of the total variance:

\[ \sum_{i=1}^K \sigma_i^2 \, \big/ \, \sum_{i=1}^D \sigma_i^2 \ge 0.95. \]

The threshold is arbitrary, but the principle is sound: keep enough components that the reconstruction error is acceptably small.

The scree plot

Plot the eigenvalues in decreasing order. Often there is a visible "elbow" where the eigenvalues drop sharply, suggesting the number of "real" components and a tail of "noise" components. Cut at the elbow.

Cross-validation on a downstream task

If PCA is a preprocessing step for some downstream model (regression, classification), pick the \(K\) that minimises held-out error of the downstream model. This is the most principled approach when you have a clear objective.

Probabilistic PCA and the marginal likelihood

For a more rigorous Bayesian treatment, you can frame PCA as a probabilistic latent-variable model (probabilistic PCA, or PPCA) and compute the marginal likelihood as a function of \(K\). The maximum is a principled choice. The model is closely related to factor analysis.

That closes Part IX. PCA is the simplest dimensionality reduction technique; many others build on the same core idea (kernel PCA, ICA, autoencoders) by changing the assumption of linearity or the criterion. The next part moves from finding structure in unlabelled data to fitting an explicit probabilistic model to it.

Test yourself

If your scree plot shows eigenvalues 100, 90, 80, 5, 4, 3, 2, 1, ..., what does that suggest about the structure of the data and the right \(K\)?

Part X. Density estimation with Gaussian mixtures

One distribution made of many.

A single Gaussian is too simple to model most real data. A mixture of Gaussians is, in a precise sense, a universal approximator: any reasonable density can be matched arbitrarily closely by enough Gaussians. Fitting one is a beautiful application of the Expectation-Maximisation algorithm.

64. Why mixtures

Real data is rarely unimodal. Heights of people show two modes (one for women, one for men). Customer behaviour clusters into segments. Pixel intensities of natural images form complicated multi-peaked distributions. A single Gaussian cannot represent any of this, and forcing it to fit gives a mean and variance that capture nothing.

A mixture model is a weighted sum of simpler distributions:

\[ p(\mathbf{x}) = \sum_{k=1}^K \pi_k \, p_k(\mathbf{x}), \quad \sum_k \pi_k = 1, \quad \pi_k \ge 0. \]

The \(\pi_k\) are mixture weights: probabilities that a draw comes from component \(k\). Each \(p_k(\mathbf{x})\) is a component density, often Gaussian. The mixture is more flexible than any single component and can capture multi-modal, skewed, and irregular distributions.

Why Gaussians for the components?

Three reasons. First, fitting a single Gaussian has a closed form (just sample mean and covariance), so each component is cheap. Second, the multivariate Gaussian is parameterised by a mean and a covariance, which gives enough flexibility to capture local structure without too many parameters. Third, with enough Gaussians, you can approximate any density arbitrarily well.

Mixtures of other distributions exist (mixtures of Bernoullis for binary data, mixtures of multinomials for topic modelling), but the Gaussian case is the textbook one.

Test yourself

Why is a single Gaussian insufficient to model a dataset that has two clear clusters?

65. The GMM density

A Gaussian mixture model with \(K\) components has the density

\[ p(\mathbf{x} \mid \boldsymbol{\theta}) = \sum_{k=1}^K \pi_k \, \mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k), \]

where \(\boldsymbol{\theta} = \{\pi_k, \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k\}_{k=1}^K\) is the full parameter set: a weight, a mean, and a covariance matrix per component.

Fitting it: the obvious thing fails

You might think you could just take the log-likelihood and maximise it like you did for a single Gaussian. Try it:

\[ \log p(\mathbf{X} \mid \boldsymbol{\theta}) = \sum_{n=1}^N \log \!\left( \sum_{k=1}^K \pi_k \, \mathcal{N}(\mathbf{x}_n \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k) \right). \]

The log of a sum, not a sum of logs. The derivatives of this with respect to \(\boldsymbol{\mu}_k\) couple all the components together; you cannot solve the resulting equations in closed form. There is no analytic MLE for the GMM.

The latent variable trick

The way out is to introduce a latent (unobserved) discrete variable for each data point that says which component generated it. Conditioned on the latents, the problem decomposes into per-component Gaussian fits, which are easy. The trick is that we do not know the latents, so we have to estimate both the latents and the parameters together. That is what the EM algorithm does, and we get to it in two modules.

Test yourself

For a one-dimensional GMM with two components, write the density. How many parameters does it have?

66. Latent variable formulation

Introduce, for each data point \(\mathbf{x}_n\), a latent indicator \(z_n \in \{1, 2, \ldots, K\}\) that says which component generated it. The generative story is:

  1. Sample \(z_n\) from Categorical\((\boldsymbol{\pi})\): with probability \(\pi_k\), say "this point comes from component \(k\)."
  2. Sample \(\mathbf{x}_n\) from \(\mathcal{N}(\boldsymbol{\mu}_{z_n}, \boldsymbol{\Sigma}_{z_n})\).

Marginalising out \(z_n\) recovers the GMM density:

\[ p(\mathbf{x}_n) = \sum_k p(z_n = k) \, p(\mathbf{x}_n \mid z_n = k) = \sum_k \pi_k \, \mathcal{N}(\mathbf{x}_n \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k). \]

The complete-data log-likelihood

If you knew the \(z_n\), the log-likelihood would split into independent per-component pieces:

\[ \log p(\mathbf{X}, \mathbf{Z} \mid \boldsymbol{\theta}) = \sum_n \log \pi_{z_n} + \sum_n \log \mathcal{N}(\mathbf{x}_n \mid \boldsymbol{\mu}_{z_n}, \boldsymbol{\Sigma}_{z_n}). \]

Maximising this, if you knew the \(z_n\), is trivial: each component's parameters are estimated from its own assigned points by ordinary Gaussian MLE.

The problem is that you do not know the \(z_n\). The EM algorithm gets around this by alternating: given the parameters, compute the posterior over \(z_n\); given the posterior over \(z_n\), update the parameters. Each step is closed-form; the iterations converge.

Test yourself

If you somehow knew the true component assignments \(z_n\) for every data point, how would you compute the MLE for \(\boldsymbol{\mu}_k\) and \(\boldsymbol{\Sigma}_k\)?

67. The EM algorithm

The Expectation-Maximisation algorithm. Two steps, repeated until convergence.

E-step (Expectation)

Compute the posterior probability of each latent assignment given the current parameters. For each point \(n\) and each component \(k\):

\[ \gamma_{nk} = p(z_n = k \mid \mathbf{x}_n, \boldsymbol{\theta}) = \frac{\pi_k \, \mathcal{N}(\mathbf{x}_n \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)}{\sum_j \pi_j \, \mathcal{N}(\mathbf{x}_n \mid \boldsymbol{\mu}_j, \boldsymbol{\Sigma}_j)}. \]

This is just Bayes' theorem applied to the discrete latent. The numerator is the joint probability of the point and the assignment; the denominator normalises across all assignments. The \(\gamma_{nk}\) are called responsibilities: each data point is "softly assigned" to components, with point \(n\) splitting its weight across components in proportion to how plausible each is.

M-step (Maximisation)

With responsibilities fixed, update the parameters to maximise the expected complete-data log-likelihood. The updates have closed form:

\[ \pi_k = \frac{1}{N} \sum_n \gamma_{nk}, \quad \boldsymbol{\mu}_k = \frac{\sum_n \gamma_{nk} \mathbf{x}_n}{\sum_n \gamma_{nk}}, \]

\[ \boldsymbol{\Sigma}_k = \frac{\sum_n \gamma_{nk} (\mathbf{x}_n - \boldsymbol{\mu}_k)(\mathbf{x}_n - \boldsymbol{\mu}_k)^\top}{\sum_n \gamma_{nk}}. \]

Each update is a weighted version of the standard Gaussian MLE, with the weights being the responsibilities. \(\sum_n \gamma_{nk}\) is the "effective number of points" assigned to component \(k\).

Manim · 06EM iterations on a 2D GMM with two components. Watch the responsibilities sharpen and the Gaussians settle onto the clusters.

Why it works

Each iteration cannot decrease the log-likelihood. The proof is via Jensen's inequality: the EM objective is a tight lower bound on the log-likelihood that you alternately tighten (E-step) and maximise (M-step). The algorithm converges to a local maximum of the log-likelihood, though not necessarily the global one.

K-means as a special case

If you fix all covariances to \(\sigma^2 \mathbf{I}\) and let \(\sigma^2 \to 0\), the responsibilities become "hard" (1 for the closest mean, 0 elsewhere) and EM reduces to K-means clustering. K-means is the limiting case of EM on a mixture of identical isotropic Gaussians.

Test yourself

Why is EM guaranteed to converge to a local maximum, not necessarily the global maximum, of the log-likelihood?

68. Convergence and pitfalls

EM is reliable but has a few failure modes worth knowing.

Local maxima

EM gets stuck at local maxima. Different initialisations give different fits. The standard remedy is to run EM many times from random starting points and keep the result with the highest likelihood. K-means++ initialisation (a smart way to spread initial means apart) is also widely used.

Singular covariances

If a component's covariance matrix becomes singular (a very flat Gaussian aligned with a few data points), the likelihood diverges to infinity. Pure MLE is broken in this regime: you can keep tightening the covariance around fewer and fewer points and the likelihood keeps increasing. Workarounds:

  • Add a small regularisation \(\epsilon \mathbf{I}\) to each covariance update, keeping it well-conditioned.
  • Use a Bayesian formulation with a prior on covariances that prevents collapse.
  • Restart the offending component from random data points.

How many components?

Like in PCA, choosing \(K\) is a model-selection question. Approaches: cross-validation on held-out likelihood, information criteria like AIC or BIC, or fully Bayesian alternatives like Dirichlet process mixtures that learn \(K\) from the data.

Computational cost

Each EM iteration is \(O(N K D^2)\) for full covariances. For high-dimensional data, full covariance per cluster is too many parameters; use diagonal or shared covariance structures, or apply PCA first to reduce \(D\).

That closes Part X. The Gaussian mixture model is the simplest interesting probabilistic model and EM is one of the most useful general algorithms in ML. Almost any problem with hidden structure (clustering, missing data, hidden Markov models, factor analysis) can be tackled with the same alternating-optimisation idea. The next part is the last canonical ML algorithm: classification with support vector machines.

Test yourself

You initialise GMM with three random points but the data has two clear clusters. What is likely to happen during EM?

Part XI. Classification with SVMs

The maximum-margin idea.

Classification is the problem of assigning each input to one of several discrete categories. The support vector machine is a classifier built on a single principle: separate the classes with the widest possible buffer. It has a beautiful geometric derivation, a clean dual formulation, and the kernel trick for handling non-linear boundaries.

69. The classification problem

You have data points \((\mathbf{x}_n, y_n)\) where \(\mathbf{x}_n \in \mathbb{R}^D\) and \(y_n \in \{-1, +1\}\) (the binary case). You want a function \(f(\mathbf{x})\) such that \(f(\mathbf{x}_n)\) and \(y_n\) agree on as many training points as possible, and the function generalises to new points.

For a linear classifier, the function takes the form

\[ f(\mathbf{x}) = \text{sign}(\mathbf{w}^\top \mathbf{x} + b). \]

The set \(\{\mathbf{x} : \mathbf{w}^\top \mathbf{x} + b = 0\}\) is a hyperplane in \(\mathbb{R}^D\), and the classifier predicts +1 on one side and \(-1\) on the other. Training the classifier means finding \(\mathbf{w}\) and \(b\).

Linearly separable case

If there exists a hyperplane that perfectly separates the two classes, the data is linearly separable. In this case, infinitely many separating hyperplanes exist. The question becomes: which one is best?

Intuition: the hyperplane that has the largest margin to the closest points on either side. A wide buffer means the classifier is robust: small perturbations of the inputs do not flip the prediction. The SVM picks this hyperplane.

Test yourself

If your data is in 2D and you have +1 points at (1, 1), (2, 2) and -1 points at (-1, -1), (-2, -2), what is a separating hyperplane? Is there only one?

70. Separating hyperplanes and the margin

For a hyperplane \(\mathbf{w}^\top \mathbf{x} + b = 0\), the signed distance from a point \(\mathbf{x}\) to the hyperplane is \((\mathbf{w}^\top \mathbf{x} + b) / \|\mathbf{w}\|\). The distance is positive on the +1 side and negative on the -1 side.

For a labeled point \((\mathbf{x}_n, y_n)\), the quantity \(y_n(\mathbf{w}^\top \mathbf{x}_n + b) / \|\mathbf{w}\|\) is the signed margin. It is positive when the point is on the correct side and negative when misclassified. The minimum signed margin over the dataset is the geometric margin:

\[ \rho = \min_n \frac{y_n(\mathbf{w}^\top \mathbf{x}_n + b)}{\|\mathbf{w}\|}. \]

The maximum-margin classifier picks \(\mathbf{w}\) and \(b\) to maximise \(\rho\) subject to the data being correctly classified.

Why maximise margin?

Three justifications:

  • Robustness. A wide-margin classifier is less sensitive to small perturbations in the input.
  • Generalisation. Vapnik-Chervonenkis theory shows that wide-margin hyperplanes have a low VC dimension on bounded data, which implies better generalisation bounds.
  • Uniqueness. Among the infinitely many separating hyperplanes for separable data, the maximum-margin one is unique.

This is the principle the rest of the part formalises into a tractable optimisation problem.

Manim · 17Several candidate hyperplanes separate two classes. The SVM picks the one with the widest margin. The boldly-outlined points are the support vectors.
Test yourself

For two points at (-1, 0) and (1, 0) with labels -1 and +1, what is the maximum-margin separating hyperplane? What is the margin?

71. The primal SVM

To turn the maximum-margin idea into a tractable problem, we exploit a freedom: the same hyperplane is described by infinitely many \((\mathbf{w}, b)\) (any positive scaling). Fix the scaling so that the closest points have margin exactly 1: \(y_n (\mathbf{w}^\top \mathbf{x}_n + b) \ge 1\) for all \(n\), with equality on the closest points. The geometric margin is then \(1/\|\mathbf{w}\|\), so maximising it is equivalent to minimising \(\|\mathbf{w}\|\), or its square.

The hard-margin SVM

For linearly separable data, the SVM problem in primal form is:

\[ \min_{\mathbf{w}, b} \tfrac{1}{2} \|\mathbf{w}\|^2 \quad \text{subject to} \quad y_n(\mathbf{w}^\top \mathbf{x}_n + b) \ge 1 \text{ for all } n. \]

This is a quadratic programming problem with linear constraints. It has a unique solution (since the objective is strictly convex and the feasible set is convex), and standard QP solvers can find it efficiently.

Why the constraints look this way

The constraint \(y_n(\mathbf{w}^\top \mathbf{x}_n + b) \ge 1\) does two things at once. First, the sign forces the point to be on the correct side: \(y_n\) and \(\mathbf{w}^\top \mathbf{x}_n + b\) must have the same sign. Second, the magnitude forces it to be at least one "margin unit" away from the boundary. Points on the margin satisfy the constraint with equality; points further away satisfy it strictly.

What if the data is not separable?

The hard-margin formulation has no solution if no separating hyperplane exists. For real data this is the rule, not the exception. The soft-margin SVM (Module 74) introduces slack variables that allow some points to violate the margin at a cost. For now, focus on the separable case.

Test yourself

For the SVM primal, why is minimising \(\frac{1}{2}\|\mathbf{w}\|^2\) (instead of \(\|\mathbf{w}\|\)) more convenient mathematically? Hint: think about derivatives.

72. The dual SVM

The Lagrangian of the primal SVM is

\[ \mathcal{L}(\mathbf{w}, b, \boldsymbol{\alpha}) = \tfrac{1}{2}\|\mathbf{w}\|^2 - \sum_n \alpha_n \big[ y_n(\mathbf{w}^\top \mathbf{x}_n + b) - 1 \big], \quad \alpha_n \ge 0. \]

Differentiating with respect to \(\mathbf{w}\) and \(b\) and setting to zero gives

\[ \mathbf{w} = \sum_n \alpha_n y_n \mathbf{x}_n, \quad \sum_n \alpha_n y_n = 0. \]

The first says the optimal weight vector is a linear combination of the training inputs, weighted by their labels and the multipliers. Substituting back into the Lagrangian gives the dual problem:

\[ \max_{\boldsymbol{\alpha} \ge 0} \, \sum_n \alpha_n - \tfrac{1}{2} \sum_{n, m} \alpha_n \alpha_m y_n y_m \mathbf{x}_n^\top \mathbf{x}_m, \quad \text{subject to } \sum_n \alpha_n y_n = 0. \]

Two things have happened. First, the primal optimisation over \(D+1\) variables has become a dual optimisation over \(N\) variables (one per data point). Second, and crucially, the data only appears through inner products \(\mathbf{x}_n^\top \mathbf{x}_m\). This is the doorway to kernels.

Support vectors

By KKT conditions (Module 47), at the optimum, complementary slackness forces \(\alpha_n = 0\) for points with strict inequality (interior points), and \(\alpha_n > 0\) only for points on the margin. The points with \(\alpha_n > 0\) are the support vectors. They are the only ones that matter; everything else can be discarded after training, and the classifier is unchanged. This is where the algorithm gets its name.

For typical separable data, the number of support vectors is small (a handful), making SVMs efficient at prediction time even when \(N\) is large.

Predicting

Given a new input \(\mathbf{x}^*\), the prediction is

\[ f(\mathbf{x}^*) = \text{sign}\!\left( \sum_{n : \alpha_n > 0} \alpha_n y_n \mathbf{x}_n^\top \mathbf{x}^* + b \right). \]

Inner products with the support vectors only.

Test yourself

Why is the dual SVM particularly useful when \(D\) (the input dimension) is much larger than \(N\) (the number of data points)?

73. Kernels and the kernel trick

The dual SVM uses inner products only. This is the key that unlocks non-linear classification.

Suppose you map your inputs to a higher-dimensional space via some feature map \(\phi : \mathbb{R}^D \to \mathbb{R}^M\), with \(M\) possibly enormous or infinite. In the feature space, you do linear SVM. The dual problem now has \(\phi(\mathbf{x}_n)^\top \phi(\mathbf{x}_m)\) where it had \(\mathbf{x}_n^\top \mathbf{x}_m\). If \(M\) is huge, computing \(\phi\) explicitly is expensive.

The kernel trick: define a function \(k(\mathbf{x}, \mathbf{x}') = \phi(\mathbf{x})^\top \phi(\mathbf{x}')\) that computes the inner product in feature space without ever forming \(\phi\). For some choices of \(\phi\), this function has a closed form that is far cheaper than computing \(\phi\) and the inner product.

Common kernels

  • Linear kernel. \(k(\mathbf{x}, \mathbf{x}') = \mathbf{x}^\top \mathbf{x}'\). The trivial case; you are doing ordinary linear SVM.
  • Polynomial kernel. \(k(\mathbf{x}, \mathbf{x}') = (\mathbf{x}^\top \mathbf{x}' + c)^d\). Implicit feature space contains all polynomials of degree at most \(d\). Useful for moderate non-linearity.
  • RBF (Gaussian) kernel. \(k(\mathbf{x}, \mathbf{x}') = \exp(-\|\mathbf{x} - \mathbf{x}'\|^2 / (2\sigma^2))\). Implicit feature space is infinite-dimensional. The most popular kernel for general-purpose SVM.
  • Sigmoid kernel. \(k(\mathbf{x}, \mathbf{x}') = \tanh(a\mathbf{x}^\top \mathbf{x}' + c)\). Loosely related to neural networks. Used historically, less common now.

Mercer's theorem

Not every function \(k(\cdot, \cdot)\) is a valid kernel. Mercer's theorem says \(k\) is a valid kernel (corresponds to an inner product in some feature space) if and only if it is symmetric and positive semi-definite (the kernel matrix \(K_{nm} = k(\mathbf{x}_n, \mathbf{x}_m)\) is symmetric PSD for any choice of points). When you design a custom kernel, you check this condition.

The kernelised SVM

Replace every \(\mathbf{x}_n^\top \mathbf{x}_m\) in the dual with \(k(\mathbf{x}_n, \mathbf{x}_m)\). The optimisation is unchanged in structure, but the implicit feature space can be very high-dimensional. Predictions on new inputs use \(k(\mathbf{x}_n, \mathbf{x}^*)\) in place of \(\mathbf{x}_n^\top \mathbf{x}^*\). The algorithm never sees the explicit feature space at all.

The kernel SVM was, for about 15 years, the dominant general-purpose classifier. It has since been surpassed by deep neural networks for most tasks where data is abundant, but it remains the right choice for small-to-medium datasets and is a clean illustration of the power of the dual formulation.

Manim · 18Two classes that cannot be separated by a line in 2D. Lifted to 3D via z = x² + y², a flat plane separates them cleanly. That is the kernel trick.
Test yourself

What does the implicit feature space of an RBF kernel look like? Why is it infinite-dimensional?

74. Soft margins

Real data is rarely linearly separable, even after kernelisation. The hard-margin SVM has no solution in this case. The fix: allow margin violations, but charge for them.

Slack variables

Introduce a non-negative slack variable \(\xi_n \ge 0\) for each point and modify the constraint:

\[ y_n(\mathbf{w}^\top \mathbf{x}_n + b) \ge 1 - \xi_n, \quad \xi_n \ge 0. \]

If a point satisfies the original constraint (correctly classified, on the right side of the margin), \(\xi_n = 0\). If it is on the wrong side of the margin or misclassified, \(\xi_n > 0\) by the amount of the violation.

The penalised objective

Add a penalty for the slacks to the objective:

\[ \min_{\mathbf{w}, b, \boldsymbol{\xi}} \tfrac{1}{2}\|\mathbf{w}\|^2 + C \sum_n \xi_n. \]

The hyperparameter \(C > 0\) controls the trade-off. Small \(C\): tolerate violations cheaply, focus on a wide margin (more bias, less variance). Large \(C\): violations are expensive, the model tries hard to fit (less bias, more variance). \(C \to \infty\) recovers the hard-margin SVM.

The dual of the soft-margin SVM is identical to the hard-margin dual except that the constraints on \(\boldsymbol{\alpha}\) become \(0 \le \alpha_n \le C\). The inner product structure is preserved, so the kernel trick still works.

Hinge loss connection

Eliminating the slacks (\(\xi_n = \max(0, 1 - y_n(\mathbf{w}^\top \mathbf{x}_n + b))\)) gives the unconstrained form:

\[ \min_{\mathbf{w}, b} \tfrac{1}{2}\|\mathbf{w}\|^2 + C \sum_n \max(0, 1 - y_n(\mathbf{w}^\top \mathbf{x}_n + b)). \]

The second term is the hinge loss. It is zero when the point is correctly classified with margin at least 1, and grows linearly with the violation otherwise. This view connects the SVM to the modern picture of supervised learning as "minimise a loss plus a regulariser", the hinge loss is the SVM's loss of choice, and \(\|\mathbf{w}\|^2\) is its L2 regulariser.

Where to go from here

Multi-class SVMs combine binary SVMs (one-vs-all, one-vs-one). Regression SVMs (SVR) use an \(\epsilon\)-insensitive loss instead of hinge. Structured SVMs handle output spaces more complex than \(\{-1, +1\}\). All of them inherit the same dual structure, the same kernel trick, and the same support-vector sparsity.

That closes the note. You now have the math foundations and four worked applications. The combinations are nearly limitless: the same eigendecomposition tool that powers PCA also powers spectral clustering. The same EM algorithm that fits GMMs also fits hidden Markov models and topic models. The same kernel trick that powers SVM also powers Gaussian processes. Once you see the connections, every paper becomes a variant of something you already understand.

Final prompt

Pick any one of the four ML problems in Parts VIII through XI. In one paragraph, sketch how it uses each of the four math foundations (linear algebra, calculus, probability, optimisation). Where does each one show up?