Part 0
A note before you start.
By the end of this note you will be able to read a machine learning paper, understand the math in it, and tell the difference between a real result and a wrapped-up triviality. The path runs through six chapters of math foundations and four chapters of canonical ML problems built on those foundations. The math comes first because everything else collapses without it.
How to use this note
Machine learning rests on four pillars: linear algebra, vector calculus, probability, and continuous optimisation. Each one is a substantial subject in its own right. The first six parts of this note cover them in turn. The last five parts apply them to four classical ML problems: linear regression, principal component analysis, Gaussian mixture models, and support vector machines.
You are reading this in the order it was written. The dependencies run forward: nothing in Part V relies on something later. If a module is hard, the prerequisites are above it.
Throughout the note, three things appear alongside the prose:
- Equations, set with MathJax. Inline math looks like \(y = mx + b\). Display math gets its own line: \[ \mathbf{y} = \mathbf{A}\mathbf{x} + \mathbf{b}. \]
- Animations (manim videos), embedded inline. Looped, no audio. They show what the algebra is doing geometrically.
- Interactive widgets, drawn in SVG. You drag points, slide parameters, and watch the math respond. Used wherever interactivity teaches better than animation.
Proofs are mostly omitted. The bias of this note is toward intuition, computation, and the "what is this for" question. When a result is stated without proof, you can take it on faith and look up the proof in any standard reference if you want one.
Each module ends with a "test yourself" prompt. Read it, try to answer aloud or on paper before moving on. The prompts are designed so that a confident answer means the module landed; a fuzzy one means a re-read is in order.
A note on background
You should be comfortable with high-school algebra and ideally a single-variable calculus course. Anything beyond that gets introduced where it shows up. If you have taken any university-level math course recently, you have plenty.
Why does this note teach the math foundations before the machine learning algorithms, and what would go wrong if you tried to learn them in the opposite order?
Part I. Linear algebra
The language everything else is written in.
A vector is the simplest object in machine learning that is more than a number. A matrix is the next step up. Almost every ML algorithm is, when you peel away the surface, a transformation of a vector by a matrix, possibly followed by a non-linear squash. Knowing what those operations mean geometrically is half the work of understanding why an algorithm does what it does.
1. Why linear algebra
Most things you measure end up in a list. A house has a list of features (square feet, bedrooms, year built, asking price). A photograph is a list of pixel intensities. A user's behaviour on a website is a list of click counts. Once you commit to representing the world as lists of numbers, you need a vocabulary for talking about them, and that vocabulary is linear algebra.
Two operations show up in essentially every algorithm.
- Adding two vectors. Combine two lists of numbers element-by-element. This is how you average data, how you compute residuals, how a neural network sums its inputs.
- Multiplying a vector by a matrix. Apply a linear transformation. This is how you rotate data, project it, change its coordinate system, or pass it through a layer of a neural network.
That is the whole list. Every other operation in this part is built from those two. The reason linear algebra dominates ML is not that the problems are inherently linear. It is that linear operations are tractable: they have closed-form solutions, fast algorithms, and clean geometric interpretations. Non-linear behaviour is then introduced in carefully controlled ways, on top of a linear backbone.
The promise of this part
By the end of Part I you will be able to look at an expression like \(\mathbf{x}^\top \mathbf{A} \mathbf{x}\) and immediately know its shape (it is a scalar), what \(\mathbf{A}\) is doing geometrically (it is acting on \(\mathbf{x}\) and the result is being measured against \(\mathbf{x}\) again), and how the answer changes if \(\mathbf{A}\) does. That kind of fluency is what makes the rest of ML readable.
Name three pieces of real-world data that you would naturally represent as a vector. For each, what would each entry of the vector mean?
2. Vectors
A vector is an ordered list of real numbers. It has a length (how many numbers it has, called its dimension) and you can do two operations to it without ever leaving the world of vectors.
The notation \(\mathbb{R}^n\) means "the set of all vectors with \(n\) real-number entries." A vector in \(\mathbb{R}^2\) has two entries and you can draw it on a piece of paper. A vector in \(\mathbb{R}^3\) has three and you can draw it in space. A vector in \(\mathbb{R}^{784}\) is a flattened 28×28 grayscale image, one entry per pixel; you cannot draw it but the algebra works exactly the same.
The two operations
You can add two vectors of the same dimension by adding their entries:
You can multiply a vector by a scalar (a single number) by multiplying every entry:
That is it. From these two operations alone, all of linear algebra is built up. A set with these two operations satisfying a few obvious algebraic rules (commutativity, associativity, a zero element, distributivity) is called a vector space. The set \(\mathbb{R}^n\) is the most familiar one; later parts will use other vector spaces (functions, polynomials, matrices) where the same rules apply.
Geometric picture
It helps to think of a 2D vector as an arrow from the origin to a point. Adding two vectors then means: lay the second arrow's tail at the first arrow's tip. The sum is the arrow from the original origin to the final tip. Scalar multiplication stretches or shrinks (or flips, if the scalar is negative) without changing the direction. The widget above is built on this picture.
If \(\mathbf{a} = (3, 1)^\top\) and \(\mathbf{b} = (-1, 2)^\top\), what is \(\mathbf{a} + \mathbf{b}\)? What is \(2\mathbf{a} - \mathbf{b}\)? Sketch all four arrows.
3. Span and linear independence
Combine the two operations from the last module and you get the most useful single concept in linear algebra. Given a set of vectors \(\mathbf{v}_1, \mathbf{v}_2, \ldots, \mathbf{v}_k\), a linear combination of them is any expression of the form
where the \(\alpha_i\) are scalars. As you let the scalars vary over all real numbers, the set of all vectors you can produce is called the span of \(\{\mathbf{v}_1, \ldots, \mathbf{v}_k\}\).
The span has a clean geometric interpretation. The span of one non-zero vector is the line through the origin in that vector's direction. The span of two non-parallel vectors in 3D is the plane through the origin containing them. The span of three vectors that do not lie in one plane is all of \(\mathbb{R}^3\). In general, the span of \(k\) vectors in \(\mathbb{R}^n\) is a flat subspace of \(\mathbb{R}^n\) passing through the origin, of some dimension between 0 and \(\min(k, n)\).
Linear independence
A set of vectors is linearly independent if none of them can be written as a linear combination of the others. Equivalently, the only way to write \(\mathbf{0}\) as a linear combination is with all coefficients zero:
If they are not independent, they are linearly dependent: at least one is redundant, expressible as a combination of the others. Two parallel vectors in 2D are linearly dependent (one is a multiple of the other). Three vectors in 2D are always linearly dependent (they cannot all point in genuinely different directions when only two dimensions are available).
Subspaces
A subspace of \(\mathbb{R}^n\) is a non-empty subset that is closed under both vector addition and scalar multiplication. Adding two members or scaling one stays inside. Every span is a subspace; in fact, every subspace is the span of some set of vectors.
Subspaces are the structural objects you will care about in ML. The set of all images of dogs in a 784-dimensional pixel space is not a subspace (it is some weird curved manifold), but many useful approximations and projections do live in subspaces. PCA finds a low-dimensional subspace that captures most of the variance in a dataset. Linear regression projects the target onto a subspace spanned by the features. Knowing what a subspace is, and how to spot one, is foundational.
Are the vectors \((1,2)^\top\), \((2,4)^\top\), and \((1,0)^\top\) linearly independent in \(\mathbb{R}^2\)? What is the dimension of their span?
4. Basis and dimension
A basis of a vector space is a linearly independent set of vectors that spans the whole space. Two conditions, both required: independent (none redundant) and spanning (covers everything).
The most familiar basis for \(\mathbb{R}^n\) is the standard basis: the vectors
Any vector can be written as a unique linear combination of these. The vector \((3, 5, 2)^\top\) is just \(3\mathbf{e}_1 + 5\mathbf{e}_2 + 2\mathbf{e}_3\); the entries themselves are the coefficients in this basis.
Dimension
The deep theorem of linear algebra (one of the few that you should genuinely keep in mind): every basis of a given vector space has the same number of elements. This number is called the dimension of the space. \(\mathbb{R}^n\) has dimension \(n\). A line through the origin has dimension 1. A plane through the origin has dimension 2.
Different bases give different coordinates for the same vector. A point in the plane has one set of \((x, y)\) coordinates in the standard basis, and a different pair of numbers in any rotated basis. The point itself does not change; only its description does. This idea, that a vector is independent of any particular basis, is the core insight of linear algebra. We will return to it formally in Module 9 when we discuss change of basis.
Are \((1,1)^\top\) and \((1,-1)^\top\) a basis of \(\mathbb{R}^2\)? If so, write the vector \((3, 5)^\top\) as a linear combination of them.
5. Matrices: data and transformation
A matrix is a rectangular array of numbers. An \(m \times n\) matrix has \(m\) rows and \(n\) columns. Matrices in machine learning play two distinct roles, and you should keep them separate in your head.
Role 1: A matrix as a stack of data points
Suppose you have 100 houses, each described by 4 features (square feet, bedrooms, year built, price). You can store them as a \(100 \times 4\) matrix \(\mathbf{X}\), where each row is one house and each column is one feature. This is the most common shape of "data" in ML libraries: rows are observations, columns are variables.
Role 2: A matrix as a function
An \(m \times n\) matrix \(\mathbf{A}\) also acts on a vector \(\mathbf{x} \in \mathbb{R}^n\) and produces a vector \(\mathbf{y} \in \mathbb{R}^m\) by the rule \(\mathbf{y} = \mathbf{A}\mathbf{x}\). When viewed this way, \(\mathbf{A}\) is a function from \(\mathbb{R}^n\) to \(\mathbb{R}^m\), and a special kind of function: a linear map. It satisfies
for any vectors \(\mathbf{x}, \mathbf{y}\) and scalars \(\alpha, \beta\). It distributes over linear combinations. Geometrically, this means the matrix sends straight lines through the origin to straight lines through the origin, and parallel lines to parallel lines. It can stretch, rotate, reflect, project, and shear, but it cannot bend or curve.
The columns of A are where the basis vectors land
The single most useful fact about matrices: if \(\mathbf{A}\) is an \(m \times n\) matrix and \(\mathbf{e}_j\) is the \(j\)-th standard basis vector, then \(\mathbf{A}\mathbf{e}_j\) is the \(j\)-th column of \(\mathbf{A}\). So the columns of a matrix are exactly the images of the basis vectors under the transformation. This gives you a direct way to read off what a matrix does: look at where it sends \(\mathbf{e}_1, \mathbf{e}_2, \ldots\) and you have completely determined it.
For example, the matrix
sends \(\mathbf{e}_1 = (1,0)^\top\) to \((0,1)^\top\) and \(\mathbf{e}_2 = (0,1)^\top\) to \((-1, 0)^\top\). These two outputs are 90 degrees from where the inputs started. \(\mathbf{A}\) is a 90-degree counterclockwise rotation, and you read that off the columns.
Write down the \(2 \times 2\) matrix that scales x-coordinates by 3 and y-coordinates by 2. What does its action look like on the unit square?
6. Matrix arithmetic
Matrices add and scale element-wise, just like vectors. The interesting operation is multiplication.
Matrix-vector multiplication
If \(\mathbf{A}\) is \(m \times n\) and \(\mathbf{x}\) is in \(\mathbb{R}^n\), then \(\mathbf{A}\mathbf{x}\) is the vector in \(\mathbb{R}^m\) whose \(i\)-th entry is the dot product of the \(i\)-th row of \(\mathbf{A}\) with \(\mathbf{x}\):
An equally useful interpretation: \(\mathbf{A}\mathbf{x}\) is a linear combination of the columns of \(\mathbf{A}\), with the entries of \(\mathbf{x}\) as coefficients. So
where \(\mathbf{a}_j\) is the \(j\)-th column of \(\mathbf{A}\). Both views are correct and you should be fluent switching between them. The second is more useful when thinking geometrically: it says the matrix mixes its columns according to the recipe in \(\mathbf{x}\).
Matrix-matrix multiplication
If \(\mathbf{A}\) is \(m \times n\) and \(\mathbf{B}\) is \(n \times p\), then \(\mathbf{A}\mathbf{B}\) is the \(m \times p\) matrix whose \(j\)-th column is \(\mathbf{A}\) acting on the \(j\)-th column of \(\mathbf{B}\). Equivalently, \((\mathbf{A}\mathbf{B})_{ij} = \sum_k A_{ik} B_{kj}\). The number of columns of \(\mathbf{A}\) must match the number of rows of \(\mathbf{B}\); otherwise the product is undefined.
The cleanest way to think about \(\mathbf{A}\mathbf{B}\) is as function composition. The matrix \(\mathbf{B}\) is a linear map from \(\mathbb{R}^p\) to \(\mathbb{R}^n\). The matrix \(\mathbf{A}\) is a linear map from \(\mathbb{R}^n\) to \(\mathbb{R}^m\). The product \(\mathbf{A}\mathbf{B}\) is the composed map: first apply \(\mathbf{B}\), then apply \(\mathbf{A}\), which gives a linear map from \(\mathbb{R}^p\) to \(\mathbb{R}^m\). This is why dimensions must agree, and why the result has the dimensions it does.
Matrix multiplication is not commutative
In general \(\mathbf{A}\mathbf{B} \neq \mathbf{B}\mathbf{A}\). Even when both products are defined and have the same shape, the results are usually different. Geometrically: rotating then stretching is not the same as stretching then rotating. The order of operations matters.
It is associative, though: \((\mathbf{A}\mathbf{B})\mathbf{C} = \mathbf{A}(\mathbf{B}\mathbf{C})\), so you can drop parentheses without ambiguity. And it distributes over addition: \(\mathbf{A}(\mathbf{B}+\mathbf{C}) = \mathbf{A}\mathbf{B} + \mathbf{A}\mathbf{C}\).
If \(\mathbf{A}\) is \(3 \times 4\) and \(\mathbf{B}\) is \(4 \times 2\), what shape is \(\mathbf{A}\mathbf{B}\)? Is \(\mathbf{B}\mathbf{A}\) defined?
7. Solving Ax = b
Given a matrix \(\mathbf{A}\) and a vector \(\mathbf{b}\), the equation \(\mathbf{A}\mathbf{x} = \mathbf{b}\) asks: which input \(\mathbf{x}\) does the linear map \(\mathbf{A}\) send to \(\mathbf{b}\)? This is the central problem of linear algebra and you will solve some version of it almost every day in ML, often without realising.
Three cases
Depending on \(\mathbf{A}\) and \(\mathbf{b}\), the equation \(\mathbf{A}\mathbf{x} = \mathbf{b}\) has zero, one, or infinitely many solutions.
- Unique solution. When \(\mathbf{A}\) is square and "full rank" (we will define this precisely in Module 8), the equation has exactly one \(\mathbf{x}\). This is the case where the map \(\mathbf{A}\) is invertible.
- No solution. When \(\mathbf{b}\) is not in the span of the columns of \(\mathbf{A}\), no input can possibly land at \(\mathbf{b}\). The system is inconsistent. This is the situation in linear regression, where the data is rarely exactly linear; we will use least squares to find the closest \(\mathbf{x}\) instead.
- Infinitely many solutions. When the columns of \(\mathbf{A}\) are not independent (some are redundant), the kernel is non-trivial and you can add any kernel element to a solution and get another. This is what regularisation is for, in part: it picks one solution out of the infinite family.
Gaussian elimination
The classical algorithm for solving \(\mathbf{A}\mathbf{x} = \mathbf{b}\) when a unique solution exists. It applies elementary row operations (swap two rows, scale a row by a non-zero constant, add a multiple of one row to another) to the augmented matrix \([\mathbf{A} \mid \mathbf{b}]\) until \(\mathbf{A}\) has been reduced to upper-triangular form (or further, to reduced row echelon form). Then back-substitute.
You will rarely run Gaussian elimination by hand on anything bigger than \(3 \times 3\). Numerical libraries do it for you, with care for stability (pivoting) and efficiency. The reason to know how it works: every linear algebra question is answered by some refinement of this algorithm.
Geometric picture
\(\mathbf{A}\mathbf{x} = \mathbf{b}\) is asking "what combination of the columns of \(\mathbf{A}\) equals \(\mathbf{b}\)?" If \(\mathbf{b}\) is in the column span of \(\mathbf{A}\), the answer is yes there is at least one combination. If the columns are independent, the combination is unique. Both pieces of insight matter, and together they explain the three cases above.
For the system \(x + y = 3\), \(2x + 2y = 7\), explain in one sentence why there is no solution. What about \(x + y = 3\), \(2x + 2y = 6\)?
8. Inverse, transpose, rank
Three properties of a matrix you should be able to compute and reason about.
The inverse
If \(\mathbf{A}\) is square and there exists a matrix \(\mathbf{A}^{-1}\) such that \(\mathbf{A}\mathbf{A}^{-1} = \mathbf{A}^{-1}\mathbf{A} = \mathbf{I}\) (the identity matrix), then \(\mathbf{A}\) is invertible. The inverse is the matrix that "undoes" \(\mathbf{A}\): if \(\mathbf{A}\) is a 90-degree rotation, \(\mathbf{A}^{-1}\) is a 90-degree rotation in the opposite direction.
Not every square matrix has an inverse. A matrix is invertible exactly when its columns are linearly independent (equivalently, its determinant is non-zero, or its kernel contains only \(\mathbf{0}\)). When an inverse exists, the unique solution to \(\mathbf{A}\mathbf{x} = \mathbf{b}\) is \(\mathbf{x} = \mathbf{A}^{-1}\mathbf{b}\). In practice you almost never form \(\mathbf{A}^{-1}\) explicitly; you solve \(\mathbf{A}\mathbf{x} = \mathbf{b}\) directly because that is faster and more numerically stable.
The transpose
The transpose of an \(m \times n\) matrix \(\mathbf{A}\), written \(\mathbf{A}^\top\), is the \(n \times m\) matrix you get by swapping rows and columns: \((\mathbf{A}^\top)_{ij} = A_{ji}\). The transpose has a few essential properties you will use constantly:
A matrix that equals its own transpose is called symmetric. Symmetric matrices are everywhere in ML: covariance matrices are symmetric, the Hessian of a smooth function is symmetric, the Gram matrix \(\mathbf{X}^\top \mathbf{X}\) is symmetric. They have especially nice spectral properties (Module 18).
Rank
The rank of a matrix is the dimension of the span of its columns (equivalently, the dimension of the span of its rows). It tells you how much "real information" is in the matrix. A \(5 \times 3\) matrix can have rank at most 3. If its rank is 3, all three columns are linearly independent and the matrix is "full column rank." If the rank is 2, two columns are independent and the third is some combination of them.
Rank shows up in three guises that you should learn to recognise:
- The number of independent rows (or columns) of \(\mathbf{A}\).
- The dimension of the image of the map \(\mathbf{A}\): how many dimensions of \(\mathbb{R}^m\) are actually reached by \(\mathbf{A}\mathbf{x}\) as \(\mathbf{x}\) ranges over \(\mathbb{R}^n\).
- The number of non-zero singular values of \(\mathbf{A}\) (we get to singular values in Module 21).
All three definitions agree.
What is the rank of the matrix \(\begin{pmatrix} 1 & 2 \\ 2 & 4 \\ 3 & 6 \end{pmatrix}\)? Is it invertible? What is the dimension of its column space?
9. Linear maps and change of basis
A vector exists independently of any particular description. The pair \((3, 5)\) only makes sense once you have agreed on a basis. With the standard basis it means "3 to the right and 5 up." With a rotated basis it would mean something else. The vector itself is the same arrow in space; the coordinates depend on the choice of basis.
This means that a single linear map can be represented by many different matrices, depending on which bases you use for the input and output spaces. The map is the abstract object; the matrix is its description in coordinates.
The change-of-basis formula
Let \(T : \mathbb{R}^n \to \mathbb{R}^n\) be a linear map represented by a matrix \(\mathbf{A}\) in the standard basis. Suppose you choose a different basis whose vectors form the columns of an invertible matrix \(\mathbf{P}\). The matrix that represents \(T\) in this new basis is
The recipe: change coordinates from new-basis to standard via \(\mathbf{P}\), apply the map \(\mathbf{A}\) in standard coordinates, then change back to new-basis coordinates via \(\mathbf{P}^{-1}\). The composition is one matrix that does the whole thing in new-basis coordinates.
This formula is responsible for some of the most useful constructions in ML. Diagonalisation, which we cover in Module 19, is the operation of finding a special basis in which a matrix becomes diagonal. PCA changes basis to one in which the data has uncorrelated coordinates. Eigendecomposition, SVD, and Jordan form are all change-of-basis statements at heart.
Matrices with respect to two bases
For a map between different spaces (say \(T : \mathbb{R}^n \to \mathbb{R}^m\)), you choose a basis for each. If \(\mathbf{P}\) is the matrix whose columns are the new basis of \(\mathbb{R}^n\) (in standard coordinates), and \(\mathbf{Q}\) is the matrix whose columns are the new basis of \(\mathbb{R}^m\), then a map represented by \(\mathbf{A}\) in standard bases is represented by \(\mathbf{Q}^{-1}\mathbf{A}\mathbf{P}\) in the new ones. Same logic, two changes of basis.
If \(\mathbf{A}\) is the matrix of \(T\) in the standard basis, and \(\mathbf{P}\) is the matrix whose columns are the new basis vectors, what does \(\mathbf{P}^{-1}\mathbf{A}\mathbf{P}\) compute? What does each of the three operations represent geometrically?
10. Image, kernel, rank-nullity
Two subspaces are associated with every linear map, and they explain almost everything about how the map behaves.
Image (column space)
The image of \(\mathbf{A}\), written \(\text{Im}(\mathbf{A})\) or \(\text{col}(\mathbf{A})\), is the set of all outputs:
It is a subspace of \(\mathbb{R}^m\) (the codomain). It is exactly the span of the columns of \(\mathbf{A}\), which is why "image" and "column space" are the same thing. Its dimension equals the rank of \(\mathbf{A}\).
Kernel (null space)
The kernel of \(\mathbf{A}\), written \(\ker(\mathbf{A})\) or \(\text{null}(\mathbf{A})\), is the set of inputs that get mapped to zero:
It is a subspace of \(\mathbb{R}^n\) (the domain). Its dimension is called the nullity. The kernel measures how much the map "collapses": if \(\ker(\mathbf{A}) = \{\mathbf{0}\}\) then \(\mathbf{A}\) is injective and you can recover \(\mathbf{x}\) from \(\mathbf{A}\mathbf{x}\). If the kernel is non-trivial, then many different inputs share the same output.
The rank-nullity theorem
The bookkeeping rule that ties them together. For any \(m \times n\) matrix:
The total dimension of the input space splits cleanly into "the dimensions that survive" (the rank, image dimension) and "the dimensions that get squashed to zero" (the nullity). This is one of the few theorems in this part you should keep on hand. It explains why a tall skinny matrix (\(m > n\) and full column rank) has trivial kernel, why a wide short matrix has non-trivial kernel, and why the dimensions of the various subspaces always add up correctly.
Why this matters in ML
Linear regression on a dataset of \(N\) points with \(D\) features uses an \(N \times D\) feature matrix. If \(N > D\) and the features are independent, the kernel is trivial and a unique least-squares solution exists. If \(D > N\) (more features than data points, the high-dimensional regime), the kernel is non-trivial, the system is underdetermined, and you need regularisation to pick one solution. PCA finds a low-dimensional image; the rest of the original space gets sent to a particular orthogonal complement that you can ignore. These all hinge on the image-kernel split.
That closes Part I. Linear algebra is the deepest tool in this note and we will use it constantly. The next part adds geometry on top: lengths, angles, projections, rotations.
For a \(4 \times 6\) matrix of rank 3, what is the dimension of the kernel? What is the dimension of the image?
Part II. Analytic geometry
Lengths, angles, and projections.
Linear algebra gives you addition and scaling. Geometry adds three things: how long is a vector, what is the angle between two vectors, and how do you project one vector onto another. The whole of analytic geometry is built from one extra ingredient called the inner product.
11. Norms and the geometry of vectors
A norm is a function that assigns a non-negative length to every vector. It must satisfy three properties: it is zero only for the zero vector, it scales linearly with scaling (\(\|c\mathbf{v}\| = |c|\|\mathbf{v}\|\)), and it satisfies the triangle inequality (\(\|\mathbf{u} + \mathbf{v}\| \le \|\mathbf{u}\| + \|\mathbf{v}\|\)).
The norm you are most familiar with is the Euclidean norm (also called the \(\ell_2\) norm):
It is the straight-line distance from the origin to the point \(\mathbf{v}\). Pythagoras's theorem in \(n\) dimensions.
Other useful norms
The \(\ell_1\) norm sums absolute values: \(\|\mathbf{v}\|_1 = |v_1| + |v_2| + \cdots + |v_n|\). It is the "taxicab" distance: how far you would walk if you could only move along the axes. It plays a starring role in regularisation (LASSO regression) because it encourages sparse solutions: minimising it tends to set entries to exactly zero rather than to small non-zero values.
The \(\ell_\infty\) norm takes the largest absolute value: \(\|\mathbf{v}\|_\infty = \max_i |v_i|\). It is the worst-case coordinate.
Each norm carves out a different "unit ball" (the set of vectors with norm at most 1). The Euclidean unit ball is a disc. The \(\ell_1\) ball is a diamond. The \(\ell_\infty\) ball is a square. The shape of the unit ball is what gives each norm its character.
Why norms matter
Norms measure error. When you fit a model and ask "how far is the prediction from the truth," you are computing a norm of the residual. Different norms give different fits. Squared Euclidean error gives least-squares regression, which is sensitive to outliers because it weights large errors quadratically. \(\ell_1\) error gives least absolute deviations regression, which is robust to outliers but harder to compute. The choice of norm encodes the loss function.
For \(\mathbf{v} = (3, 4)^\top\), compute \(\|\mathbf{v}\|_1\), \(\|\mathbf{v}\|_2\), and \(\|\mathbf{v}\|_\infty\). Which norm gives the largest value? Which gives the smallest?
12. Inner products: angles, distances
Norms tell you "how big." The inner product tells you "how aligned." The standard inner product on \(\mathbb{R}^n\) is the dot product:
It produces a single number from two vectors. The geometric interpretation: \(\langle \mathbf{u}, \mathbf{v} \rangle = \|\mathbf{u}\| \|\mathbf{v}\| \cos\theta\), where \(\theta\) is the angle between the two vectors. From this you can immediately read three facts:
- If \(\langle \mathbf{u}, \mathbf{v} \rangle = 0\) and neither vector is zero, the vectors are orthogonal (perpendicular). The dot product is the cleanest test for perpendicularity.
- If \(\langle \mathbf{u}, \mathbf{v} \rangle > 0\), they point in roughly the same direction (acute angle). If negative, they point in opposite directions (obtuse angle).
- The Euclidean norm comes from the inner product: \(\|\mathbf{v}\|_2 = \sqrt{\langle \mathbf{v}, \mathbf{v} \rangle}\). Squared length equals self-inner-product.
The Cauchy-Schwarz inequality
One of the most-used inequalities in mathematics: for any inner product,
with equality exactly when one vector is a scalar multiple of the other. This is a consequence of \(|\cos\theta| \le 1\), but the inequality holds for any inner product, not just the dot product. It is the engine behind the triangle inequality and shows up in proofs of convergence everywhere.
Distance, computed via inner product
The squared distance between two vectors is:
Expanding this way is something you will do many times. It is also why squared distances are often more convenient than distances: they are quadratic in the inputs, which means they are differentiable, which means you can optimise them with calculus.
General inner products
The dot product is one inner product on \(\mathbb{R}^n\), but it is not the only one. A general inner product is any function \(\langle \cdot, \cdot \rangle\) that is bilinear, symmetric, and positive definite (\(\langle \mathbf{v}, \mathbf{v} \rangle \ge 0\) with equality only at \(\mathbf{v} = \mathbf{0}\)). Any symmetric positive-definite matrix \(\mathbf{M}\) defines an inner product by \(\langle \mathbf{u}, \mathbf{v} \rangle_\mathbf{M} = \mathbf{u}^\top \mathbf{M} \mathbf{v}\). When you encounter a "Mahalanobis distance" or a "weighted inner product," it is one of these.
The vectors \(\mathbf{u} = (1, 2)^\top\) and \(\mathbf{v} = (3, -1)^\top\). Compute their inner product. Are they orthogonal? What is the cosine of the angle between them?
13. Orthonormal bases and Gram-Schmidt
Bases come in two flavours that are technically equivalent but practically very different. Any linearly independent spanning set works as a basis. But an orthonormal basis is one where the basis vectors are pairwise orthogonal and each has length 1. The standard basis \(\{\mathbf{e}_1, \ldots, \mathbf{e}_n\}\) is orthonormal. Most other bases are not.
Why orthonormal bases are special: in an orthonormal basis, the coordinates of a vector are just its inner products with the basis vectors. If \(\{\mathbf{q}_1, \ldots, \mathbf{q}_n\}\) is orthonormal, then for any \(\mathbf{v}\):
No system of equations to solve, no matrix to invert. Just a dot product per coordinate. Compare this to a general basis, where you have to solve \(\mathbf{B}\mathbf{x} = \mathbf{v}\) to find the coordinates. Orthonormal bases give you coordinates for free.
The Gram-Schmidt procedure
Given any linearly independent set, you can produce an orthonormal basis for the same span using Gram-Schmidt. The algorithm:
- Take the first vector and normalise it: \(\mathbf{q}_1 = \mathbf{v}_1 / \|\mathbf{v}_1\|\).
- For each subsequent \(\mathbf{v}_k\), subtract off its projections onto the already-found \(\mathbf{q}_i\)'s, then normalise: \(\mathbf{w}_k = \mathbf{v}_k - \sum_{i<k} \langle \mathbf{v}_k, \mathbf{q}_i \rangle \mathbf{q}_i\), then \(\mathbf{q}_k = \mathbf{w}_k / \|\mathbf{w}_k\|\).
The result is a sequence \(\mathbf{q}_1, \mathbf{q}_2, \ldots\) that is orthonormal and spans the same subspace as the original sequence. In numerical practice, the classical Gram-Schmidt is unstable, and the modified Gram-Schmidt or Householder reflections are used instead. Both produce the same theoretical result.
Gram-Schmidt is the engine behind QR decomposition, which factors any matrix \(\mathbf{A} = \mathbf{Q}\mathbf{R}\) where \(\mathbf{Q}\) has orthonormal columns and \(\mathbf{R}\) is upper triangular. QR is the workhorse for solving least-squares problems.
Apply Gram-Schmidt to the vectors \(\mathbf{v}_1 = (1, 1)^\top\) and \(\mathbf{v}_2 = (1, 0)^\top\). What orthonormal basis do you get?
14. Orthogonal projections
An orthogonal projection sends a vector to its closest point in a subspace. Given a subspace \(U \subset \mathbb{R}^n\) and a vector \(\mathbf{v}\), the projection \(\pi_U(\mathbf{v})\) is the unique vector in \(U\) such that the difference \(\mathbf{v} - \pi_U(\mathbf{v})\) is orthogonal to every vector in \(U\).
Projection onto a line
The simplest case: project \(\mathbf{v}\) onto the line spanned by a unit vector \(\mathbf{u}\). The answer is
The inner product picks out the length of the shadow; the direction is the unit vector. This formula is everywhere.
Projection onto a subspace
If \(U\) has an orthonormal basis \(\{\mathbf{q}_1, \ldots, \mathbf{q}_k\}\), the projection sums the line projections:
If you stack the basis vectors as columns of a matrix \(\mathbf{Q}\), this becomes \(\pi_U(\mathbf{v}) = \mathbf{Q}\mathbf{Q}^\top \mathbf{v}\). The matrix \(\mathbf{P} = \mathbf{Q}\mathbf{Q}^\top\) is the projection matrix onto \(U\). It satisfies \(\mathbf{P}^2 = \mathbf{P}\) (projecting twice is the same as projecting once) and \(\mathbf{P} = \mathbf{P}^\top\) (it is symmetric for orthogonal projections).
Projection onto the column space of a general matrix
If \(\mathbf{A}\) is an \(n \times k\) matrix with full column rank (but its columns are not necessarily orthonormal), the projection onto its column space is
This formula will become familiar when we reach linear regression in Part VIII. The least-squares solution is exactly the orthogonal projection of the target onto the column space of the feature matrix.
Why projections matter
Projections are how you take a point and find its closest representative in a constrained set. PCA is a sequence of projections. Linear regression is a projection. The Kalman filter is a projection. Anywhere you have a "closest point in a subspace" problem, the answer is a projection, and the geometry is what you saw above.
Project the vector \((3, 4)^\top\) onto the line spanned by \((1, 0)^\top\). What is the residual (the part of the vector that is orthogonal to the line)?
15. Rotations
A rotation is a linear map that preserves both lengths and angles. In 2D, every rotation is given by a matrix of the form
which rotates counterclockwise by angle \(\theta\). The columns of \(\mathbf{R}(\theta)\) are exactly the images of \(\mathbf{e}_1\) and \(\mathbf{e}_2\) after rotation: \((\cos\theta, \sin\theta)^\top\) and \((-\sin\theta, \cos\theta)^\top\). The minus sign in the second column comes from the geometry: rotating \(\mathbf{e}_2\) counterclockwise by \(\theta\) takes it into the second quadrant.
Properties
Rotation matrices are orthogonal: they satisfy \(\mathbf{R}^\top \mathbf{R} = \mathbf{I}\), or equivalently \(\mathbf{R}^{-1} = \mathbf{R}^\top\). The inverse of a rotation is just its transpose, and the transpose is the rotation by \(-\theta\). The determinant of a rotation matrix is exactly 1. (Orthogonal matrices with determinant \(-1\) are reflections.)
Composing rotations adds the angles: \(\mathbf{R}(\alpha) \mathbf{R}(\beta) = \mathbf{R}(\alpha + \beta)\). This is one of the few cases where matrix multiplication is commutative, because rotation by \(\alpha\) and then \(\beta\) is the same as rotating by \(\beta\) then \(\alpha\) when both are around the same axis.
Higher dimensions
In 3D, a rotation is a matrix that rotates around some axis. There are several common parameterisations: Euler angles, axis-angle, rotation matrices directly, or unit quaternions. Each has trade-offs (Euler angles have the gimbal lock problem; quaternions have a sign ambiguity). For ML purposes, the rotation matrix form is enough: an orthogonal matrix with determinant 1.
In \(n\) dimensions, the set of all rotation matrices forms a group called \(SO(n)\) (the "special orthogonal group"). It comes up whenever your data lives on a sphere or you need to rotate a high-dimensional embedding without distortion.
Apply the 90-degree rotation matrix \(\mathbf{R}(\pi/2)\) to the vector \((1, 0)^\top\). What do you get? Does the answer match your geometric expectation?
16. Inner products of functions
Everything in this part has been about \(\mathbb{R}^n\), but the linear-algebra framework extends, almost without modification, to function spaces. The vector space here is the set of functions on some interval, and the inner product is an integral instead of a sum:
This satisfies all the inner product axioms: bilinear, symmetric, and positive definite (assuming we restrict to functions for which the integral exists and is finite). The corresponding norm is
called the \(L^2\) norm. The angle between two functions, the orthogonality of two functions, and the projection of one function onto another all transfer directly from the vector case. Two functions are orthogonal if their inner product (the integral of their pointwise product) is zero.
Why this matters
Many tools in ML and signal processing rely on functions being treated as vectors in an infinite-dimensional space. Fourier analysis writes a function as a sum of orthogonal sines and cosines: it is a basis expansion in a function space. Wavelets do the same with a different orthonormal basis. Reproducing kernel Hilbert spaces (the formal foundation of kernel methods, including the kernel SVM in Part XI) are inner-product function spaces with an extra structural property. All of these inherit the geometric intuition from \(\mathbb{R}^n\).
You will rarely need to compute these integrals by hand for ML, but you should know that the framework exists and that the same projection-and-norm machinery applies. When you read about a "function space" or an "RKHS" or a "Hilbert space," it is the same set of ideas as Part II, scaled up.
Are the functions \(\sin(x)\) and \(\cos(x)\) orthogonal on the interval \([0, 2\pi]\)? What about \(\sin(x)\) and \(\sin(2x)\)?
Part III. Matrix decompositions
Factoring matrices into useful pieces.
A square matrix is hard to understand at a glance, but every interesting matrix can be factored into a product of matrices that are individually easy. Each factorisation reveals a different structural truth. Eigendecomposition reveals which directions are stretched and which are unchanged. SVD reveals the same idea for any matrix, square or not. The decompositions are how you actually see what a matrix is doing.
17. Determinant and trace
Two single numbers you can compute from a square matrix that summarise it. Both will appear constantly.
The determinant
The determinant of a square matrix \(\mathbf{A}\), written \(\det(\mathbf{A})\) or \(|\mathbf{A}|\), is the signed volume scaling factor of the linear map \(\mathbf{A}\). If you take the unit cube and apply \(\mathbf{A}\) to it, the result is a parallelepiped whose volume is \(|\det(\mathbf{A})|\). The sign is positive if \(\mathbf{A}\) preserves orientation and negative if it flips it.
Three consequences of this definition:
- \(\det(\mathbf{A}) = 0\) means the unit cube collapses to a lower-dimensional shape. Equivalently: the columns of \(\mathbf{A}\) are linearly dependent, the kernel is non-trivial, and \(\mathbf{A}\) is not invertible. Determinant zero is the signature of a non-invertible matrix.
- \(\det(\mathbf{A}\mathbf{B}) = \det(\mathbf{A}) \det(\mathbf{B})\). Composition of maps multiplies their volume scalings.
- \(\det(\mathbf{A}^{-1}) = 1/\det(\mathbf{A})\) when the inverse exists. Undoing a stretch is the reciprocal stretch.
For small matrices the formulas are explicit. The 2x2 determinant is \(\det\begin{pmatrix} a & b \\ c & d \end{pmatrix} = ad - bc\). The 3x3 case has six terms (the Sarrus rule). For general \(n\), the determinant is a sum of \(n!\) terms, which is computationally infeasible past about \(n=4\). In practice you compute \(\det(\mathbf{A})\) via LU decomposition, where it falls out as the product of the diagonal of the upper triangular factor.
The trace
The trace of a square matrix is the sum of its diagonal entries: \(\text{tr}(\mathbf{A}) = A_{11} + A_{22} + \cdots + A_{nn}\). Trace has fewer geometric properties than the determinant, but it has one near-magical algebraic property:
Even when \(\mathbf{A}\mathbf{B}\) and \(\mathbf{B}\mathbf{A}\) are different matrices (or different shapes), their traces are equal. This trick (called the cyclic property of the trace, since you can also rotate triple products) is used constantly in matrix calculus and in deriving update rules for ML algorithms.
Trace also equals the sum of the eigenvalues (Module 18), and determinant equals their product. Two ways to summarise spectral information without diagonalising.
Compute the determinant and trace of \(\mathbf{A} = \begin{pmatrix} 3 & 1 \\ 2 & 4 \end{pmatrix}\). Is it invertible?
18. Eigenvalues and eigenvectors
An eigenvector of a square matrix \(\mathbf{A}\) is a non-zero vector \(\mathbf{v}\) that does not change direction when \(\mathbf{A}\) acts on it. It might get longer, shorter, or flipped, but it stays on its own line:
The scalar \(\lambda\) is the corresponding eigenvalue: it tells you how much \(\mathbf{v}\) gets stretched. Eigenvalues can be positive (stretches), negative (flips and stretches), zero (the eigenvector lies in the kernel), or even complex (when \(\mathbf{A}\) acts as a rotation in some plane).
Finding them
Rearranging \(\mathbf{A}\mathbf{v} = \lambda\mathbf{v}\) gives \((\mathbf{A} - \lambda \mathbf{I})\mathbf{v} = \mathbf{0}\). For non-zero \(\mathbf{v}\) to exist, \(\mathbf{A} - \lambda\mathbf{I}\) must be singular, which means its determinant is zero:
This is the characteristic equation. It is a polynomial in \(\lambda\) of degree \(n\) (for an \(n \times n\) matrix). Its roots are the eigenvalues, counted with multiplicity. The fundamental theorem of algebra says it has exactly \(n\) roots over the complex numbers, though some may be repeated and some may be complex even when the matrix is real.
For each eigenvalue \(\lambda_i\), the corresponding eigenvectors form the kernel of \(\mathbf{A} - \lambda_i \mathbf{I}\). You find them by solving the linear system.
The geometric picture
For a 2x2 matrix in the real plane, the eigenvectors are the directions that the matrix leaves invariant. If you draw a circle of vectors and apply \(\mathbf{A}\) to each, most will rotate to a new direction. The eigenvectors are the rare ones that stay on their own line. The eigenvalues say by how much they got scaled.
Special cases worth remembering
- Symmetric matrices have real eigenvalues and orthogonal eigenvectors. This is one of the most-used facts in ML.
- Positive-definite matrices (symmetric with all positive eigenvalues) define inner products, generate ellipses (not just any quadric), and are the matrices you can take a Cholesky decomposition of.
- Orthogonal matrices (rotations and reflections) have eigenvalues with absolute value 1. Rotations have complex-conjugate pairs of eigenvalues on the unit circle. Reflections have a real eigenvalue of \(-1\) (the flipped direction) and one of \(+1\) (the axis of reflection).
The matrix \(\mathbf{A} = \begin{pmatrix} 2 & 0 \\ 0 & 3 \end{pmatrix}\) has obvious eigenvectors and eigenvalues. What are they? Why does the structure of this matrix make it easy?
19. Diagonalisation
If a square matrix \(\mathbf{A}\) has \(n\) linearly independent eigenvectors, you can write
where \(\mathbf{P}\) is the matrix whose columns are the eigenvectors and \(\mathbf{D}\) is the diagonal matrix of corresponding eigenvalues. This is called diagonalisation, and you should read the formula geometrically.
\(\mathbf{P}^{-1}\): change basis from standard coordinates to the eigenvector basis. \(\mathbf{D}\): in the eigenvector basis, the action of \(\mathbf{A}\) is just stretching each axis independently. \(\mathbf{P}\): change basis back. The complicated map \(\mathbf{A}\) becomes the simple map \(\mathbf{D}\) once you choose the right coordinate system.
When can you diagonalise?
Not every matrix is diagonalisable. The condition is that the matrix has \(n\) linearly independent eigenvectors. Some matrices have repeated eigenvalues and not enough eigenvectors to fill out a full basis (defective matrices). Examples include the shear matrix \(\begin{pmatrix} 1 & 1 \\ 0 & 1 \end{pmatrix}\), which has eigenvalue 1 with only a 1D eigenspace.
A guaranteed-diagonalisable case: symmetric matrices are always diagonalisable, and you can pick the eigenvectors to be orthonormal. The result is the spectral theorem:
where \(\mathbf{Q}\) is orthogonal (\(\mathbf{Q}^\top = \mathbf{Q}^{-1}\)). Symmetric matrices are everywhere in ML (covariance, Gram matrix, Hessian), and the spectral theorem is what lets us always factor them this way.
Computing matrix powers
One immediate payoff of diagonalisation: \(\mathbf{A}^k = \mathbf{P}\mathbf{D}^k \mathbf{P}^{-1}\), and \(\mathbf{D}^k\) is just the diagonal matrix of \(\lambda_i^k\). Powers of a diagonalisable matrix are computed in \(O(n^3)\) plus \(n\) scalar exponentiations, which is dramatically faster than naively multiplying \(k\) times. This shows up in PageRank, Markov chains, and any iterated linear dynamic.
Why is the matrix \(\begin{pmatrix} 1 & 1 \\ 0 & 1 \end{pmatrix}\) not diagonalisable? Find its eigenvalues and eigenvectors and explain the issue.
20. Cholesky decomposition
For symmetric positive-definite matrices, there is an even sharper factorisation: the Cholesky decomposition.
where \(\mathbf{L}\) is lower triangular with positive diagonal. Cholesky exists if and only if \(\mathbf{A}\) is symmetric positive-definite, and when it does, it is unique. The algorithm to compute it is essentially Gaussian elimination tailored to the symmetric case, and it is twice as fast as a general LU decomposition.
Where it shows up
- Solving \(\mathbf{A}\mathbf{x} = \mathbf{b}\) for symmetric positive-definite \(\mathbf{A}\). Find \(\mathbf{L}\), then solve \(\mathbf{L}\mathbf{y} = \mathbf{b}\) by forward substitution and \(\mathbf{L}^\top \mathbf{x} = \mathbf{y}\) by back substitution. Both triangular solves are \(O(n^2)\) and the factorisation is \(O(n^3 / 3)\).
- Sampling from a multivariate Gaussian \(\mathcal{N}(\boldsymbol{\mu}, \boldsymbol{\Sigma})\). If \(\boldsymbol{\Sigma} = \mathbf{L}\mathbf{L}^\top\), then \(\boldsymbol{\mu} + \mathbf{L}\mathbf{z}\) where \(\mathbf{z}\) is standard normal has the right covariance. We will revisit this in Part V.
- Computing log-determinants. \(\log \det \mathbf{A} = 2 \sum_i \log L_{ii}\). This is needed in Gaussian process and Bayesian methods, where you compute log-likelihoods.
Cholesky is the standard tool for "I have a positive-definite matrix and I need to do something with it numerically." If your matrix is barely positive-definite (close to singular), Cholesky gives you the cleanest signal: the algorithm fails (tries to take the square root of a negative number) at the first eigenvalue near zero.
The matrix \(\begin{pmatrix} 4 & 2 \\ 2 & 5 \end{pmatrix}\) is symmetric positive-definite. Find its Cholesky decomposition by hand.
21. Singular value decomposition
Eigendecomposition only works for square matrices, and even then only when there are enough eigenvectors. The singular value decomposition (SVD) generalises the same idea to any matrix.
Every \(m \times n\) matrix \(\mathbf{A}\) (real or complex, square or rectangular, full rank or not) can be factored as
where \(\mathbf{U}\) is an \(m \times m\) orthogonal matrix, \(\mathbf{V}\) is an \(n \times n\) orthogonal matrix, and \(\mathbf{\Sigma}\) is an \(m \times n\) "diagonal" matrix (zero except on the main diagonal) whose diagonal entries are non-negative. These diagonal entries are the singular values of \(\mathbf{A}\), conventionally ordered \(\sigma_1 \ge \sigma_2 \ge \cdots \ge 0\).
Geometric picture
Read \(\mathbf{U}\mathbf{\Sigma}\mathbf{V}^\top\) right to left as the action on a vector: \(\mathbf{V}^\top\) rotates the input into a special basis (the right singular vectors). \(\mathbf{\Sigma}\) stretches each axis by the corresponding singular value (and zeros out axes that have no contribution to the output, if \(m > n\) or \(\mathbf{A}\) is not full rank). \(\mathbf{U}\) rotates the result into the output space.
Equivalently: any linear map is "rotate, stretch each axis independently, rotate again." That is the entire content of SVD. Eigendecomposition is the special case where the input rotation and the output rotation are the same.
What the singular values tell you
The number of non-zero singular values is the rank of \(\mathbf{A}\). The largest singular value \(\sigma_1\) is the operator norm: it is the largest factor by which \(\mathbf{A}\) can stretch any unit vector. The ratio \(\sigma_1 / \sigma_n\) (when \(\sigma_n > 0\)) is the condition number, which measures how sensitive \(\mathbf{A}\mathbf{x} = \mathbf{b}\) is to perturbations: a high condition number means small errors in \(\mathbf{b}\) cause large errors in \(\mathbf{x}\).
Relationship to eigendecomposition
The columns of \(\mathbf{V}\) are the eigenvectors of \(\mathbf{A}^\top \mathbf{A}\). The columns of \(\mathbf{U}\) are the eigenvectors of \(\mathbf{A}\mathbf{A}^\top\). The singular values squared are the eigenvalues of either of these symmetric matrices. So if you know how to compute symmetric eigendecomposition, you know how to compute SVD.
If \(\mathbf{A}\) is a \(5 \times 3\) matrix of rank 2, how many singular values does it have? How many of those are zero?
22. Low-rank approximation
The single most important practical use of SVD: compressing matrices.
Write \(\mathbf{A} = \mathbf{U}\mathbf{\Sigma}\mathbf{V}^\top\). Each column of \(\mathbf{U}\) and \(\mathbf{V}\) pairs up with one singular value to give a rank-one matrix \(\sigma_i \mathbf{u}_i \mathbf{v}_i^\top\). The full SVD is the sum of these:
where \(r\) is the rank. The truncated SVD keeps only the first \(k\) terms (the \(k\) largest singular values):
The Eckart-Young theorem says \(\mathbf{A}_k\) is the best rank-\(k\) approximation to \(\mathbf{A}\) in both Frobenius norm and operator norm. There is no other rank-\(k\) matrix closer to \(\mathbf{A}\).
Why this matters
Most real-world matrices are approximately low-rank. An image matrix has highly correlated rows and columns; most of its singular values are tiny. A document-term matrix is dominated by a small number of topics. A user-movie ratings matrix is well-approximated by a few latent factors. In each case, replacing \(\mathbf{A}\) with \(\mathbf{A}_k\) for a modest \(k\) loses very little information but compresses the matrix from \(mn\) numbers down to \(k(m+n+1)\) numbers (the singular values, the columns of \(\mathbf{U}_k\), and the columns of \(\mathbf{V}_k\)).
This is the engine behind PCA (Part IX), latent semantic analysis, matrix factorisation for collaborative filtering, image compression, and a long list of other techniques. They are all variations on "take SVD, keep the top \(k\)."
The "energy" perspective
The squared Frobenius norm of \(\mathbf{A}\) equals the sum of squared singular values: \(\|\mathbf{A}\|_F^2 = \sum_i \sigma_i^2\). The fraction of "energy" captured by the rank-\(k\) approximation is \(\sum_{i \le k} \sigma_i^2 / \sum_i \sigma_i^2\). When this fraction is close to 1 for small \(k\), the matrix is effectively low-rank and compression works well. Plotting singular values in descending order is a quick visual diagnostic for this.
If a \(1000 \times 1000\) matrix has singular values \(100, 95, 80, 5, 4, 3, \ldots\) (sharp drop after the first three), what rank \(k\) would give a good approximation?
23. The phylogeny of matrices
Different matrices admit different decompositions. Knowing which one to reach for is half the work of efficient numerical linear algebra.
| Matrix property | Best decomposition | Cost |
|---|---|---|
| Any matrix | SVD: \(\mathbf{A} = \mathbf{U}\mathbf{\Sigma}\mathbf{V}^\top\) | \(O(mn \min(m,n))\) |
| Square, diagonalisable | Eigendecomp: \(\mathbf{A} = \mathbf{P}\mathbf{D}\mathbf{P}^{-1}\) | \(O(n^3)\) |
| Symmetric | Spectral: \(\mathbf{A} = \mathbf{Q}\mathbf{\Lambda}\mathbf{Q}^\top\) | \(O(n^3)\) |
| Symmetric positive-definite | Cholesky: \(\mathbf{A} = \mathbf{L}\mathbf{L}^\top\) | \(O(n^3 / 3)\) |
| Tall and full rank | QR: \(\mathbf{A} = \mathbf{Q}\mathbf{R}\) | \(O(mn^2)\) |
| Square, generic | LU: \(\mathbf{A} = \mathbf{L}\mathbf{U}\) | \(O(2n^3 / 3)\) |
What each is good for
- SVD. Universally applicable. Use it when you need rank, when you need to compress, when you need the pseudoinverse, when nothing else fits.
- Eigendecomposition. Square matrices. Use it for matrix powers, exponentials, principal directions of action.
- Spectral (symmetric eigendecomp). The special case for symmetric matrices. Use it for covariance matrices, Hessians, kernel matrices.
- Cholesky. Symmetric positive-definite. Use it for Gaussian sampling, fast solving of linear systems, computing log-determinants.
- QR. Use it for least-squares problems and Gram-Schmidt.
- LU. The default for "solve any square system." It is what your numerical library does behind the scenes when you call
solve(A, b).
That closes Part III. Decompositions are how you really see what a matrix does, and they are the engine inside every machine learning algorithm that involves a matrix problem (which is most of them). The next part adds calculus on top: how matrix-valued functions change with respect to their inputs.
Given a covariance matrix and asked to sample from the corresponding Gaussian, which decomposition would you reach for? Why?
Part IV. Vector calculus
How a function of many variables changes.
Single-variable calculus tells you how a function of one variable changes. Vector calculus extends that to functions of many variables, and to functions whose outputs are themselves vectors or matrices. Almost every machine learning algorithm trains by computing a gradient and stepping in the direction it points. To understand training, you need to be fluent in the gradient.
24. Differentiation, refresher
The derivative of a single-variable function \(f : \mathbb{R} \to \mathbb{R}\) at a point \(x\) is the limit
Geometrically: the slope of the tangent line at \(x\). Algorithmically: the local rate of change. The function increases at rate \(f'(x)\) per unit increase in \(x\) near that point.
The derivatives you should be able to compute in your sleep:
- Power rule. \(\frac{d}{dx} x^n = n x^{n-1}\).
- Sum rule. \((f + g)' = f' + g'\).
- Product rule. \((fg)' = f'g + fg'\).
- Quotient rule. \((f/g)' = (f'g - fg')/g^2\).
- Chain rule. \(\frac{d}{dx} f(g(x)) = f'(g(x)) \cdot g'(x)\). The most important rule. We will see it again in higher dimensions.
The functions you should know derivatives of by sight: \(e^x \to e^x\), \(\log x \to 1/x\), \(\sin x \to \cos x\), \(\cos x \to -\sin x\), \(\tan x \to \sec^2 x\). Plus the inverse functions: \(\arcsin' x = 1/\sqrt{1-x^2}\), and similar.
Numerical reality
In practice, almost no derivative in machine learning is computed by hand. Frameworks do automatic differentiation: they record the computation graph as you build it, then traverse it backwards to compute gradients. We will look at how this works in Module 29. The hand computation is still useful because it gives you an intuition for which functions are easy to differentiate and which are pathological. Things like absolute values and sharp corners are pathological; smooth functions of standard pieces are easy.
What is the derivative of \(f(x) = \log(1 + e^x)\)? (This is the "softplus" function used in neural networks. Its derivative has a famous form.)
25. Partial derivatives and the gradient
For a function of multiple variables \(f : \mathbb{R}^n \to \mathbb{R}\), there is no single "derivative." Instead, there is one partial derivative for each input variable, holding the others fixed:
It is the rate of change of \(f\) when you move in the \(i\)-th coordinate direction only.
Stack all the partial derivatives together and you get the gradient:
The gradient is a vector. It lives in the same space as the input \(\mathbf{x}\), not in the output space. The two non-negotiable facts about it:
- It points in the direction of steepest ascent. Among all directions you could step in from \(\mathbf{x}\), the gradient direction is the one along which \(f\) increases fastest.
- Its length is the rate of that increase. The function increases by \(\|\nabla f\| \cdot \epsilon\) (to first order) per step of size \(\epsilon\) in the gradient direction.
Both facts follow from the directional derivative: the rate of change of \(f\) in direction \(\mathbf{u}\) (a unit vector) is \(\nabla f \cdot \mathbf{u}\). By Cauchy-Schwarz (Module 12), this is maximised when \(\mathbf{u}\) is parallel to \(\nabla f\).
Critical points
A point where \(\nabla f = \mathbf{0}\) is a critical point. At a critical point, the function is locally flat: any direction you move, the function does not change to first order. Critical points come in three flavours: local minima, local maxima, and saddle points. To distinguish them you need second-order information (the Hessian, Module 30).
Optimisation, in the simplest framing, is "find a critical point that is also a minimum." We will spend most of Part VI on the algorithms for doing this.
For \(f(x, y) = x^2 + 3xy + y^2\), compute \(\nabla f\). What is its value at the origin? What does that tell you?
26. The Jacobian
For a function \(\mathbf{f} : \mathbb{R}^n \to \mathbb{R}^m\) (vector input, vector output), the gradient generalises to the Jacobian: an \(m \times n\) matrix of all the partial derivatives.
The Jacobian is the best linear approximation of \(\mathbf{f}\) at the point \(\mathbf{x}\). Near \(\mathbf{x}\), the function behaves approximately as
This is the multidimensional version of the tangent line approximation. The Jacobian matrix \(\mathbf{J}_{\mathbf{f}}(\mathbf{x})\) is the linear map that captures, locally, how the function deforms space.
Special cases worth naming
- If \(\mathbf{f}\) outputs a scalar (\(m = 1\)), the Jacobian is a \(1 \times n\) row vector. By convention, its transpose is the gradient.
- If \(\mathbf{f}\) takes a scalar input (\(n = 1\)), the Jacobian is an \(m \times 1\) column vector: just the derivative of each output component.
- If \(\mathbf{f}\) is itself linear (i.e., \(\mathbf{f}(\mathbf{x}) = \mathbf{A}\mathbf{x}\)), the Jacobian is constantly \(\mathbf{A}\) at every point. Linear functions are their own best linear approximation.
Determinant of the Jacobian
For a square Jacobian (\(m = n\)), the determinant tells you the local volume scaling of the map. This is the change-of-variables formula in multidimensional integration: when you change variables from \(\mathbf{x}\) to \(\mathbf{u} = \mathbf{f}(\mathbf{x})\), the volume element transforms as \(d\mathbf{u} = |\det \mathbf{J}_{\mathbf{f}}| \, d\mathbf{x}\). It is the multidimensional analogue of \(du = f'(x) \, dx\) in single-variable substitution.
This shows up in probability when you push a random variable through a function and want to track how its density changes. We will use it explicitly in Module 41.
For \(\mathbf{f}(x, y) = (x^2 + y, xy)\), compute the Jacobian. What is its determinant?
27. Matrix calculus rules
Matrix calculus is what you actually do in ML: take a derivative of a scalar with respect to a vector or matrix, or a vector with respect to a vector. The rules are the same as ordinary calculus, but the bookkeeping is fiddly because shapes have to align.
Two conventions
For \(f : \mathbb{R}^n \to \mathbb{R}\), there are two ways to organise \(\partial f / \partial \mathbf{x}\). The "numerator layout" makes it a row vector. The "denominator layout" makes it a column vector. Modern ML mostly uses the denominator layout, where the gradient is a column vector matching the shape of \(\mathbf{x}\). This note uses denominator layout throughout.
Useful identities
For vector \(\mathbf{x}\) and matrix \(\mathbf{A}\):
The third one collapses to \(2\mathbf{A}\mathbf{x}\) when \(\mathbf{A}\) is symmetric, which it is in essentially every place this expression shows up in ML (least squares, quadratic forms, Hessians).
For matrix \(\mathbf{X}\):
Where it gets messy
The minute you write a derivative of a matrix-valued function with respect to a matrix, you have a four-index tensor on your hands. You can vectorise to keep things in matrix form, or you can use Einstein summation conventions, or you can just trust your autodiff library. In practice, hand-derivations stop at the level of "scalar with respect to a vector or matrix"; anything more goes through automatic differentiation (Module 29).
The single best resource for matrix calculus identities is Petersen and Pedersen's "The Matrix Cookbook," which is freely available. Bookmark it. The first two pages contain about 80% of what you need.
For the squared-error loss \(L(\mathbf{w}) = \|\mathbf{y} - \mathbf{X}\mathbf{w}\|^2\), where \(\mathbf{y}\) and \(\mathbf{X}\) are fixed, compute \(\partial L / \partial \mathbf{w}\).
28. The chain rule, in full
The single-variable chain rule says \(\frac{d}{dx}f(g(x)) = f'(g(x)) g'(x)\). The multivariate version is the same idea: derivatives compose by multiplication. The bookkeeping just gets fancier.
The general statement
For composed functions \(\mathbf{f} : \mathbb{R}^n \to \mathbb{R}^m\) and \(\mathbf{g} : \mathbb{R}^p \to \mathbb{R}^n\), the Jacobian of \(\mathbf{f} \circ \mathbf{g}\) at a point \(\mathbf{x}\) is the product of Jacobians:
Read it left to right: first apply the outer function's Jacobian (evaluated at the inner function's output), then the inner function's Jacobian (evaluated at the original input). The dimensions line up: \((m \times n)\) times \((n \times p)\) gives \((m \times p)\), which is the correct shape for the Jacobian of a function from \(\mathbb{R}^p\) to \(\mathbb{R}^m\).
The scalar-output special case
If the final output is a scalar (the loss, in ML), the chain rule reduces to a series of matrix-vector multiplications. For \(L = f(\mathbf{a})\) where \(\mathbf{a} = \mathbf{g}(\mathbf{x})\):
The transpose of the Jacobian acts on the upstream gradient to produce the downstream gradient. This single equation is the engine of backpropagation.
Why this is the entire story of ML training
A neural network is a deeply composed function: \(L = \ell(f_K(f_{K-1}(\cdots f_1(\mathbf{x}))))\), where each \(f_i\) is one layer (a linear map followed by a non-linearity). To compute \(\partial L / \partial \theta_i\) for some parameter in layer \(i\), you apply the chain rule \(K - i + 1\) times. Conceptually, that is the whole gradient computation.
The clever part is the order of evaluation, which we cover next.
For \(L = (\sigma(\mathbf{w}^\top \mathbf{x}) - y)^2\) (squared error of a single neuron), apply the chain rule to compute \(\partial L / \partial \mathbf{w}\).
29. Backpropagation
Suppose your computation graph is a long chain: \(\mathbf{x}_0 \to \mathbf{x}_1 \to \mathbf{x}_2 \to \cdots \to \mathbf{x}_K = L\). The chain rule says \(\partial L / \partial \mathbf{x}_0\) is a product of \(K\) Jacobians. The question is in what order you multiply them.
Two natural orders:
- Forward mode. Start with the identity at the input, multiply by the first Jacobian, multiply by the second, and so on. After step \(k\) you have \(\partial \mathbf{x}_k / \partial \mathbf{x}_0\). At the end you have the full gradient.
- Reverse mode. Start with the identity at the output, multiply by the last Jacobian (transposed), then the second-to-last, and so on. After step \(k\) you have \(\partial L / \partial \mathbf{x}_{K-k}\). At the end you have what you want.
Mathematically, both produce the same answer. Computationally, they differ enormously. The cost of multiplying a chain of matrices \(\mathbf{A}_1 \mathbf{A}_2 \cdots \mathbf{A}_K\) depends on the shapes. Forward mode is cheap when the input is low-dimensional and the output is high-dimensional. Reverse mode is cheap when the input is high-dimensional and the output is low-dimensional.
In ML, the loss is a single scalar (output dimension 1) and the parameters are very high-dimensional (millions, sometimes billions). Reverse mode is dramatically faster. Backpropagation is reverse-mode automatic differentiation applied to neural network training. The algorithm is generic and works on any computation graph.
The two-pass structure
To compute the gradient of a loss with respect to all parameters, autodiff frameworks do two passes through the computation graph:
- Forward pass. Evaluate the function at the input, recording the value of every intermediate quantity. Output: the loss, plus a saved tape of intermediate values.
- Backward pass. Walk the tape in reverse, accumulating gradients. At each node you know the upstream gradient (\(\partial L / \partial \mathbf{x}_k\)) and you compute the local Jacobian and use the chain rule to get the gradient at the next node back.
The total cost is roughly twice the cost of the forward pass, regardless of the number of parameters. This is the deep algorithmic fact that makes deep learning practical.
Why is reverse-mode autodiff (backprop) faster than forward-mode for neural network training? Why is the opposite true for, say, computing all sensitivities of a single-output simulator with respect to its many inputs? (Trick question: it isn't.)
30. The Hessian and second-order info
The gradient is a first-order object: it tells you the slope. To know whether you are at a minimum, a maximum, or a saddle point, you need second derivatives.
The Hessian of a scalar function \(f : \mathbb{R}^n \to \mathbb{R}\) is the \(n \times n\) matrix of second partial derivatives:
Symmetric, by the equality of mixed partials (assuming \(f\) is twice continuously differentiable, which it is for everything you will see in this note).
What the Hessian tells you
At a critical point (\(\nabla f = \mathbf{0}\)), the Hessian classifies it:
- If \(\mathbf{H}\) is positive definite (all eigenvalues positive), the critical point is a local minimum. The function curves upward in every direction.
- If negative definite (all eigenvalues negative), local maximum.
- If the eigenvalues are mixed (some positive, some negative), it is a saddle point. The function curves up in some directions and down in others.
- If some eigenvalues are zero, the second derivative test is inconclusive; you need higher-order information.
Newton's method
Gradient descent uses only first-order information and takes steps proportional to the gradient. Newton's method uses both: it solves the local quadratic approximation exactly. The update rule is
It converges quadratically near a minimum (the number of correct digits doubles each step), but each step requires solving an \(n \times n\) linear system, which is \(O(n^3)\). For ML problems with millions of parameters, full Newton is infeasible. Quasi-Newton methods like L-BFGS approximate the Hessian inverse with a few low-rank updates and are widely used for medium-scale problems.
For deep learning, plain stochastic gradient descent and its momentum-based variants are usually preferred over Newton, because computing or even approximating the Hessian is too expensive. We come back to this in Part VI.
For \(f(x, y) = x^2 + 4xy + y^2\), compute the Hessian. Is the origin a minimum, maximum, or saddle point?
31. Multivariate Taylor series
The single-variable Taylor series approximates a function near a point as a polynomial:
The multivariate version uses gradient and Hessian:
Truncated at first order, it gives the tangent plane (the multidimensional analogue of the tangent line). Truncated at second order, it gives the local quadratic bowl that captures both slope and curvature. Higher-order terms involve tensors of partial derivatives and are rarely worth the trouble.
Where it shows up
- Newton's method minimises this second-order approximation exactly, then re-approximates at the new point.
- Local convergence analysis of optimisation algorithms uses Taylor expansions to argue about behaviour near a critical point.
- Trust region methods use a quadratic Taylor model within a region they "trust" to be accurate, and shrink the region if the actual function disagrees.
- Linearisation in physics and signal processing is the first-order Taylor approximation: replace the non-linear function by its tangent and analyse a much simpler linear problem.
That closes Part IV. You can now compute gradients, Jacobians, and Hessians, and you understand backpropagation as reverse-mode chain rule. The next part is a parallel chapter on probability, which gives you the language to talk about uncertain data.
Write the second-order Taylor expansion of \(f(x, y) = e^{xy}\) around the origin.
Part V. Probability
The language of uncertain data.
Real data is noisy. Real predictions are uncertain. Real models are wrong in interesting ways. Probability is the formal language for talking about all of that. Once you can write a model as a joint distribution and update it with data, you have the framework that nearly every machine learning algorithm is a special case of.
32. What probability is
A probability is a number between 0 and 1 assigned to an event. The interpretation has been argued for three centuries and there are two camps that you should know about.
Frequentist
"The probability of an event is the long-run fraction of trials in which it occurs." Flip a fair coin many times; the fraction of heads converges to 1/2. Probability is a property of the world, measured by repeating an experiment.
Bayesian
"The probability of an event is your degree of belief in it." There may be no repetition; the event might be a one-off (will it rain tomorrow? Did the defendant commit the crime?). Probability is a property of the observer's knowledge, updated by data via Bayes' rule.
The two interpretations agree in cases where both make sense, and modern ML uses both fluidly without much philosophical anxiety. We will use whichever makes a particular argument easier. Both views obey the same mathematical rules, which is what we care about.
The axioms
Whatever a probability "is," it satisfies three rules. For events on a sample space \(\Omega\):
- \(P(A) \ge 0\) for any event \(A\).
- \(P(\Omega) = 1\). Something must happen.
- If \(A_1, A_2, \ldots\) are pairwise disjoint events, \(P(A_1 \cup A_2 \cup \cdots) = P(A_1) + P(A_2) + \cdots\). The probability of "any of these" is the sum of probabilities, when no two can co-occur.
From these three you derive everything else: \(P(\emptyset) = 0\), \(P(A^c) = 1 - P(A)\), \(P(A \cup B) = P(A) + P(B) - P(A \cap B)\), and so on.
If \(P(A) = 0.6\), \(P(B) = 0.5\), and \(P(A \cap B) = 0.3\), what is \(P(A \cup B)\)? What is \(P(A^c \cap B^c)\)?
33. Discrete random variables
A random variable is a function from the sample space to the real numbers. Roll a die; the random variable might be "the value rolled" (1 through 6) or "1 if even, 0 if odd." Practically, you can ignore the formalism and just think of a random variable as "a number whose value is uncertain."
A discrete random variable takes values in a countable set (often a finite set, like \(\{1, 2, 3, 4, 5, 6\}\) or \(\{0, 1\}\)). It is described by its probability mass function (PMF), which assigns a probability to each possible value:
Common discrete distributions
- Bernoulli\((\theta)\). Single coin flip with bias \(\theta\). Takes value 1 with probability \(\theta\), 0 with probability \(1 - \theta\). The simplest non-trivial distribution.
- Binomial\((n, \theta)\). Sum of \(n\) independent Bernoulli\((\theta)\). Counts how many heads in \(n\) flips. PMF: \(p(k) = \binom{n}{k} \theta^k (1-\theta)^{n-k}\).
- Categorical\((\boldsymbol{\theta})\). One draw from a finite set with probabilities \(\theta_1, \theta_2, \ldots, \theta_K\) (which sum to 1). Bernoulli is the \(K=2\) case.
- Poisson\((\lambda)\). Counts of rare events in a fixed window: emails per minute, defects per chip, soldiers in 1898 Prussia kicked by horses. PMF: \(p(k) = \lambda^k e^{-\lambda} / k!\) for \(k = 0, 1, 2, \ldots\).
- Geometric\((\theta)\). Number of failures before the first success in independent Bernoulli\((\theta)\) trials.
Each one is a model of a specific data-generating process. Picking the right distribution for your problem is half the work of probabilistic modelling.
You flip a coin 10 times and want to know the probability of exactly 3 heads. Which distribution describes this, and what is the answer in terms of its PMF?
34. Continuous random variables
A continuous random variable can take values in an uncountable set, typically an interval of real numbers. The probability of any single exact value is zero (the temperature being exactly 23.000000...°C has probability zero). Instead, you describe a continuous random variable by a probability density function (PDF), \(p(x)\), which has the property that
The integral over the whole real line equals 1. Densities are not probabilities, they have units (probability per unit \(x\)), and a density value can be larger than 1, but they can be integrated to give probabilities of intervals.
The cumulative distribution function
Easier to interpret than the density: the CDF is
Always between 0 and 1. Always non-decreasing. Approaches 0 at \(-\infty\) and 1 at \(+\infty\). The CDF is well-defined for any real-valued random variable, discrete or continuous, which makes it the most universal description.
Common continuous distributions
- Uniform\([a, b]\). Constant density on \([a, b]\), zero elsewhere. The simplest continuous distribution.
- Exponential\((\lambda)\). Density \(p(x) = \lambda e^{-\lambda x}\) for \(x \ge 0\). Models waiting times: how long until the next event in a Poisson process.
- Gaussian (Normal) \(\mathcal{N}(\mu, \sigma^2)\). The bell curve, density \(p(x) = \frac{1}{\sigma\sqrt{2\pi}} e^{-(x-\mu)^2/(2\sigma^2)}\). Module 39 is dedicated to it; you will see it everywhere.
- Beta\((\alpha, \beta)\). Distribution on \([0, 1]\), used to model probabilities themselves. Conjugate prior to the Bernoulli.
- Gamma\((\alpha, \beta)\). Generalisation of the exponential. Used to model positive-valued quantities like waiting times and rates.
For an Exponential\((\lambda)\) random variable, what is the probability that \(X > c\)? Compute it from the density.
35. Joint, marginal, conditional
When you have multiple random variables, three questions arise: what is the joint behaviour, what does each look like by itself, and what does one look like given a value of another?
Joint distribution
The joint distribution of two random variables \(X\) and \(Y\) describes the probability of every pair of values:
Marginal distribution
To get the distribution of \(X\) alone (forgetting about \(Y\)), sum or integrate out the other variable:
This is called marginalisation. It is the workhorse operation of probabilistic inference: whenever you have a joint distribution involving variables you do not care about, you marginalise them away.
Conditional distribution
The distribution of \(X\) given that you know \(Y = y\) is
Read it as "the joint divided by the marginal of the variable you are conditioning on." The conditional is itself a distribution: it sums (or integrates) to 1 over \(x\) for any fixed \(y\) where \(p(y) > 0\).
The product rule
Rearranging the conditional formula gives
The joint factors into a marginal and a conditional, in either order. This is the product rule of probability and it is the algebraic engine of Bayes' theorem.
You have a joint distribution \(p(x, y)\) given as a table. How do you read off the marginal \(p(x)\) and the conditional \(p(y \mid x)\)?
36. Bayes' theorem
Combine the product rule with itself: \(p(x, y) = p(x \mid y) p(y) = p(y \mid x) p(x)\). Solve for \(p(x \mid y)\):
That is Bayes' theorem. Three lines of algebra; an entire philosophy of inference.
The four pieces
In ML the formula is usually written with \(\theta\) for parameters and \(\mathcal{D}\) for data:
The posterior is what you want: the updated belief about parameters after seeing data. The likelihood is your model: how likely is this data, given particular parameters. The prior is your belief before seeing the data. The evidence is the marginal probability of the data; it is the normalising constant that makes the posterior a proper distribution.
The evidence is often the hardest piece to compute. For most non-trivial models it requires an integral over all possible parameter values, which is intractable in closed form. This is why so much of probabilistic ML is about approximating the evidence (variational inference, MCMC, etc.).
The classical example: medical testing
A test for a rare disease has 99% sensitivity (true positive rate) and 99% specificity (true negative rate). The disease has prevalence 0.1% in the population. You test positive. What is the probability you have the disease?
Apply Bayes:
Nine percent. Even with a 99%-accurate test, the rarity of the disease means most positives are false positives. This counterintuitive answer is the canonical example of why prior matters and why Bayesian reasoning is non-negotiable in medical and forensic settings.
Re-do the medical test computation if the disease has 10% prevalence instead of 0.1%. What is \(P(D \mid +)\) now?
37. Expectation, variance, covariance
Three numerical summaries of a random variable that you should be able to compute in your sleep.
Expectation
The expectation (or mean) of a random variable \(X\) is its weighted average:
It is the centre of mass of the distribution. It is linear: \(\mathbb{E}[aX + bY] = a \mathbb{E}[X] + b \mathbb{E}[Y]\) for any random variables \(X\) and \(Y\) and any scalars \(a, b\). Linearity of expectation does not require independence; it is one of the cleanest tools in all of probability.
Variance
The variance measures spread:
The square root is the standard deviation, which has the same units as \(X\). Variance is non-linear: \(\text{Var}(aX) = a^2 \text{Var}(X)\), and \(\text{Var}(X + Y) = \text{Var}(X) + \text{Var}(Y) + 2 \text{Cov}(X, Y)\).
Covariance
The covariance of two random variables is
It measures how much they move together. Positive covariance means \(X\) and \(Y\) tend to be both above their means together (and both below). Negative covariance means they tend to move oppositely. Zero covariance means they are uncorrelated, but not necessarily independent.
The covariance matrix
For a random vector \(\mathbf{X} = (X_1, \ldots, X_n)^\top\), the covariance matrix \(\boldsymbol{\Sigma}\) has entries \(\Sigma_{ij} = \text{Cov}(X_i, X_j)\). It is symmetric (covariance is symmetric in its arguments) and positive semidefinite (any quadratic form \(\mathbf{a}^\top \boldsymbol{\Sigma} \mathbf{a} = \text{Var}(\mathbf{a}^\top \mathbf{X}) \ge 0\)). The diagonal entries are the variances of each component; the off-diagonal entries are the covariances between pairs.
Covariance matrices are the central object of multivariate statistics. PCA finds their eigendecomposition. Gaussian processes parameterise their covariance with a kernel function. The Mahalanobis distance is a quadratic form using the inverse covariance.
Roll a fair die. What is the expectation? What is the variance?
38. Independence
Two random variables \(X\) and \(Y\) are independent if knowing one tells you nothing about the other. Formally:
Equivalently \(p(x \mid y) = p(x)\), the conditional equals the marginal. Several properties follow:
- \(\mathbb{E}[XY] = \mathbb{E}[X]\mathbb{E}[Y]\). Expectations of products factor for independent random variables.
- \(\text{Cov}(X, Y) = 0\). Independent implies uncorrelated.
- \(\text{Var}(X + Y) = \text{Var}(X) + \text{Var}(Y)\). Variances add.
The reverse implications are not all true. Uncorrelated does not imply independent in general. They are equivalent only for jointly Gaussian random variables, which is one of the things that makes Gaussians so pleasant.
Conditional independence
\(X\) and \(Y\) are conditionally independent given \(Z\) if
Knowing \(Z\) makes \(X\) and \(Y\) independent of each other. This is weaker than full independence: \(X\) and \(Y\) might be correlated marginally, but the correlation is "explained" by \(Z\). Conditional independence assumptions are the structural backbone of graphical models (Bayesian networks and Markov random fields), where missing edges between nodes encode conditional independencies.
Why independence matters
Most ML training assumes the data points are iid: independent and identically distributed. This assumption simplifies the math (the likelihood factors into a product over points) and is approximately true for many real datasets. When it fails (time series, sequences with structure), you have to use models that explicitly account for the dependence (HMMs, RNNs, transformers).
Two coins are flipped independently. Let \(X\) be the result of the first, \(Y\) the result of the second, and \(Z = X \oplus Y\) (their XOR). Are \(X\) and \(Y\) independent? Are \(X\) and \(Z\)? Are \(X\) and \(Z\) given \(Y\)?
39. The Gaussian distribution
The most important continuous distribution in machine learning. Two reasons. First, it shows up naturally as the limit of sums (the central limit theorem). Second, it has unusually pleasant algebraic properties: it is closed under linear operations, conditioning, and marginalisation.
The univariate Gaussian
Two parameters: the mean \(\mu\) and the variance \(\sigma^2\). The density is symmetric about \(\mu\), peaks at \(\mu\), and decays exponentially in the squared distance. About 68% of the mass is within one standard deviation, 95% within two, 99.7% within three.
The multivariate Gaussian
For a random vector \(\mathbf{X} \in \mathbb{R}^n\):
Two parameters: a mean vector \(\boldsymbol{\mu} \in \mathbb{R}^n\) and a positive-definite covariance matrix \(\boldsymbol{\Sigma}\). The density's level sets are ellipsoids centered at \(\boldsymbol{\mu}\), with shape and orientation determined by the eigenvectors and eigenvalues of \(\boldsymbol{\Sigma}\).
Properties that make Gaussians special
- Linear transformations preserve Gaussianity. If \(\mathbf{X} \sim \mathcal{N}(\boldsymbol{\mu}, \boldsymbol{\Sigma})\) and \(\mathbf{Y} = \mathbf{A}\mathbf{X} + \mathbf{b}\), then \(\mathbf{Y} \sim \mathcal{N}(\mathbf{A}\boldsymbol{\mu} + \mathbf{b}, \mathbf{A}\boldsymbol{\Sigma}\mathbf{A}^\top)\).
- Marginals are Gaussian. If you marginalise out some components of a multivariate Gaussian, the result is still Gaussian on the remaining components.
- Conditionals are Gaussian. Conditioning on a subset of components also gives a Gaussian, and there are explicit formulas for the conditional mean and covariance.
- Sums are Gaussian. If \(\mathbf{X}\) and \(\mathbf{Y}\) are independent Gaussians, \(\mathbf{X} + \mathbf{Y}\) is Gaussian.
No other distribution has this combination. It is why Gaussians dominate Kalman filters, Gaussian processes, mixture models, Bayesian linear regression, and most of probabilistic ML.
Sampling from a Gaussian
To sample \(\mathbf{X} \sim \mathcal{N}(\boldsymbol{\mu}, \boldsymbol{\Sigma})\): sample \(\mathbf{Z} \sim \mathcal{N}(\mathbf{0}, \mathbf{I})\) (each component an independent standard normal), then set \(\mathbf{X} = \boldsymbol{\mu} + \mathbf{L}\mathbf{Z}\) where \(\mathbf{L}\) is the Cholesky factor of \(\boldsymbol{\Sigma}\) (Module 20). The transformation has the right mean and covariance, and a linear function of a Gaussian is Gaussian.
For a multivariate Gaussian with diagonal covariance \(\boldsymbol{\Sigma} = \text{diag}(\sigma_1^2, \ldots, \sigma_n^2)\), what does the level-set ellipsoid look like? What are the principal axes?
40. Conjugate priors and the exponential family
Bayesian inference asks you to compute the posterior \(p(\theta \mid \mathcal{D}) \propto p(\mathcal{D} \mid \theta) p(\theta)\). For arbitrary priors and likelihoods, this product gives a posterior in some unfamiliar functional form, and computing the normalising constant is hard. Conjugate priors are special prior choices that make the posterior have the same functional form as the prior.
Bernoulli-Beta
The classical example. If your data is Bernoulli with parameter \(\theta\), and you put a Beta\((\alpha, \beta)\) prior on \(\theta\), the posterior is Beta\((\alpha + s, \beta + n - s)\), where \(n\) is the number of trials and \(s\) is the number of successes. The posterior is just a Beta with updated parameters; you do not have to compute any integrals.
Gaussian-Gaussian
If your data is Gaussian with known variance and unknown mean, and you put a Gaussian prior on the mean, the posterior is Gaussian. The Gaussian is conjugate to itself (with respect to the mean parameter). This is what makes Bayesian linear regression analytically tractable (Module 56).
The exponential family
These pleasant pairings are not coincidences. They are special cases of a deeper structure. A distribution is in the exponential family if its density can be written as
where \(\boldsymbol{\eta}\) are the natural parameters, \(\mathbf{T}(\mathbf{x})\) are the sufficient statistics, \(h(\mathbf{x})\) is a base measure, and \(A(\boldsymbol{\eta})\) is the log-normaliser. Bernoulli, Binomial, Poisson, Gaussian, Beta, Gamma, Exponential, Categorical, almost every named distribution is in this family.
Every exponential family has a conjugate prior, with parameters that look like "pseudo-counts" of pseudo-observations. The posterior then has the same form, with the parameters incremented by the actual data's sufficient statistics. This unification is the deep reason the conjugate-pair magic works.
You do not have to memorise the formal machinery to use it. What you need to know: when designing a probabilistic model, picking priors that are conjugate to your likelihood gives you closed-form updates and makes everything orders of magnitude easier. When the conjugate prior is unsuitable, you abandon analytic tractability and reach for numerical methods (variational inference, MCMC).
You observe 7 heads in 10 coin flips. With a Beta\((1, 1)\) prior on the bias \(\theta\), what is the posterior?
41. Change of variables
You have a random variable \(X\) with density \(p_X(x)\). You define a new random variable \(Y = f(X)\) for some smooth invertible function \(f\). What is the density of \(Y\)?
For a single variable, the answer is
The density at \(y\) equals the density at the corresponding \(x\) value, multiplied by a Jacobian factor that accounts for how much the transformation stretches or squishes space. If \(f\) compresses a region of \(x\) into a smaller region of \(y\), the density goes up to compensate.
Multidimensional version
For \(\mathbf{Y} = \mathbf{f}(\mathbf{X})\) with \(\mathbf{f}\) invertible, the same formula holds with the absolute determinant of the Jacobian:
Equivalently, using the forward Jacobian: \(p_\mathbf{Y}(\mathbf{y}) = p_\mathbf{X}(\mathbf{x}) / |\det \mathbf{J}_{\mathbf{f}}(\mathbf{x})|\), where \(\mathbf{x} = \mathbf{f}^{-1}(\mathbf{y})\).
Why this matters
- Reparameterisation tricks in variational inference and normalising flows are all change-of-variable formulas applied to neural-network-shaped \(\mathbf{f}\).
- Inverse transform sampling: to sample from any distribution, sample uniformly from \([0, 1]\) and apply the inverse CDF. The change-of-variable formula proves that this works.
- Jacobian determinants in deep learning appear when training generative models that learn invertible transformations.
That closes Part V. You can now talk about random variables, joint distributions, conditional distributions, expectations, and the Gaussian. Part VI puts the calculus from Part IV and the geometry from Parts I-III to work in the form of optimisation algorithms.
If \(X \sim \text{Uniform}[0, 1]\) and \(Y = -\log(X)\), what is the density of \(Y\)?
Part VI. Continuous optimisation
Finding the lowest point on a landscape.
Training a machine learning model is almost always an optimisation problem: find the parameters that minimise a loss. The loss is a function from a parameter space to a real number, and you want the input that produces the smallest output. This part covers how to do that systematically.
42. The optimisation problem
The standard form of a continuous optimisation problem is
You want to minimise an objective function \(f\) over some feasible set defined by inequality constraints \(g_i\) and equality constraints \(h_j\). When there are no constraints, the problem is unconstrained, which is the simplest case and the focus of the next two modules.
Minima vs maxima
Maximising \(f\) is the same as minimising \(-f\). All conventions in this part assume minimisation; you flip a sign for the maximisation case.
Local vs global
A local minimum is a point \(\mathbf{x}^*\) where \(f(\mathbf{x}^*) \le f(\mathbf{x})\) for all \(\mathbf{x}\) in some neighbourhood. A global minimum is one where this holds over the entire feasible set. For convex problems (Module 46), every local minimum is global. For non-convex problems (most of deep learning), local minima can be very different from the global minimum, and finding the latter is in general impossible.
In practice, "good local minimum" is the realistic goal for non-convex training. There is empirical and theoretical evidence that for over-parameterised neural networks, most local minima are roughly equally good, which is one reason gradient descent works as well as it does.
First-order optimality
For a smooth unconstrained \(f\), a necessary condition for \(\mathbf{x}^*\) to be a local minimum is
The gradient must vanish. This is necessary but not sufficient, saddle points and maxima also satisfy it. To distinguish, you need second-order information (the Hessian, Module 30).
Sometimes you can solve \(\nabla f = \mathbf{0}\) in closed form. For least-squares regression, you can. For most ML problems, you cannot, and you reach for iterative methods.
For \(f(x) = x^4 - 2x^2\), find all critical points by solving \(f'(x) = 0\). Which are minima, maxima, or saddles?
43. Gradient descent
The single most important optimisation algorithm in ML. The idea is direct: at any point, the gradient points in the direction of steepest ascent. To go down, take a step in the opposite direction.
The scalar \(\alpha > 0\) is the learning rate (or step size). Choosing it well is half the work.
What makes a good learning rate
Too small: the algorithm crawls; you waste compute and may not reach a minimum in any reasonable time. Too large: the algorithm overshoots and diverges, with iterates spiralling outward forever. Just right: you make steady progress without overshooting.
For convex quadratic functions, the optimal fixed learning rate is \(\alpha = 2 / (\lambda_{\max} + \lambda_{\min})\) where \(\lambda\)'s are eigenvalues of the Hessian. Convergence is at a rate determined by the condition number \(\kappa = \lambda_{\max} / \lambda_{\min}\), high condition number means slow convergence. This is the source of the famous "narrow valley" problem: gradient descent zigzags inefficiently in long narrow minima.
Stochastic gradient descent
For ML problems where the loss is a sum over many data points (which is essentially all of them), computing the full gradient at each step is prohibitively expensive when there are millions of points. Stochastic gradient descent (SGD) replaces the full gradient with a gradient computed on a single random data point (or a small batch):
Each step is much cheaper. The trade-off: each step is noisier, so you take more steps to reach the same accuracy. In practice, SGD with mini-batches (\(B\) points instead of 1) hits the sweet spot for hardware: enough work to amortise overhead, small enough to keep variance manageable.
Why SGD often beats GD
Counterintuitively, the noise in SGD can help. It acts as a form of regularisation, preventing the algorithm from settling into sharp local minima that generalise poorly. SGD also gets unstuck from saddle points (which are abundant in high-dimensional non-convex losses) more easily than full-batch GD. Both effects are why SGD remains the dominant training algorithm for neural networks despite seeming "worse" on paper.
What happens to gradient descent on a 1D function \(f(x) = x^2\) if you set \(\alpha = 1.5\)? Trace a few iterates by hand.
44. Step-size schedules and momentum
Plain SGD with a fixed learning rate is rarely the best you can do. Two refinements show up in nearly every modern optimiser.
Decaying step size
You start with a large learning rate to make progress fast, then shrink it over time so the algorithm settles. Common schedules:
- Step decay. Multiply \(\alpha\) by 0.1 every \(K\) epochs.
- Polynomial decay. \(\alpha_k = \alpha_0 (1 + k/T)^{-p}\).
- Cosine annealing. \(\alpha\) follows a cosine curve from \(\alpha_0\) to 0 over training. Surprisingly effective.
- Warm restart. Periodically reset \(\alpha\) to its initial value and decay again. Helps escape local minima.
Momentum
Plain gradient descent zigzags in narrow valleys because the gradient direction can flip from step to step. Momentum averages over recent gradients to smooth this out:
The "velocity" \(\mathbf{v}\) accumulates an exponentially weighted average of past gradients. The hyperparameter \(\beta \in [0, 1)\) controls how much memory it has; \(\beta = 0.9\) is typical. Physically, you can think of the parameter as a heavy ball rolling down the loss surface: it builds up speed in the average gradient direction and is less perturbed by individual noisy gradients.
Adam
The current default optimiser for almost everything: Adam combines momentum with a per-parameter scaling that adapts based on the magnitude of recent gradients. The update is
(Plus a bias correction in the early iterations.) The first moment \(\mathbf{m}\) is momentum. The second moment \(\mathbf{v}\) tracks the variance of recent gradients per parameter. Dividing by \(\sqrt{\mathbf{v}}\) gives parameters with consistently large gradients smaller updates, and parameters with small gradients larger updates. The end effect: Adam works well with relatively little tuning across a wide range of problems, which is why it dominates.
Variants worth knowing exist (RMSProp, AdamW, AdaGrad, NAdam) but the core ideas are momentum + per-parameter scaling.
Why does momentum help in narrow valleys? Sketch the trajectory of plain GD vs GD-with-momentum on an elongated quadratic.
45. Constrained optimisation
What if you have constraints? You want to minimise \(f(\mathbf{x})\) subject to \(g_i(\mathbf{x}) \le 0\) and \(h_j(\mathbf{x}) = 0\). Naive gradient descent can violate the constraints.
Equality constraints and Lagrange multipliers
For pure equality constraints, the classical method is Lagrange multipliers. Form the Lagrangian:
The optimum lies at a stationary point of \(\mathcal{L}\) where \(\nabla_\mathbf{x} \mathcal{L} = \mathbf{0}\) and \(\nabla_{\boldsymbol{\lambda}} \mathcal{L} = \mathbf{0}\) (the latter just restates \(h_j = 0\)). Geometrically: at an optimum, the gradient of \(f\) is a linear combination of the gradients of the active constraints. The constraints "pull" the optimum off the unconstrained gradient direction.
Inequality constraints and KKT
For inequality constraints, you build a Lagrangian with a multiplier per constraint, but the conditions are richer. They are the Karush-Kuhn-Tucker (KKT) conditions, covered in Module 47.
Practical approaches
For modern ML, three workhorses:
- Penalties. Add a penalty term to the objective that grows when constraints are violated. \(f(\mathbf{x}) + \mu \sum_j h_j(\mathbf{x})^2\). Drive \(\mu \to \infty\) and you recover the constrained problem. Easy to implement; can be ill-conditioned at high \(\mu\).
- Projected gradient descent. After each gradient step, project the result back onto the feasible set. Works when the projection is cheap (e.g., box constraints, simplex constraints).
- Augmented Lagrangian / interior point methods. More sophisticated; the basis of solvers in convex optimisation libraries.
For ML specifically, most problems are unconstrained or have box constraints (e.g., a probability simplex), and these tools are sufficient.
Minimise \(f(x, y) = x^2 + y^2\) subject to \(x + y = 1\) using Lagrange multipliers.
46. Convex sets and convex functions
Convexity is the property that distinguishes "easy" optimisation problems from "hard" ones.
Convex sets
A set \(C\) is convex if for every two points \(\mathbf{x}, \mathbf{y} \in C\) and every \(\theta \in [0, 1]\), the point \(\theta \mathbf{x} + (1-\theta) \mathbf{y}\) is also in \(C\). The line segment between any two points stays inside.
Examples: \(\mathbb{R}^n\) itself. Half-spaces \(\{\mathbf{x} : \mathbf{a}^\top \mathbf{x} \le b\}\). Balls. Ellipsoids. Polyhedra. The intersection of convex sets is convex (this is how complex feasible sets get built up). The union of two convex sets is generally not convex.
Convex functions
A function \(f\) is convex if its domain is convex and
for all \(\mathbf{x}, \mathbf{y}\) in the domain and \(\theta \in [0, 1]\). Geometrically: the chord between any two points on the graph lies above the graph itself. The function curves upward (or stays linear).
Equivalent characterisations for twice-differentiable \(f\):
- The Hessian is positive semidefinite everywhere.
- The graph lies above every tangent plane: \(f(\mathbf{y}) \ge f(\mathbf{x}) + \nabla f(\mathbf{x})^\top (\mathbf{y} - \mathbf{x})\) for all \(\mathbf{x}, \mathbf{y}\).
Why convexity matters
For a convex function over a convex set, two beautiful facts hold:
- Every local minimum is a global minimum. No need to worry about getting stuck.
- The set of global minima is itself convex. If there are multiple global minima, the line segment between any two of them also consists of global minima.
Algorithms have provable convergence guarantees on convex problems. Linear programming, quadratic programming, convex SVMs, logistic regression, all are convex and all admit efficient, reliable algorithms.
Most deep learning losses are not convex, and that is why theory there is much harder. But many sub-problems within ML pipelines are convex (e.g., per-layer least-squares solves), and those benefit from the full power of convex optimisation theory.
Is \(f(x) = e^x\) convex? Is \(f(x) = -\log(x)\) on \(x > 0\) convex? Is \(f(x) = x^4 - x^2\) convex?
47. KKT conditions
For a constrained optimisation problem with both inequality constraints \(g_i(\mathbf{x}) \le 0\) and equality constraints \(h_j(\mathbf{x}) = 0\), form the Lagrangian
The KKT conditions are necessary conditions for \(\mathbf{x}^*\) to be a local minimum (under mild constraint qualifications). They are:
- Stationarity. \(\nabla_\mathbf{x} \mathcal{L}(\mathbf{x}^*, \boldsymbol{\mu}^*, \boldsymbol{\lambda}^*) = \mathbf{0}\).
- Primal feasibility. \(g_i(\mathbf{x}^*) \le 0\) and \(h_j(\mathbf{x}^*) = 0\) for all \(i, j\).
- Dual feasibility. \(\mu_i^* \ge 0\) for all \(i\). (Equality multipliers \(\lambda_j\) can be any sign.)
- Complementary slackness. \(\mu_i^* g_i(\mathbf{x}^*) = 0\) for all \(i\). Either the constraint is tight (\(g_i = 0\)) or the multiplier is zero.
Complementary slackness is the key insight. It says that an inactive constraint (one that is not tight at the optimum) contributes nothing to the optimum. The only constraints that "matter" are the active ones, and their multipliers measure how much they are pulling.
Why KKT shows up in SVMs
In support vector machines (Part XI), the optimum is characterised by KKT conditions. Most data points have zero multiplier (they are away from the margin and their constraints are inactive). Only the points right on the margin, the support vectors, have non-zero multipliers and contribute to the decision boundary. This is the entire reason the algorithm is called "support vector" machine, and the structural insight comes from KKT.
Sufficient conditions
For convex problems, KKT is also sufficient: any point satisfying KKT is a global minimum. For non-convex problems, KKT is only necessary; you also need to check second-order conditions to confirm a minimum.
That closes Part VI. You now have the optimisation toolbox that almost every ML training algorithm sits on top of. The next part connects the dots: how do you go from "I have data and a question" to "I have a model and an optimisation problem to solve."
For the constrained problem "minimise \(x^2 + y^2\) subject to \(x + y \ge 1\)," write down the KKT conditions and solve them.
Part VII. When models meet data
From assumption to answer.
A machine learning algorithm is a procedure that takes a dataset and produces a model. The bridge between the two is a probabilistic story about how the data was generated, plus a principle for picking the parameters that best explain it. Maximum likelihood, maximum a posteriori, and full Bayesian inference are three increasingly cautious answers to the same question.
48. Models, data, parameters
A model is a parametric family of probability distributions. Each setting of the parameters \(\boldsymbol{\theta}\) gives one specific distribution \(p(\mathbf{x} \mid \boldsymbol{\theta})\). The job of training is to find the \(\boldsymbol{\theta}\) that best fits the data.
Examples to keep in mind:
- A Gaussian model with unknown mean and variance. Parameters: \(\boldsymbol{\theta} = (\mu, \sigma^2)\). Each setting gives a different bell curve.
- A linear regression model. Parameters: weights \(\mathbf{w}\) and noise variance \(\sigma^2\). Each setting gives a different line plus Gaussian noise.
- A neural network. Parameters: all the weights and biases of all the layers, often millions of numbers. Each setting gives a different function from inputs to outputs.
Once you have a parametric family and a dataset \(\mathcal{D} = \{\mathbf{x}_1, \ldots, \mathbf{x}_N\}\), three principles give you three different ways to pick the parameters.
iid assumption
Almost universally, you assume the data points are iid: drawn independently from the same distribution. Under this assumption, the joint likelihood factors:
Taking a log turns the product into a sum, which is much easier to differentiate and optimise:
This is the log-likelihood and it is the function you will spend most of ML maximising or minimising the negative of.
Why is it almost always more convenient to work with the log of the likelihood than the likelihood itself?
49. Maximum likelihood
The simplest principle for choosing parameters: pick the \(\boldsymbol{\theta}\) that makes the observed data most likely.
This is maximum likelihood estimation (MLE). Pick the parameters under which the data is least surprising. It is what almost every ML training procedure does, even when not framed this way explicitly.
Examples
MLE for a coin's bias. Observe \(s\) heads in \(n\) flips. Likelihood: \(\theta^s (1-\theta)^{n-s}\). Take log, differentiate, set to zero, solve. Answer: \(\theta_{\text{MLE}} = s/n\). The empirical fraction. The MLE is the obvious thing.
MLE for the mean and variance of a Gaussian. Likelihood: product of \(\mathcal{N}(x_n \mid \mu, \sigma^2)\). Maximise. Answers: \(\mu_{\text{MLE}}\) is the sample mean, \(\sigma^2_{\text{MLE}}\) is the sample variance (using \(N\) in the denominator, not \(N-1\)).
MLE for linear regression with Gaussian noise. Maximising the likelihood is exactly minimising the squared-error loss. The MLE solution is the least-squares solution. We will work this out fully in Part VIII.
Why MLE is the default
Three properties make MLE attractive:
- Consistency. Under standard conditions, as \(N \to \infty\), \(\boldsymbol{\theta}_{\text{MLE}}\) converges to the true parameter (assuming the model is correctly specified).
- Asymptotic efficiency. Among all consistent estimators, MLE has the smallest variance for large \(N\). It is "as good as it gets" in the limit.
- Composability. The MLE for a complex model often decomposes into MLEs for sub-models, which makes it tractable to derive.
Where MLE fails
For small datasets, MLE can severely overfit. With one coin flip showing heads, the MLE bias is 1.0: the coin is certain to come up heads forever. This is obviously absurd, and the next module fixes it by adding a prior.
Derive the MLE for the rate parameter \(\lambda\) of a Poisson distribution given counts \(x_1, \ldots, x_N\).
50. Maximum a posteriori
Bayes' theorem says you can combine prior beliefs with the likelihood to get a posterior. The simplest Bayesian-flavoured estimator picks the mode of the posterior:
The evidence \(p(\mathcal{D})\) is constant in \(\boldsymbol{\theta}\), so it does not affect the argmax. Maximising the posterior is maximising likelihood times prior, or equivalently maximising log-likelihood plus log-prior.
MAP as regularised MLE
The deep insight: MAP is just MLE with a regularisation term. The log of the prior contributes a penalty on extreme parameter values, pulling the estimate toward whatever the prior considers plausible.
For example: if you put a Gaussian prior \(\boldsymbol{\theta} \sim \mathcal{N}(\mathbf{0}, \tau^2 \mathbf{I})\) on the parameters of any model, the log-prior is \(-\|\boldsymbol{\theta}\|^2 / (2\tau^2)\) plus a constant. MAP becomes
That is exactly L2 regularisation. Ridge regression is MAP linear regression with a Gaussian prior. Weight decay in neural networks is MAP under a Gaussian prior. The "regularisation strength" is set by the prior variance \(\tau^2\): tighter prior means stronger regularisation.
Similarly, an L1 (Laplace) prior on the parameters yields LASSO regression and sparsity. Different priors correspond to different regularisers, and the regulariser can always be read off as the negative log-prior.
Limitations
MAP gives a single point estimate. It tells you the most likely parameter, but not how uncertain you are. For a multi-modal or skewed posterior, the MAP point may not be representative of where the mass is. For full uncertainty quantification you need the next module.
For the coin-flipping problem with a Beta\((\alpha, \beta)\) prior, derive the MAP estimate of \(\theta\). How does it compare to the MLE \(s/n\)?
51. Bayesian inference, the right way
MLE gives a single point. MAP gives a single point with a prior. Full Bayesian inference gives the entire posterior distribution \(p(\boldsymbol{\theta} \mid \mathcal{D})\) and uses it for everything.
Predictions
For a new input \(\mathbf{x}^*\), the predictive distribution is
You weight each parameter setting by its posterior probability and average the predictions. This is conceptually clean: the answer accounts for parameter uncertainty automatically. A model that is unsure about its parameters will have wider predictive intervals.
Why we do not always do this
The integral above is intractable for almost every interesting model. Two routes:
- Conjugate priors. When you can choose a prior that is conjugate to the likelihood (Module 40), the posterior is in the same family as the prior, and the predictive integral often has a closed form. Bayesian linear regression (Module 56) is the textbook example.
- Approximate inference. When closed forms are unavailable, you approximate. Two main families: Markov chain Monte Carlo (MCMC) draws samples from the posterior; variational inference fits a tractable distribution to approximate the posterior. Both are full subjects in their own right.
The point estimates revisited
MLE is what you get if you take the limit of MAP as the prior becomes uninformative (very wide). MAP is the mode of the posterior. The full Bayesian predictive is the average over the posterior. As the dataset grows, all three converge to the same answer (under standard regularity conditions). For small data, they can differ substantially, and the full Bayesian answer is generally the right one if you can afford to compute it.
That closes Part VII. The framework is now in place: data, model, principle. The next four parts each apply this framework to one specific problem: regression, dimensionality reduction, density estimation, and classification.
Why does the Bayesian predictive distribution typically have wider tails than the MAP-based plug-in prediction? When is this a feature, not a bug?
Part VIII. Linear regression
The simplest predictor.
Linear regression is the model where every other ML algorithm should be compared. It is fast, has a closed-form solution, has a clean Bayesian extension, and fits real problems surprisingly often. Understanding it deeply is more useful than learning twenty fancier models superficially.
52. Setting up the problem
You have \(N\) data points, each consisting of an input vector \(\mathbf{x}_n \in \mathbb{R}^D\) and a real-valued target \(y_n\). You want a function \(f(\mathbf{x})\) that predicts \(y\) from \(\mathbf{x}\). Linear regression chooses \(f\) to be a linear function of the inputs:
The weight vector \(\mathbf{w} \in \mathbb{R}^D\) and the bias \(b \in \mathbb{R}\) are the parameters. To absorb the bias into the weight vector, you append a constant 1 to every input: \(\tilde{\mathbf{x}} = (\mathbf{x}^\top, 1)^\top\). Then \(f(\mathbf{x}) = \mathbf{w}^\top \tilde{\mathbf{x}}\) with the bias as the last entry of \(\mathbf{w}\). For the rest of this part, we will work with this form and drop the tilde.
Stacking the data
Pack the inputs into the rows of a design matrix \(\mathbf{X} \in \mathbb{R}^{N \times D}\) and the targets into a vector \(\mathbf{y} \in \mathbb{R}^N\). The vector of predicted outputs is
The residuals are \(\mathbf{r} = \mathbf{y} - \hat{\mathbf{y}}\). The squared error loss is
Linear regression in its simplest form is: pick \(\mathbf{w}\) to minimise this loss. The minimum is called the least squares estimate.
Why linear?
Linear models have three things going for them. First, the optimisation has a closed-form solution (no iteration needed). Second, they admit a clean probabilistic interpretation as Gaussian likelihood. Third, they are the building block of nearly every more sophisticated model: a neural network is essentially many small linear regressions stacked with non-linearities between them.
If you have 100 data points each with 5 features, what is the shape of the design matrix \(\mathbf{X}\)? What about the weight vector \(\mathbf{w}\) (with the bias absorbed)?
53. Least squares as MLE
The squared error loss is not chosen arbitrarily. It comes from a probabilistic story.
Assume the targets are generated by
The targets are linear in the inputs plus Gaussian noise. The likelihood of one data point is
Take the log of the iid joint likelihood:
Maximising this with respect to \(\mathbf{w}\) is equivalent to minimising the squared error sum, since everything else is constant in \(\mathbf{w}\). So least squares is maximum likelihood under Gaussian noise. The squared error loss is not a free parameter; it is the consequence of the Gaussian assumption.
This connects Part VII directly to Part VIII. If your noise model is something other than Gaussian, the appropriate loss is something other than squared error. Laplace noise gives \(\ell_1\) loss (least absolute deviations). Heavy-tailed noise (e.g., Student's t) gives a robust loss that down-weights outliers. The loss is not arbitrary; it encodes a noise assumption.
If you assume the noise is Laplace distributed (heavy-tailed) instead of Gaussian, what loss function do you minimise? Why does it correspond to a different fit when the data has outliers?
54. The normal equations
Minimise \(L(\mathbf{w}) = \|\mathbf{y} - \mathbf{X}\mathbf{w}\|^2\) by setting the gradient to zero.
Rearrange to get
These are the normal equations. When \(\mathbf{X}^\top \mathbf{X}\) is invertible (which happens when \(\mathbf{X}\) has full column rank, equivalently when the features are not perfectly collinear and \(N \ge D\)), the solution is
Closed form. No iteration, no learning rate. One linear-algebra computation gives you the optimum.
Numerical caution
You almost never compute \((\mathbf{X}^\top \mathbf{X})^{-1}\) explicitly. The matrix \(\mathbf{X}^\top \mathbf{X}\) has condition number equal to the squared condition number of \(\mathbf{X}\), which can be very large and lose precision. The numerically stable approach is to solve the normal equations using QR decomposition of \(\mathbf{X}\) or, even better, SVD.
If \(\mathbf{X} = \mathbf{Q}\mathbf{R}\) (QR decomposition), the solution is \(\mathbf{w}^* = \mathbf{R}^{-1} \mathbf{Q}^\top \mathbf{y}\), computed by triangular back-substitution. If \(\mathbf{X} = \mathbf{U}\mathbf{\Sigma}\mathbf{V}^\top\) (SVD), the solution is \(\mathbf{w}^* = \mathbf{V}\mathbf{\Sigma}^+ \mathbf{U}^\top \mathbf{y}\), where \(\mathbf{\Sigma}^+\) is the pseudoinverse (transpose with non-zero singular values inverted). The SVD form also handles the rank-deficient case, returning the minimum-norm solution.
When N is much smaller than D
If you have more features than data points (\(D > N\)), the matrix \(\mathbf{X}^\top \mathbf{X}\) is rank-deficient and the normal equations have infinitely many solutions. This is the under-determined regime, which is where modern ML lives (over-parameterised neural networks). The cure is regularisation, which is the next module.
The cost of solving the normal equations exactly is \(O(D^2 N + D^3)\). For very large \(D\), this is impractical. Suggest an alternative algorithm.
55. Regularisation
For ill-posed problems (more features than data, or near-collinear features), the unregularised solution is unstable. Tiny changes in the data produce huge changes in \(\mathbf{w}\). The fix: add a penalty that discourages large weights.
Ridge regression (L2)
Add a squared-norm penalty:
The hyperparameter \(\lambda > 0\) controls the strength. The solution is
Adding \(\lambda \mathbf{I}\) to \(\mathbf{X}^\top \mathbf{X}\) makes the matrix invertible even when the original is rank-deficient. It also shrinks the weights toward zero, which reduces variance at the cost of introducing some bias. As \(\lambda \to 0\) you recover ordinary least squares; as \(\lambda \to \infty\) the weights go to zero.
Recall from Module 50: ridge regression is exactly MAP linear regression with a zero-mean Gaussian prior on the weights. The prior variance \(\tau^2\) and the regularisation strength relate as \(\lambda = \sigma^2 / \tau^2\).
LASSO (L1)
Replace the squared norm with an L1 norm:
The L1 penalty has a famous property: it produces sparse solutions. Many weights are pushed to exactly zero, which acts as automatic feature selection. The solution has no closed form but is found by efficient iterative algorithms (coordinate descent, FISTA).
The geometric reason for sparsity: the L1 ball has corners at the axes. Optimising a smooth loss subject to an L1 budget tends to land at one of those corners, where many coordinates are exactly zero.
Picking the regularisation strength
\(\lambda\) is a hyperparameter. You set it by cross-validation: try several values, evaluate held-out error, pick the one with the lowest.
If your dataset has 1000 features but only 100 data points, why is some kind of regularisation unavoidable for linear regression?
56. Bayesian linear regression
Linear regression with a Gaussian likelihood and a Gaussian prior on the weights is the canonical example of a fully analytically tractable Bayesian model.
The setup
Likelihood: \(p(y_n \mid \mathbf{x}_n, \mathbf{w}) = \mathcal{N}(y_n \mid \mathbf{w}^\top \mathbf{x}_n, \sigma^2)\). Prior: \(p(\mathbf{w}) = \mathcal{N}(\mathbf{w} \mid \mathbf{0}, \tau^2 \mathbf{I})\). Both Gaussian.
The posterior
By Bayes' theorem and Gaussian conjugacy, the posterior is also Gaussian:
with
The posterior mean \(\boldsymbol{\mu}_N\) is the ridge solution. The posterior covariance \(\boldsymbol{\Sigma}_N\) tells you how uncertain you are about the weights, which the point estimate cannot give you.
The predictive distribution
For a new input \(\mathbf{x}^*\), the predictive distribution is
Two terms in the predictive variance: the second term \(\sigma^2\) is the irreducible noise. The first term \(\mathbf{x}^{*\top} \boldsymbol{\Sigma}_N \mathbf{x}^*\) is the additional uncertainty from not knowing the weights exactly. It is small near where the training data is dense and large in regions far from the data, exactly as intuition would suggest.
This is what makes Bayesian linear regression special. You get not just a prediction but a calibrated uncertainty estimate, and the math drops out cleanly because of Gaussian conjugacy.
Connection to Gaussian processes
Lift Bayesian linear regression to an infinite-dimensional feature space (using a kernel) and you get a Gaussian process: a Bayesian non-parametric regression model that is exact, has uncertainty estimates, and is the basis of much of probabilistic machine learning. The GP is essentially "Bayesian linear regression but in a feature space defined implicitly by a kernel."
For Bayesian linear regression, what happens to the posterior variance as you collect more data? Why?
57. Geometry of least squares
One last view of linear regression, the most elegant. The least-squares solution is exactly the orthogonal projection of \(\mathbf{y}\) onto the column space of \(\mathbf{X}\).
Recall from Module 14: the projection of a vector \(\mathbf{y}\) onto the column space of a matrix \(\mathbf{X}\) (assumed full column rank) is
Notice this is exactly \(\mathbf{X} \mathbf{w}^*\), where \(\mathbf{w}^* = (\mathbf{X}^\top \mathbf{X})^{-1} \mathbf{X}^\top \mathbf{y}\) is the least-squares solution. So the predicted vector \(\hat{\mathbf{y}} = \mathbf{X}\mathbf{w}^*\) is the projection of the targets onto the span of the features. The residual \(\mathbf{y} - \hat{\mathbf{y}}\) is orthogonal to that span.
Why this view matters
Once you see least squares as a projection, several facts become obvious:
- The residuals are orthogonal to the features. Each feature column of \(\mathbf{X}\) is in the column space, and the residual is orthogonal to the column space, so the inner product is zero. Equivalently \(\mathbf{X}^\top (\mathbf{y} - \hat{\mathbf{y}}) = \mathbf{0}\), which is just the normal equations again.
- Adding a feature can only decrease the training error. Adding a column to \(\mathbf{X}\) makes the column space larger, and projecting onto a larger space gets you closer (or equal) to \(\mathbf{y}\). This is one reason regularisation matters: the training error is monotone in features but the test error is not.
- The projection matrix \(\mathbf{H} = \mathbf{X}(\mathbf{X}^\top \mathbf{X})^{-1} \mathbf{X}^\top\) is symmetric and idempotent (\(\mathbf{H}^2 = \mathbf{H}\)). It is called the "hat matrix" because it puts the hat on \(\mathbf{y}\): \(\hat{\mathbf{y}} = \mathbf{H}\mathbf{y}\).
That closes Part VIII. Linear regression is the simplest predictor. The next part takes the same techniques to a different problem: not predicting a target, but finding low-dimensional structure in the data itself.
If you fit ordinary least squares with one feature column \(\mathbf{x}\), the residuals satisfy \(\mathbf{x}^\top \mathbf{r} = 0\). Why does this make geometric sense?
Part IX. Dimensionality reduction with PCA
Finding the directions that matter.
High-dimensional data is hard to look at, hard to store, hard to fit models to, and full of noise. Principal Component Analysis is the simplest tool for reducing dimensionality without losing too much. It is a one-line algorithm with two equivalent derivations and a beautiful connection to the SVD.
58. The dimensionality problem
You have \(N\) data points in \(\mathbb{R}^D\), where \(D\) is large (hundreds, thousands, sometimes millions). Three things go wrong as \(D\) grows:
- Computational cost. Most algorithms scale at least linearly in \(D\), and many scale superlinearly. Operating on raw 1024×1024 images (\(D = 10^6\)) is impractical without reducing them somehow.
- Statistical cost. The "curse of dimensionality", in high dimensions, every point is far from every other, distances become uninformative, and you need exponentially more data to fill the space.
- Visualisation. You cannot see in 100 dimensions. Two or three is the limit.
The intuition behind PCA: real high-dimensional data usually lives near a much lower-dimensional subspace. A photograph of a face has 1024×1024 = 1M pixels, but the space of "actually plausible face images" is much smaller, perhaps a few hundred dimensions. Find that subspace and you compress the data without losing what matters.
What "matters" can mean
Two reasonable definitions of "the most important dimensions":
- Variance. Keep the directions along which the data varies most. Discarding low-variance directions throws away components that look like constants, which by definition carry little information.
- Reconstruction error. Project the data onto a low-dimensional subspace and ask: how close is the projection to the original? Pick the subspace that minimises this distance.
Surprisingly, both definitions give the same answer. We will derive it from each angle in the next two modules.
If your data has 1000 features but most are constant or near-constant, how many "real" dimensions does it have? What would PCA do with such data?
59. PCA as variance maximisation
Center the data so it has zero mean. Then ask: what direction \(\mathbf{u}_1\) (with \(\|\mathbf{u}_1\| = 1\)) maximises the variance of the projected data?
The projection of a centered data point \(\mathbf{x}\) onto \(\mathbf{u}_1\) is the scalar \(\mathbf{u}_1^\top \mathbf{x}\). The variance of this scalar across the dataset is
where \(\mathbf{S} = \frac{1}{N} \mathbf{X}^\top \mathbf{X}\) is the sample covariance matrix (with the data centered). Maximise this subject to \(\|\mathbf{u}_1\| = 1\) using Lagrange multipliers (Module 45):
Setting \(\partial \mathcal{L} / \partial \mathbf{u}_1 = 0\) gives \(\mathbf{S}\mathbf{u}_1 = \lambda \mathbf{u}_1\). The optimal \(\mathbf{u}_1\) is an eigenvector of \(\mathbf{S}\), and the variance it captures is the corresponding eigenvalue \(\lambda\). To maximise variance, pick the eigenvector with the largest eigenvalue.
The next directions
To find the direction that captures the second-most variance, you constrain it to be orthogonal to \(\mathbf{u}_1\). Repeating the Lagrangian argument, you find that \(\mathbf{u}_2\) is the eigenvector of \(\mathbf{S}\) with the second-largest eigenvalue. By induction, the \(k\)-th principal direction is the \(k\)-th eigenvector.
The algorithm
PCA in three lines:
- Center the data.
- Compute the covariance matrix \(\mathbf{S} = \frac{1}{N} \mathbf{X}^\top \mathbf{X}\).
- Take its eigendecomposition. The eigenvectors are the principal components, ordered by decreasing eigenvalue.
To project to \(K\) dimensions, stack the top \(K\) eigenvectors as columns of a matrix \(\mathbf{U}_K\) and compute \(\mathbf{Z} = \mathbf{X}\mathbf{U}_K\). The reduced representation is \(\mathbf{Z} \in \mathbb{R}^{N \times K}\).
If your data has covariance matrix \(\mathbf{S} = \begin{pmatrix} 4 & 0 \\ 0 & 1 \end{pmatrix}\), what are the principal components? What variance does each capture?
60. PCA as reconstruction error
The other derivation: pick a \(K\)-dimensional subspace such that projecting data onto it loses as little as possible.
Given an orthonormal basis \(\mathbf{u}_1, \ldots, \mathbf{u}_K\) of the candidate subspace, the projection of \(\mathbf{x}\) is
The squared reconstruction error of one point is \(\|\mathbf{x} - \tilde{\mathbf{x}}\|^2\). Summed over the dataset and minimised over the choice of basis, you get the same answer as variance maximisation.
Why the two views agree
Consider the identity
(the Pythagorean theorem, since \(\tilde{\mathbf{x}}\) and \(\mathbf{x} - \tilde{\mathbf{x}}\) are orthogonal). Sum over the dataset:
The left side is fixed. So minimising the reconstruction error is equivalent to maximising the variance of the projections. Same problem, two framings.
The reconstruction
To reconstruct in the original space, multiply the reduced coordinates by the basis matrix:
The map \(\mathbf{X} \to \tilde{\mathbf{X}}\) is the orthogonal projection onto the top-\(K\) principal subspace. \(\tilde{\mathbf{X}}\) is the rank-\(K\) approximation of \(\mathbf{X}\) (Module 22), and the singular values squared are the eigenvalues of the covariance.
If you keep only the first 2 principal components of 100-dimensional data, what fraction of the total variance do you preserve, in terms of the eigenvalues?
61. Computing PCA
Two ways to compute PCA in practice. Both give the same answer; one is faster in different regimes.
Eigendecomposition of the covariance matrix
The textbook approach. Cost: \(O(ND^2)\) to form \(\mathbf{S}\), then \(O(D^3)\) to eigendecompose. Good when \(D\) is small and \(N\) is large.
SVD of the data matrix
Take the SVD of the centered data: \(\mathbf{X} = \mathbf{U}\mathbf{\Sigma}\mathbf{V}^\top\). The principal components are the columns of \(\mathbf{V}\), ordered by singular value. The eigenvalues of \(\mathbf{S} = \mathbf{X}^\top \mathbf{X} / N\) are \(\sigma_i^2 / N\). Cost: \(O(\min(N D^2, D N^2))\). Better numerical stability since you never form \(\mathbf{X}^\top \mathbf{X}\), which squares the condition number.
For very large \(N\) and \(D\), neither full method is feasible. Iterative methods (power iteration, Lanczos) compute just the top few eigenvectors without ever forming the full decomposition. Most numerical libraries default to one of these for large-scale PCA.
The high-dimensional small-sample case
If \(D \gg N\) (e.g., 10000 features, 100 samples), the covariance matrix \(\mathbf{S}\) is \(D \times D\) and rank-deficient. There is a clever trick: compute the eigendecomposition of \(\mathbf{X}\mathbf{X}^\top\) (which is \(N \times N\)) and recover the eigenvectors of \(\mathbf{S}\) from those. This is the Gram-trick form of PCA, sometimes called the "dual" form, and it is what kernel PCA generalises.
For a 1000-dimensional dataset with 50 samples, would you eigendecompose the \(1000 \times 1000\) covariance or the \(50 \times 50\) Gram matrix? Why?
62. PCA via SVD
The SVD relationship is so useful it deserves its own module.
Centred data matrix \(\mathbf{X} \in \mathbb{R}^{N \times D}\). SVD: \(\mathbf{X} = \mathbf{U}\mathbf{\Sigma}\mathbf{V}^\top\). The columns of \(\mathbf{V}\) are the principal directions. The diagonal of \(\mathbf{\Sigma}\) is the singular values \(\sigma_i\), and the variance captured by the \(i\)-th principal component is \(\sigma_i^2 / N\). The reduced representation is \(\mathbf{Z} = \mathbf{X}\mathbf{V}_K = \mathbf{U}_K \mathbf{\Sigma}_K\), where \(\mathbf{U}_K\) and \(\mathbf{\Sigma}_K\) are truncations to the top \(K\).
Why the SVD form is preferred
- Numerically more stable. Forming \(\mathbf{X}^\top \mathbf{X}\) and then eigendecomposing squares the condition number; SVD on \(\mathbf{X}\) directly does not.
- Handles non-square data and rank-deficient cases without special casing.
- Connects directly to low-rank approximation (Module 22): the top-\(K\) PCA reconstruction is the best rank-\(K\) approximation in Frobenius norm.
Whitening
Once you have the principal directions, you can whiten the data by additionally rescaling each component to unit variance:
Whitened data has identity covariance: each component has unit variance and the components are uncorrelated. Useful as a preprocessing step before models that benefit from feature standardisation.
If you whiten data using PCA and then apply a rotation, do you still have whitened data? Why or why not?
63. Choosing the number of components
The unanswered question of PCA: how many components to keep? There is no single right answer, but several useful heuristics.
Variance-explained threshold
Pick the smallest \(K\) such that the first \(K\) components capture at least, say, 90% or 95% of the total variance:
The threshold is arbitrary, but the principle is sound: keep enough components that the reconstruction error is acceptably small.
The scree plot
Plot the eigenvalues in decreasing order. Often there is a visible "elbow" where the eigenvalues drop sharply, suggesting the number of "real" components and a tail of "noise" components. Cut at the elbow.
Cross-validation on a downstream task
If PCA is a preprocessing step for some downstream model (regression, classification), pick the \(K\) that minimises held-out error of the downstream model. This is the most principled approach when you have a clear objective.
Probabilistic PCA and the marginal likelihood
For a more rigorous Bayesian treatment, you can frame PCA as a probabilistic latent-variable model (probabilistic PCA, or PPCA) and compute the marginal likelihood as a function of \(K\). The maximum is a principled choice. The model is closely related to factor analysis.
That closes Part IX. PCA is the simplest dimensionality reduction technique; many others build on the same core idea (kernel PCA, ICA, autoencoders) by changing the assumption of linearity or the criterion. The next part moves from finding structure in unlabelled data to fitting an explicit probabilistic model to it.
If your scree plot shows eigenvalues 100, 90, 80, 5, 4, 3, 2, 1, ..., what does that suggest about the structure of the data and the right \(K\)?
Part X. Density estimation with Gaussian mixtures
One distribution made of many.
A single Gaussian is too simple to model most real data. A mixture of Gaussians is, in a precise sense, a universal approximator: any reasonable density can be matched arbitrarily closely by enough Gaussians. Fitting one is a beautiful application of the Expectation-Maximisation algorithm.
64. Why mixtures
Real data is rarely unimodal. Heights of people show two modes (one for women, one for men). Customer behaviour clusters into segments. Pixel intensities of natural images form complicated multi-peaked distributions. A single Gaussian cannot represent any of this, and forcing it to fit gives a mean and variance that capture nothing.
A mixture model is a weighted sum of simpler distributions:
\[ p(\mathbf{x}) = \sum_{k=1}^K \pi_k \, p_k(\mathbf{x}), \quad \sum_k \pi_k = 1, \quad \pi_k \ge 0. \]
The \(\pi_k\) are mixture weights: probabilities that a draw comes from component \(k\). Each \(p_k(\mathbf{x})\) is a component density, often Gaussian. The mixture is more flexible than any single component and can capture multi-modal, skewed, and irregular distributions.
Why Gaussians for the components?
Three reasons. First, fitting a single Gaussian has a closed form (just sample mean and covariance), so each component is cheap. Second, the multivariate Gaussian is parameterised by a mean and a covariance, which gives enough flexibility to capture local structure without too many parameters. Third, with enough Gaussians, you can approximate any density arbitrarily well.
Mixtures of other distributions exist (mixtures of Bernoullis for binary data, mixtures of multinomials for topic modelling), but the Gaussian case is the textbook one.
Why is a single Gaussian insufficient to model a dataset that has two clear clusters?
65. The GMM density
A Gaussian mixture model with \(K\) components has the density
\[ p(\mathbf{x} \mid \boldsymbol{\theta}) = \sum_{k=1}^K \pi_k \, \mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k), \]
where \(\boldsymbol{\theta} = \{\pi_k, \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k\}_{k=1}^K\) is the full parameter set: a weight, a mean, and a covariance matrix per component.
Fitting it: the obvious thing fails
You might think you could just take the log-likelihood and maximise it like you did for a single Gaussian. Try it:
\[ \log p(\mathbf{X} \mid \boldsymbol{\theta}) = \sum_{n=1}^N \log \!\left( \sum_{k=1}^K \pi_k \, \mathcal{N}(\mathbf{x}_n \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k) \right). \]
The log of a sum, not a sum of logs. The derivatives of this with respect to \(\boldsymbol{\mu}_k\) couple all the components together; you cannot solve the resulting equations in closed form. There is no analytic MLE for the GMM.
The latent variable trick
The way out is to introduce a latent (unobserved) discrete variable for each data point that says which component generated it. Conditioned on the latents, the problem decomposes into per-component Gaussian fits, which are easy. The trick is that we do not know the latents, so we have to estimate both the latents and the parameters together. That is what the EM algorithm does, and we get to it in two modules.
For a one-dimensional GMM with two components, write the density. How many parameters does it have?
66. Latent variable formulation
Introduce, for each data point \(\mathbf{x}_n\), a latent indicator \(z_n \in \{1, 2, \ldots, K\}\) that says which component generated it. The generative story is:
- Sample \(z_n\) from Categorical\((\boldsymbol{\pi})\): with probability \(\pi_k\), say "this point comes from component \(k\)."
- Sample \(\mathbf{x}_n\) from \(\mathcal{N}(\boldsymbol{\mu}_{z_n}, \boldsymbol{\Sigma}_{z_n})\).
Marginalising out \(z_n\) recovers the GMM density:
\[ p(\mathbf{x}_n) = \sum_k p(z_n = k) \, p(\mathbf{x}_n \mid z_n = k) = \sum_k \pi_k \, \mathcal{N}(\mathbf{x}_n \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k). \]
The complete-data log-likelihood
If you knew the \(z_n\), the log-likelihood would split into independent per-component pieces:
\[ \log p(\mathbf{X}, \mathbf{Z} \mid \boldsymbol{\theta}) = \sum_n \log \pi_{z_n} + \sum_n \log \mathcal{N}(\mathbf{x}_n \mid \boldsymbol{\mu}_{z_n}, \boldsymbol{\Sigma}_{z_n}). \]
Maximising this, if you knew the \(z_n\), is trivial: each component's parameters are estimated from its own assigned points by ordinary Gaussian MLE.
The problem is that you do not know the \(z_n\). The EM algorithm gets around this by alternating: given the parameters, compute the posterior over \(z_n\); given the posterior over \(z_n\), update the parameters. Each step is closed-form; the iterations converge.
If you somehow knew the true component assignments \(z_n\) for every data point, how would you compute the MLE for \(\boldsymbol{\mu}_k\) and \(\boldsymbol{\Sigma}_k\)?
67. The EM algorithm
The Expectation-Maximisation algorithm. Two steps, repeated until convergence.
E-step (Expectation)
Compute the posterior probability of each latent assignment given the current parameters. For each point \(n\) and each component \(k\):
\[ \gamma_{nk} = p(z_n = k \mid \mathbf{x}_n, \boldsymbol{\theta}) = \frac{\pi_k \, \mathcal{N}(\mathbf{x}_n \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)}{\sum_j \pi_j \, \mathcal{N}(\mathbf{x}_n \mid \boldsymbol{\mu}_j, \boldsymbol{\Sigma}_j)}. \]
This is just Bayes' theorem applied to the discrete latent. The numerator is the joint probability of the point and the assignment; the denominator normalises across all assignments. The \(\gamma_{nk}\) are called responsibilities: each data point is "softly assigned" to components, with point \(n\) splitting its weight across components in proportion to how plausible each is.
M-step (Maximisation)
With responsibilities fixed, update the parameters to maximise the expected complete-data log-likelihood. The updates have closed form:
\[ \pi_k = \frac{1}{N} \sum_n \gamma_{nk}, \quad \boldsymbol{\mu}_k = \frac{\sum_n \gamma_{nk} \mathbf{x}_n}{\sum_n \gamma_{nk}}, \]
\[ \boldsymbol{\Sigma}_k = \frac{\sum_n \gamma_{nk} (\mathbf{x}_n - \boldsymbol{\mu}_k)(\mathbf{x}_n - \boldsymbol{\mu}_k)^\top}{\sum_n \gamma_{nk}}. \]
Each update is a weighted version of the standard Gaussian MLE, with the weights being the responsibilities. \(\sum_n \gamma_{nk}\) is the "effective number of points" assigned to component \(k\).
Why it works
Each iteration cannot decrease the log-likelihood. The proof is via Jensen's inequality: the EM objective is a tight lower bound on the log-likelihood that you alternately tighten (E-step) and maximise (M-step). The algorithm converges to a local maximum of the log-likelihood, though not necessarily the global one.
K-means as a special case
If you fix all covariances to \(\sigma^2 \mathbf{I}\) and let \(\sigma^2 \to 0\), the responsibilities become "hard" (1 for the closest mean, 0 elsewhere) and EM reduces to K-means clustering. K-means is the limiting case of EM on a mixture of identical isotropic Gaussians.
Why is EM guaranteed to converge to a local maximum, not necessarily the global maximum, of the log-likelihood?
68. Convergence and pitfalls
EM is reliable but has a few failure modes worth knowing.
Local maxima
EM gets stuck at local maxima. Different initialisations give different fits. The standard remedy is to run EM many times from random starting points and keep the result with the highest likelihood. K-means++ initialisation (a smart way to spread initial means apart) is also widely used.
Singular covariances
If a component's covariance matrix becomes singular (a very flat Gaussian aligned with a few data points), the likelihood diverges to infinity. Pure MLE is broken in this regime: you can keep tightening the covariance around fewer and fewer points and the likelihood keeps increasing. Workarounds:
- Add a small regularisation \(\epsilon \mathbf{I}\) to each covariance update, keeping it well-conditioned.
- Use a Bayesian formulation with a prior on covariances that prevents collapse.
- Restart the offending component from random data points.
How many components?
Like in PCA, choosing \(K\) is a model-selection question. Approaches: cross-validation on held-out likelihood, information criteria like AIC or BIC, or fully Bayesian alternatives like Dirichlet process mixtures that learn \(K\) from the data.
Computational cost
Each EM iteration is \(O(N K D^2)\) for full covariances. For high-dimensional data, full covariance per cluster is too many parameters; use diagonal or shared covariance structures, or apply PCA first to reduce \(D\).
That closes Part X. The Gaussian mixture model is the simplest interesting probabilistic model and EM is one of the most useful general algorithms in ML. Almost any problem with hidden structure (clustering, missing data, hidden Markov models, factor analysis) can be tackled with the same alternating-optimisation idea. The next part is the last canonical ML algorithm: classification with support vector machines.
You initialise GMM with three random points but the data has two clear clusters. What is likely to happen during EM?
Part XI. Classification with SVMs
The maximum-margin idea.
Classification is the problem of assigning each input to one of several discrete categories. The support vector machine is a classifier built on a single principle: separate the classes with the widest possible buffer. It has a beautiful geometric derivation, a clean dual formulation, and the kernel trick for handling non-linear boundaries.
69. The classification problem
You have data points \((\mathbf{x}_n, y_n)\) where \(\mathbf{x}_n \in \mathbb{R}^D\) and \(y_n \in \{-1, +1\}\) (the binary case). You want a function \(f(\mathbf{x})\) such that \(f(\mathbf{x}_n)\) and \(y_n\) agree on as many training points as possible, and the function generalises to new points.
For a linear classifier, the function takes the form
\[ f(\mathbf{x}) = \text{sign}(\mathbf{w}^\top \mathbf{x} + b). \]
The set \(\{\mathbf{x} : \mathbf{w}^\top \mathbf{x} + b = 0\}\) is a hyperplane in \(\mathbb{R}^D\), and the classifier predicts +1 on one side and \(-1\) on the other. Training the classifier means finding \(\mathbf{w}\) and \(b\).
Linearly separable case
If there exists a hyperplane that perfectly separates the two classes, the data is linearly separable. In this case, infinitely many separating hyperplanes exist. The question becomes: which one is best?
Intuition: the hyperplane that has the largest margin to the closest points on either side. A wide buffer means the classifier is robust: small perturbations of the inputs do not flip the prediction. The SVM picks this hyperplane.
If your data is in 2D and you have +1 points at (1, 1), (2, 2) and -1 points at (-1, -1), (-2, -2), what is a separating hyperplane? Is there only one?
70. Separating hyperplanes and the margin
For a hyperplane \(\mathbf{w}^\top \mathbf{x} + b = 0\), the signed distance from a point \(\mathbf{x}\) to the hyperplane is \((\mathbf{w}^\top \mathbf{x} + b) / \|\mathbf{w}\|\). The distance is positive on the +1 side and negative on the -1 side.
For a labeled point \((\mathbf{x}_n, y_n)\), the quantity \(y_n(\mathbf{w}^\top \mathbf{x}_n + b) / \|\mathbf{w}\|\) is the signed margin. It is positive when the point is on the correct side and negative when misclassified. The minimum signed margin over the dataset is the geometric margin:
\[ \rho = \min_n \frac{y_n(\mathbf{w}^\top \mathbf{x}_n + b)}{\|\mathbf{w}\|}. \]
The maximum-margin classifier picks \(\mathbf{w}\) and \(b\) to maximise \(\rho\) subject to the data being correctly classified.
Why maximise margin?
Three justifications:
- Robustness. A wide-margin classifier is less sensitive to small perturbations in the input.
- Generalisation. Vapnik-Chervonenkis theory shows that wide-margin hyperplanes have a low VC dimension on bounded data, which implies better generalisation bounds.
- Uniqueness. Among the infinitely many separating hyperplanes for separable data, the maximum-margin one is unique.
This is the principle the rest of the part formalises into a tractable optimisation problem.
For two points at (-1, 0) and (1, 0) with labels -1 and +1, what is the maximum-margin separating hyperplane? What is the margin?
71. The primal SVM
To turn the maximum-margin idea into a tractable problem, we exploit a freedom: the same hyperplane is described by infinitely many \((\mathbf{w}, b)\) (any positive scaling). Fix the scaling so that the closest points have margin exactly 1: \(y_n (\mathbf{w}^\top \mathbf{x}_n + b) \ge 1\) for all \(n\), with equality on the closest points. The geometric margin is then \(1/\|\mathbf{w}\|\), so maximising it is equivalent to minimising \(\|\mathbf{w}\|\), or its square.
The hard-margin SVM
For linearly separable data, the SVM problem in primal form is:
\[ \min_{\mathbf{w}, b} \tfrac{1}{2} \|\mathbf{w}\|^2 \quad \text{subject to} \quad y_n(\mathbf{w}^\top \mathbf{x}_n + b) \ge 1 \text{ for all } n. \]
This is a quadratic programming problem with linear constraints. It has a unique solution (since the objective is strictly convex and the feasible set is convex), and standard QP solvers can find it efficiently.
Why the constraints look this way
The constraint \(y_n(\mathbf{w}^\top \mathbf{x}_n + b) \ge 1\) does two things at once. First, the sign forces the point to be on the correct side: \(y_n\) and \(\mathbf{w}^\top \mathbf{x}_n + b\) must have the same sign. Second, the magnitude forces it to be at least one "margin unit" away from the boundary. Points on the margin satisfy the constraint with equality; points further away satisfy it strictly.
What if the data is not separable?
The hard-margin formulation has no solution if no separating hyperplane exists. For real data this is the rule, not the exception. The soft-margin SVM (Module 74) introduces slack variables that allow some points to violate the margin at a cost. For now, focus on the separable case.
For the SVM primal, why is minimising \(\frac{1}{2}\|\mathbf{w}\|^2\) (instead of \(\|\mathbf{w}\|\)) more convenient mathematically? Hint: think about derivatives.
72. The dual SVM
The Lagrangian of the primal SVM is
\[ \mathcal{L}(\mathbf{w}, b, \boldsymbol{\alpha}) = \tfrac{1}{2}\|\mathbf{w}\|^2 - \sum_n \alpha_n \big[ y_n(\mathbf{w}^\top \mathbf{x}_n + b) - 1 \big], \quad \alpha_n \ge 0. \]
Differentiating with respect to \(\mathbf{w}\) and \(b\) and setting to zero gives
\[ \mathbf{w} = \sum_n \alpha_n y_n \mathbf{x}_n, \quad \sum_n \alpha_n y_n = 0. \]
The first says the optimal weight vector is a linear combination of the training inputs, weighted by their labels and the multipliers. Substituting back into the Lagrangian gives the dual problem:
\[ \max_{\boldsymbol{\alpha} \ge 0} \, \sum_n \alpha_n - \tfrac{1}{2} \sum_{n, m} \alpha_n \alpha_m y_n y_m \mathbf{x}_n^\top \mathbf{x}_m, \quad \text{subject to } \sum_n \alpha_n y_n = 0. \]
Two things have happened. First, the primal optimisation over \(D+1\) variables has become a dual optimisation over \(N\) variables (one per data point). Second, and crucially, the data only appears through inner products \(\mathbf{x}_n^\top \mathbf{x}_m\). This is the doorway to kernels.
Support vectors
By KKT conditions (Module 47), at the optimum, complementary slackness forces \(\alpha_n = 0\) for points with strict inequality (interior points), and \(\alpha_n > 0\) only for points on the margin. The points with \(\alpha_n > 0\) are the support vectors. They are the only ones that matter; everything else can be discarded after training, and the classifier is unchanged. This is where the algorithm gets its name.
For typical separable data, the number of support vectors is small (a handful), making SVMs efficient at prediction time even when \(N\) is large.
Predicting
Given a new input \(\mathbf{x}^*\), the prediction is
\[ f(\mathbf{x}^*) = \text{sign}\!\left( \sum_{n : \alpha_n > 0} \alpha_n y_n \mathbf{x}_n^\top \mathbf{x}^* + b \right). \]
Inner products with the support vectors only.
Why is the dual SVM particularly useful when \(D\) (the input dimension) is much larger than \(N\) (the number of data points)?
73. Kernels and the kernel trick
The dual SVM uses inner products only. This is the key that unlocks non-linear classification.
Suppose you map your inputs to a higher-dimensional space via some feature map \(\phi : \mathbb{R}^D \to \mathbb{R}^M\), with \(M\) possibly enormous or infinite. In the feature space, you do linear SVM. The dual problem now has \(\phi(\mathbf{x}_n)^\top \phi(\mathbf{x}_m)\) where it had \(\mathbf{x}_n^\top \mathbf{x}_m\). If \(M\) is huge, computing \(\phi\) explicitly is expensive.
The kernel trick: define a function \(k(\mathbf{x}, \mathbf{x}') = \phi(\mathbf{x})^\top \phi(\mathbf{x}')\) that computes the inner product in feature space without ever forming \(\phi\). For some choices of \(\phi\), this function has a closed form that is far cheaper than computing \(\phi\) and the inner product.
Common kernels
- Linear kernel. \(k(\mathbf{x}, \mathbf{x}') = \mathbf{x}^\top \mathbf{x}'\). The trivial case; you are doing ordinary linear SVM.
- Polynomial kernel. \(k(\mathbf{x}, \mathbf{x}') = (\mathbf{x}^\top \mathbf{x}' + c)^d\). Implicit feature space contains all polynomials of degree at most \(d\). Useful for moderate non-linearity.
- RBF (Gaussian) kernel. \(k(\mathbf{x}, \mathbf{x}') = \exp(-\|\mathbf{x} - \mathbf{x}'\|^2 / (2\sigma^2))\). Implicit feature space is infinite-dimensional. The most popular kernel for general-purpose SVM.
- Sigmoid kernel. \(k(\mathbf{x}, \mathbf{x}') = \tanh(a\mathbf{x}^\top \mathbf{x}' + c)\). Loosely related to neural networks. Used historically, less common now.
Mercer's theorem
Not every function \(k(\cdot, \cdot)\) is a valid kernel. Mercer's theorem says \(k\) is a valid kernel (corresponds to an inner product in some feature space) if and only if it is symmetric and positive semi-definite (the kernel matrix \(K_{nm} = k(\mathbf{x}_n, \mathbf{x}_m)\) is symmetric PSD for any choice of points). When you design a custom kernel, you check this condition.
The kernelised SVM
Replace every \(\mathbf{x}_n^\top \mathbf{x}_m\) in the dual with \(k(\mathbf{x}_n, \mathbf{x}_m)\). The optimisation is unchanged in structure, but the implicit feature space can be very high-dimensional. Predictions on new inputs use \(k(\mathbf{x}_n, \mathbf{x}^*)\) in place of \(\mathbf{x}_n^\top \mathbf{x}^*\). The algorithm never sees the explicit feature space at all.
The kernel SVM was, for about 15 years, the dominant general-purpose classifier. It has since been surpassed by deep neural networks for most tasks where data is abundant, but it remains the right choice for small-to-medium datasets and is a clean illustration of the power of the dual formulation.
What does the implicit feature space of an RBF kernel look like? Why is it infinite-dimensional?
74. Soft margins
Real data is rarely linearly separable, even after kernelisation. The hard-margin SVM has no solution in this case. The fix: allow margin violations, but charge for them.
Slack variables
Introduce a non-negative slack variable \(\xi_n \ge 0\) for each point and modify the constraint:
\[ y_n(\mathbf{w}^\top \mathbf{x}_n + b) \ge 1 - \xi_n, \quad \xi_n \ge 0. \]
If a point satisfies the original constraint (correctly classified, on the right side of the margin), \(\xi_n = 0\). If it is on the wrong side of the margin or misclassified, \(\xi_n > 0\) by the amount of the violation.
The penalised objective
Add a penalty for the slacks to the objective:
\[ \min_{\mathbf{w}, b, \boldsymbol{\xi}} \tfrac{1}{2}\|\mathbf{w}\|^2 + C \sum_n \xi_n. \]
The hyperparameter \(C > 0\) controls the trade-off. Small \(C\): tolerate violations cheaply, focus on a wide margin (more bias, less variance). Large \(C\): violations are expensive, the model tries hard to fit (less bias, more variance). \(C \to \infty\) recovers the hard-margin SVM.
The dual of the soft-margin SVM is identical to the hard-margin dual except that the constraints on \(\boldsymbol{\alpha}\) become \(0 \le \alpha_n \le C\). The inner product structure is preserved, so the kernel trick still works.
Hinge loss connection
Eliminating the slacks (\(\xi_n = \max(0, 1 - y_n(\mathbf{w}^\top \mathbf{x}_n + b))\)) gives the unconstrained form:
\[ \min_{\mathbf{w}, b} \tfrac{1}{2}\|\mathbf{w}\|^2 + C \sum_n \max(0, 1 - y_n(\mathbf{w}^\top \mathbf{x}_n + b)). \]
The second term is the hinge loss. It is zero when the point is correctly classified with margin at least 1, and grows linearly with the violation otherwise. This view connects the SVM to the modern picture of supervised learning as "minimise a loss plus a regulariser", the hinge loss is the SVM's loss of choice, and \(\|\mathbf{w}\|^2\) is its L2 regulariser.
Where to go from here
Multi-class SVMs combine binary SVMs (one-vs-all, one-vs-one). Regression SVMs (SVR) use an \(\epsilon\)-insensitive loss instead of hinge. Structured SVMs handle output spaces more complex than \(\{-1, +1\}\). All of them inherit the same dual structure, the same kernel trick, and the same support-vector sparsity.
That closes the note. You now have the math foundations and four worked applications. The combinations are nearly limitless: the same eigendecomposition tool that powers PCA also powers spectral clustering. The same EM algorithm that fits GMMs also fits hidden Markov models and topic models. The same kernel trick that powers SVM also powers Gaussian processes. Once you see the connections, every paper becomes a variant of something you already understand.