Mathematical Vignettes

Below are short expositions of machine learning topics, with a mathematical focus. The posts are not meant to be a first introduction to a topic (except perhaps for a mathematician), but rather to exhibit the way I think about these topics and what I personally focus on.

2025/08/10

Learn a probability distribution on the space of parameters so that whenever you sample from this distribution you get good models, and also not the same one over and over. That’s the goal.

We follow the notation of the setting: a parametrized model \(f\), parameter space \(\Theta\), loss function \(\ell : \Theta \to \mathbb{R}\), loss contributions \(\ell_i : \Theta \to \mathbb{R}\) for each labeled data point \((\boldsymbol{x}_i, y_i) \in \mathcal{X} \times \mathcal{Y}\), and maybe even a regularizer \(R : \Theta \to \mathbb{R}\).... Read full post →
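A minimal sketch of the goal, assuming (purely for illustration, since the post's actual construction is not shown here) a diagonal Gaussian distribution over \(\Theta = \mathbb{R}^P\) with placeholder mean and spread:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical learned distribution over parameters: a diagonal Gaussian
# with mean mu and per-coordinate standard deviation sigma (placeholders).
P = 4                      # number of parameters, Theta = R^P
mu = np.zeros(P)           # learned mean (placeholder values)
sigma = 0.1 * np.ones(P)   # learned spread (placeholder values)

def sample_theta():
    """Draw one parameter vector theta ~ N(mu, diag(sigma^2))."""
    return mu + sigma * rng.standard_normal(P)

# Each draw yields a different model; if the distribution is well learned,
# each draw should also yield a low-loss model.
thetas = [sample_theta() for _ in range(3)]
```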

2025/08/01

The differential of a loss function \(\,\mathrm{d}\ell(\boldsymbol{\theta})\) is a row vector, and the gradient \(\nabla\ell(\boldsymbol{\theta})\) is its transpose, a column vector. But what does this all mean?... Read full post →
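As a quick illustration (my own toy example, not from the post), the shapes can be made literal in numpy: the differential is a \(1 \times n\) row that eats directions, the gradient is its \(n \times 1\) transpose, and both produce the same directional derivative.

```python
import numpy as np

def loss(theta):
    """A simple quadratic loss on Theta = R^3."""
    return 0.5 * float(theta @ theta)

theta = np.array([1.0, 2.0, 3.0])
eps = 1e-6

# Finite-difference approximation of the differential d loss(theta):
# a row vector, shape (1, 3).
d_loss = np.array([[(loss(theta + eps * e) - loss(theta)) / eps
                    for e in np.eye(3)]])

# The gradient is its transpose: a column vector, shape (3, 1).
grad = d_loss.T

# Applied to a direction v, the differential and the inner product
# with the gradient give the same number.
v = np.array([1.0, 0.0, 0.0])
assert np.allclose(d_loss @ v, grad[:, 0] @ v)
```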

2025/06/15

The oft-repeated mantra goes as follows: “Gradient descent takes a step in the direction of steepest descent.” There is nothing wrong with this, but it needs to be put under the microscope.

For a loss function \(\ell : \Theta \to \mathbb{R}\) and a step size \(\alpha > 0\), the update rule is \[\boldsymbol{\theta} \leftarrow \boldsymbol{\theta} - \alpha \nabla \ell(\boldsymbol{\theta}).\] The intuitive picture is that we stand on a hilly landscape in a thick morning fog and want to go downhill. We can only sense the immediate steepness, so we take a step downhill along the negative gradient direction.... Read full post →
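A minimal sketch of the update rule on a toy quadratic loss (the loss and starting point are illustrative, not from the post):

```python
import numpy as np

def grad_loss(theta):
    """Gradient of the toy loss l(theta) = 0.5 * ||theta||^2."""
    return theta

alpha = 0.1                    # step size
theta = np.array([3.0, -2.0])  # arbitrary starting point

# Repeatedly step along the negative gradient.
for _ in range(100):
    theta = theta - alpha * grad_loss(theta)

print(theta)  # approaches the minimizer at the origin
```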

2025/06/05

Let \(\mathcal{X}\times \mathcal{Y}\) be the data space, split along an input-label axis. The hypothesis class is a collection of functions \(f \in \mathcal{F}\), \[f : \mathcal{X}\times \Theta \to \mathcal{Y}.\] For example, the hypothesis class could be the set of neural networks of a fixed architecture with \(P\) weights (and biases); then \(\Theta = \mathbb{R}^P\) and \(f(\boldsymbol{x}, \boldsymbol{\theta})\) would be the function computed by the network.... Read full post →
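To make this concrete, here is a minimal sketch with a one-hidden-layer network (the architecture and sizes are my own illustration, not from the post), where all weights and biases are flattened into a single \(\boldsymbol{\theta} \in \mathbb{R}^P\):

```python
import numpy as np

# A tiny one-hidden-layer network as a concrete f : X x Theta -> Y,
# with X = R^2, Y = R, and hidden width 3.
P = 2 * 3 + 3 + 3 * 1 + 1  # = 13 parameters, Theta = R^13

def f(x, theta):
    """Evaluate the network at input x with parameters theta."""
    W1 = theta[:6].reshape(3, 2)    # first-layer weights
    b1 = theta[6:9]                 # first-layer biases
    W2 = theta[9:12].reshape(1, 3)  # second-layer weights
    b2 = theta[12:]                 # second-layer bias
    h = np.tanh(W1 @ x + b1)        # hidden activations
    return (W2 @ h + b2)[0]

theta = np.random.default_rng(0).standard_normal(P)
y = f(np.array([0.5, -1.0]), theta)
```

Training then amounts to searching this \(\Theta = \mathbb{R}^{13}\) for a good \(\boldsymbol{\theta}\).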

This index was automatically generated on August 12, 2025 at 05:21 PM