Setting the stage: objects & notation

Let \(\mathcal{X}\times \mathcal{Y}\) be the data space, split along an input-label axis. The hypothesis class \(\mathcal{F}\) is a collection of functions \(f \in \mathcal{F}\) of the form \[f : \mathcal{X}\times \Theta \to \mathcal{Y},\] where \(\Theta\) is the parameter space. For example, the hypothesis class could be given by a neural network with \(P\) weights (and biases); then \(\Theta = \mathbb{R}^P\) and \(f(\mathbf{\boldsymbol{x}}, \mathbf{\boldsymbol{\theta}})\) is the function defined by the network with parameters \(\mathbf{\boldsymbol{\theta}}\).
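
For concreteness, a minimal member of such a hypothesis class (the particular architecture here is only an illustration, not one fixed above) is a network with a single hidden layer of width \(H\): \[f(\mathbf{\boldsymbol{x}}, \mathbf{\boldsymbol{\theta}}) = \mathbf{\boldsymbol{W}}_2\, \sigma\!\left(\mathbf{\boldsymbol{W}}_1 \mathbf{\boldsymbol{x}} + \mathbf{\boldsymbol{b}}_1\right) + \mathbf{\boldsymbol{b}}_2, \qquad \mathbf{\boldsymbol{\theta}} = (\mathbf{\boldsymbol{W}}_1, \mathbf{\boldsymbol{b}}_1, \mathbf{\boldsymbol{W}}_2, \mathbf{\boldsymbol{b}}_2),\] where \(\sigma\) is an elementwise nonlinearity; stacking all entries of the weight matrices and bias vectors into a single vector recovers \(\Theta = \mathbb{R}^P\).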

A solution to the supervised learning problem is a parameter vector \(\mathbf{\boldsymbol{\theta}}^*\) such that \[f(\mathbf{\boldsymbol{x}}_i, \mathbf{\boldsymbol{\theta}}^*) \approx y_i\] for all given (training) data points \(\{(\mathbf{\boldsymbol{x}}_i, y_i)\}_{i = 1}^N \subseteq \mathcal{X}\times \mathcal{Y}\). The whole machine learning enterprise rests on the assumption, or rather the observation, that \(f\) also predicts the label of unseen data; i.e. \(f(\mathbf{\boldsymbol{x}}_{\text{new}}, \mathbf{\boldsymbol{\theta}}^*) \approx y_{\text{new}}\) for new points \((\mathbf{\boldsymbol{x}}_{\text{new}}, y_{\text{new}})\) assumed to come from the same distribution as the training data in \(\mathcal{X}\times \mathcal{Y}\). Whether this is possible depends heavily on the hypothesis class \(\mathcal{F}\). Functions describable by neural networks, and easily discoverable by the common gradient-based algorithms, have proven to work well for the data distributions we have collected and thrown at these machine learning models.
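
The informal "\(\approx\) for unseen data" statement is commonly made precise (a standard formalization, sketched here only for orientation, with \(\mathcal{D}\) denoting the data distribution and \(c\) the cost function introduced below) by asking that the expected error match the empirical one: \[\mathbb{E}_{(\mathbf{\boldsymbol{x}}, y) \sim \mathcal{D}}\!\left[c\big(f(\mathbf{\boldsymbol{x}}, \mathbf{\boldsymbol{\theta}}^*), y\big)\right] \approx \frac{1}{N} \sum_{i = 1}^N c\big(f(\mathbf{\boldsymbol{x}}_i, \mathbf{\boldsymbol{\theta}}^*), y_i\big).\]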

Finding this \(\mathbf{\boldsymbol{\theta}}^*\) is typically achieved by a variant of gradient descent on a loss function of the form \[\label{eq:lossfn} \ell(\mathbf{\boldsymbol{\theta}}) = \frac{1}{N} \sum_{i = 1}^N \ell_i(\mathbf{\boldsymbol{\theta}}) + R(\mathbf{\boldsymbol{\theta}}) \quad \text{ with } \quad \ell_i(\mathbf{\boldsymbol{\theta}}) = c(f(\mathbf{\boldsymbol{x}}_i, \mathbf{\boldsymbol{\theta}}), y_i),\] where \(R : \Theta \to \mathbb{R}_{\geq 0}\) is called the regularizer, responsible for biasing the search towards simpler functions (usually meaning parameters of smaller norm), and \(c : \mathcal{Y}\times \mathcal{Y}\to \mathbb{R}_{\geq 0}\) is called the cost function, measuring how far a prediction \(\hat y\) is from the true label \(y\) via \(c(\hat y, y)\) (usually with \(c(y, y) = 0\)).
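
As a concrete instance (the squared-error cost, the weight-decay strength \(\lambda \geq 0\), and the step size \(\eta > 0\) below are illustrative choices, not fixed by the text), take \(c(\hat y, y) = (\hat y - y)^2\) and \(R(\mathbf{\boldsymbol{\theta}}) = \lambda \|\mathbf{\boldsymbol{\theta}}\|_2^2\), so that the loss above becomes \[\ell(\mathbf{\boldsymbol{\theta}}) = \frac{1}{N} \sum_{i = 1}^N \big(f(\mathbf{\boldsymbol{x}}_i, \mathbf{\boldsymbol{\theta}}) - y_i\big)^2 + \lambda \|\mathbf{\boldsymbol{\theta}}\|_2^2,\] and plain gradient descent then iterates \[\mathbf{\boldsymbol{\theta}}_{t+1} = \mathbf{\boldsymbol{\theta}}_t - \eta \, \nabla_{\mathbf{\boldsymbol{\theta}}} \ell(\mathbf{\boldsymbol{\theta}}_t),\] with stochastic variants replacing the full average over \(i\) by a mini-batch.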