The oft-repeated mantra goes as follows: “Gradient descent takes a step in the direction of steepest descent.” There is nothing wrong with this statement, but it needs to be put under the microscope.
For a loss function \(\ell : \Theta \to \mathbb{R}\) and a step size \(\alpha > 0\), the update algorithm is \[\label{eq:gradient_descent} \mathbf{\boldsymbol{\theta}} \leftarrow \mathbf{\boldsymbol{\theta}} - \alpha \nabla \ell(\mathbf{\boldsymbol{\theta}}).\] The intuitive picture is that we stand on a hilly landscape during a thick morning fog and want to go downhill. We can only sense the immediate steepness and take a step downhill along the negative gradient direction.
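As a concrete point of reference, here is a minimal sketch of this update rule; the NumPy dependency, the quadratic toy loss, and the function names are my own illustrative choices, not anything prescribed above.

```python
import numpy as np

def gradient_descent(grad, theta0, alpha=0.1, num_steps=100):
    """Iterate theta <- theta - alpha * grad(theta) for a fixed number of steps."""
    theta = np.asarray(theta0, dtype=float)
    for _ in range(num_steps):
        theta = theta - alpha * grad(theta)
    return theta

# Toy example: loss(theta) = 0.5 * ||theta||^2, whose gradient is theta itself.
theta_star = gradient_descent(grad=lambda th: th, theta0=[3.0, -2.0])
print(theta_star)  # approaches the minimizer at the origin
```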
The first—rather obvious—point is that \(\nabla \ell(\mathbf{\boldsymbol{\theta}})\) is local information about the loss landscape at \(\mathbf{\boldsymbol{\theta}}\). It contains more information about the loss landscape in the vicinity of \(\mathbf{\boldsymbol{\theta}}\) than the simple value \(\ell(\mathbf{\boldsymbol{\theta}})\) does; but absent some uniformity constraints on \(\ell\), no information about even the smallest neighborhood \(U \ni \mathbf{\boldsymbol{\theta}}\) can be gleaned. In other words—in full generality—there is nothing stopping \(\ell(\mathbf{\boldsymbol{\theta}} + \mathbf{\boldsymbol{\epsilon}}_0)\) from taking all kinds of crazy values for any fixed \(\mathbf{\boldsymbol{\epsilon}}_0\), no matter how well behaved or how gargantuan \(\nabla \ell(\mathbf{\boldsymbol{\theta}})\) is, and no matter how small the perturbation \(\mathbf{\boldsymbol{\epsilon}}_0\) is.
The knowledge we have is infinitesimal, meaning it only pertains to the tangent directions \(\mathbf{\boldsymbol{v}} \in T_{\mathbf{\boldsymbol{\theta}}}\Theta\), i.e. we only know about the directional derivatives.
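Spelled out, for a tangent vector \(\mathbf{\boldsymbol{v}} \in T_{\mathbf{\boldsymbol{\theta}}}\Theta\) this knowledge amounts to the directional derivative \[D_{\mathbf{\boldsymbol{v}}}\ell(\mathbf{\boldsymbol{\theta}}) = \lim_{t \to 0} \frac{\ell(\mathbf{\boldsymbol{\theta}} + t\mathbf{\boldsymbol{v}}) - \ell(\mathbf{\boldsymbol{\theta}})}{t} = \nabla \ell(\mathbf{\boldsymbol{\theta}})^\top \mathbf{\boldsymbol{v}},\] a statement about the limit \(t \to 0\), not about any finite step \(t\mathbf{\boldsymbol{v}}\).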
Thus the second point to ponder reveals itself. The question to which “the gradient!” is the answer is not a question about the function \(\ell\) on the whole manifold \(\Theta\), nor is it a question about \(\ell\) only at the point \(\mathbf{\boldsymbol{\theta}}\). It has to be a question on the tangent space: reverse engineering the problem to which [eq:gradient_descent] is the answer, we arrive at finding the direction of steepest descent of the linearization of \(\ell\) at \(\mathbf{\boldsymbol{\theta}}\), \[\begin{aligned} L:& T_{\mathbf{\boldsymbol{\theta}}}\Theta \rightarrow \mathbb{R}\\ &\mathbf{\boldsymbol{v}} \mapsto \ell(\mathbf{\boldsymbol{\theta}}) + \nabla \ell(\mathbf{\boldsymbol{\theta}})^\top \mathbf{\boldsymbol{v}}. \end{aligned}\] This is a function of vectors \(\mathbf{\boldsymbol{v}} \in T_{\mathbf{\boldsymbol{\theta}}}\Theta\) in the tangent space, and it is among these vectors that we will search.
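To make the distinction between \(\ell\) and its linearization tangible, here is a small sketch; the quartic toy loss and the specific numbers are illustrative assumptions, not taken from the text.

```python
import numpy as np

# Toy loss: ell(theta) = sum(theta^4), with gradient 4 * theta^3.
ell = lambda th: np.sum(th ** 4)
grad_ell = lambda th: 4 * th ** 3

theta = np.array([1.0, -0.5])
v = np.array([0.3, 0.2])

# Linearization L(v) = ell(theta) + grad(theta)^T v, a function on the tangent space.
L = ell(theta) + grad_ell(theta) @ v

print(ell(theta + v), L)  # the two agree only to first order in v
```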
The third point is hidden in the superlatives steepest/fastest: they mean that an optimization problem has been solved and an extremum found. We compare the rate of change of the linearized function \(L\) in various directions, but rate of change according to what!? We need a notion of unit distance if we are to compare the change in \(L\) when a unit distance has been traversed. Riemannian metrics give precisely that, i.e. lengths of vectors on tangent spaces. In fact they give a bit more: a Riemannian metric means there is a bilinear form on every tangent space of the manifold, \[\begin{aligned} \omega_{\mathbf{\boldsymbol{\theta}}} :& T_{\mathbf{\boldsymbol{\theta}}}\Theta \times T_{\mathbf{\boldsymbol{\theta}}}\Theta \to \mathbb{R}\\ & (\mathbf{\boldsymbol{v}}, \mathbf{\boldsymbol{w}}) \mapsto \omega_{\mathbf{\boldsymbol{\theta}}}(\mathbf{\boldsymbol{v}}, \mathbf{\boldsymbol{w}}), \end{aligned}\] which should be thought of as an inner product,[1][2] giving us lengths of vectors \(\|\mathbf{\boldsymbol{v}}\|_{\mathbf{\boldsymbol{\theta}}}^2 := \omega_{\mathbf{\boldsymbol{\theta}}}(\mathbf{\boldsymbol{v}}, \mathbf{\boldsymbol{v}})\) and cosines of angles between two vectors.
Different metrics would give us different steepest descent directions. Indeed, if a metric counts movement along a certain route as covering less distance, then one can go further along that route within a unit budget, and that is bound to affect the direction in which \(L(\mathbf{\boldsymbol{v}})\) loses value the most.
The metric (in the standard basis) can be written as \(\omega_{\mathbf{\boldsymbol{\theta}}}(\mathbf{\boldsymbol{v}}, \mathbf{\boldsymbol{w}}) = \mathbf{\boldsymbol{v}}^\top F_{\mathbf{\boldsymbol{\theta}}} \mathbf{\boldsymbol{w}}\), where the \(ij\)-th entry of the matrix \(F_{\mathbf{\boldsymbol{\theta}}}\) is given by \(\omega_{\mathbf{\boldsymbol{\theta}}}(\mathbf{\boldsymbol{e}}_i, \mathbf{\boldsymbol{e}}_j)\). Riemannian metrics are symmetric and positive definite, and so too is the matrix \(F_{\mathbf{\boldsymbol{\theta}}}\). The Euclidean metric corresponds to the identity matrix.
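A quick sketch in code, with an arbitrary symmetric positive definite matrix standing in for \(F_{\mathbf{\boldsymbol{\theta}}}\) (the particular numbers are made up for illustration):

```python
import numpy as np

F = np.array([[2.0, 0.5],
              [0.5, 1.0]])   # a symmetric positive definite metric matrix
v = np.array([1.0, 0.0])
w = np.array([0.0, 1.0])

euclidean_norm = np.sqrt(v @ v)      # length under the identity (Euclidean) metric
metric_norm = np.sqrt(v @ F @ v)     # length of the same vector under F
cos_angle = (v @ F @ w) / (np.sqrt(v @ F @ v) * np.sqrt(w @ F @ w))

print(euclidean_norm, metric_norm, cos_angle)
```

The same vector \(\mathbf{\boldsymbol{v}}\) has length 1 under the Euclidean metric but length \(\sqrt{2}\) under \(F\), which is exactly the kind of discrepancy that changes which direction counts as steepest.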
We therefore solve the constrained optimization problem \[\begin{aligned} \text{minimize } & L(\mathbf{\boldsymbol{v}}) \\ \text{subject to } & \|\mathbf{\boldsymbol{v}}\|_{\mathbf{\boldsymbol{\theta}}} = 1. \end{aligned}\] In coordinates, the Lagrangian can be written as \[\mathcal{L}(\mathbf{\boldsymbol{v}}, \beta) = L(\mathbf{\boldsymbol{v}}) + \beta(\mathbf{\boldsymbol{v}}^\top F_{\mathbf{\boldsymbol{\theta}}} \mathbf{\boldsymbol{v}} - 1) = \ell(\mathbf{\boldsymbol{\theta}}) + \nabla \ell(\mathbf{\boldsymbol{\theta}})^\top \mathbf{\boldsymbol{v}} + \beta(\mathbf{\boldsymbol{v}}^\top F_{\mathbf{\boldsymbol{\theta}}} \mathbf{\boldsymbol{v}} - 1),\] and setting \(\frac{\partial \mathcal{L}}{\partial \mathbf{\boldsymbol{v}}} = \mathbf{\boldsymbol{0}}\), \(\frac{\partial \mathcal{L}}{\partial \beta} = 0\), we get \[\nabla\ell(\mathbf{\boldsymbol{\theta}}) + 2 \beta F_{\mathbf{\boldsymbol{\theta}}} \mathbf{\boldsymbol{v}} = \mathbf{\boldsymbol{0}}\] together with the constraint \(\mathbf{\boldsymbol{v}}^\top F_{\mathbf{\boldsymbol{\theta}}} \mathbf{\boldsymbol{v}} = 1\). Solving for \(\mathbf{\boldsymbol{v}}\) gives \(\mathbf{\boldsymbol{v}} \propto -F_{\mathbf{\boldsymbol{\theta}}}^{-1} \nabla\ell(\mathbf{\boldsymbol{\theta}})\): the steepest descent direction depends on the metric through \(F_{\mathbf{\boldsymbol{\theta}}}^{-1}\), and the familiar update [eq:gradient_descent] is recovered precisely for the Euclidean metric, where \(F_{\mathbf{\boldsymbol{\theta}}}\) is the identity.
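As a numerical sanity check of this conclusion, the sketch below compares a brute-force search over metric-unit directions with the preconditioned direction \(-F_{\mathbf{\boldsymbol{\theta}}}^{-1}\nabla\ell(\mathbf{\boldsymbol{\theta}})\); the matrix, the gradient vector, and the sampling approach are all illustrative choices of mine.

```python
import numpy as np

rng = np.random.default_rng(0)
F = np.array([[3.0, 0.4],
              [0.4, 1.0]])        # stand-in for the metric matrix F_theta (SPD)
g = np.array([1.0, -2.0])         # stand-in for the gradient at theta

# Brute force: sample many directions, rescale each to unit metric length,
# and keep the one that decreases the linear term g^T v the most.
candidates = rng.normal(size=(200_000, 2))
candidates /= np.sqrt(np.einsum("ij,jk,ik->i", candidates, F, candidates))[:, None]
best = candidates[np.argmin(candidates @ g)]

# Direction suggested by the Lagrangian conditions: v proportional to -F^{-1} g.
v = -np.linalg.solve(F, g)
v /= np.sqrt(v @ F @ v)

print(best, v)  # the two directions approximately coincide
```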
[1] Bilinear functions can also be viewed, without any loss of information, as linear functions from the tensor product, \(\omega_{\mathbf{\boldsymbol{\theta}}} : T_{\mathbf{\boldsymbol{\theta}}}\Theta \otimes T_{\mathbf{\boldsymbol{\theta}}}\Theta \to \mathbb{R}\).
[2] We will also require our metrics to be positive definite, i.e. the inner product satisfies \(\omega_{\mathbf{\boldsymbol{\theta}}}(\mathbf{\boldsymbol{v}}, \mathbf{\boldsymbol{v}}) \geq 0\) for all tangent vectors \(\mathbf{\boldsymbol{v}} \in T_{\mathbf{\boldsymbol{\theta}}}\Theta\), with equality only when \(\mathbf{\boldsymbol{v}}\) vanishes.