But what is a gradient?

The differential of a loss function \(\,\mathrm{d}\ell(\mathbf{\boldsymbol{\theta}})\) is a row vector, and the gradient \(\nabla\ell(\mathbf{\boldsymbol{\theta}})\) is its transpose, a column vector. But what does this all mean?

Vectors

A vector space \(V\) over \(\mathbb{R}\) is a set of objects (called vectors1) with two operations called addition and scalar multiplication.

This can be rather abstract: elements of \(V\) can be thought of as columns of numbers (your usual column vectors), they can be a collection of differential operators (that is fine as long as you can add two differential operators and multiply them by a scalar), matrices themselves are vectors, or they can be abstract tangent vectors attached to a manifold at a point, which are technically defined as directional derivative operators acting on smooth functions on the manifold (this is one of many equivalent ways to define tangent vectors on a manifold, and it is the best one).

As a concrete example, take any set \(X\) and consider the collection of all real-valued functions \(f: X \to \mathbb{R}\). Addition of functions is defined pointwise: the value of \(f + g\) at a point \(x\in X\) is \((f + g)(x) := f(x) + g(x)\), where the addition on the right hand side happens in the real numbers. Similarly \((cf)(x) := c(f(x))\) is the scalar multiplication. So functions form a vector space.

If the vector space is finite dimensional then there is a basis, meaning an ordered collection of linearly independent vectors \(e_1, e_2, \ldots, e_n\) such that any \(v \in V\) can be written uniquely as a linear combination of these basis vectors. That is, there is a unique collection of real numbers \(c_1, c_2, \ldots, c_n \in \mathbb{R}\) such that \(v = c_1 e_1 + c_2 e_2 + \cdots + c_n e_n\). Every basis of \(V\) has the same number of vectors, which we then define to be the dimension of \(V\), denoted \(\dim(V) = n\).

Given a basis of \(V\) we can represent a vector \(v\) by a (column) vector of real numbers \[\label{eq:ebasisrep} v \leftrightsquigarrow \mathbf{\boldsymbol{v}} = \begin{bmatrix} c_1\\ c_2\\ \vdots \\ c_n \end{bmatrix}\] Although it is not fully standard nomenclature, I will be rigid and keep representing elements of a vector space as column vectors. Row vectors will be reserved for covectors, also known as linear functionals or dual vectors (see below). The space of linear functionals is also a vector space, so in that sense what constitutes a vector and what constitutes a covector is not a mathematically justifiable distinction. If you take the linear functionals as your vector space then yesterday’s covectors become today’s vectors and yesterday’s vectors become today’s covectors. Which we call a vector and which we call a covector says more about us, about which object we consider to be more basic and which more derived (not in the sense of derivatives).

Also, although this is not standard notation either, it might be wise to distinguish between the vector \(v\in V\) and its presentation as a column vector \(\mathbf{\boldsymbol{v}} \in \mathbb{R}^n\) by using bold notation with the same letter.

If one changes to another basis \(f_1, \ldots, f_n\) then the same vector can now be written as \(v = d_1f_1 + \cdots + d_n f_n\) where \(d_1, \ldots, d_n \in \mathbb{R}\) are new numbers. Then the vertical-box-of-numbers representation of the vector changes \[\label{eq:fbasisrep} v \leftrightsquigarrow \mathbf{\boldsymbol{v}} = \begin{bmatrix} d_1\\ d_2\\ \vdots \\ d_n \end{bmatrix}\] Because of this, if we were being really strict, we should have written down the basis dependence in the column vectors as \[[\mathbf{\boldsymbol{v}}]_{\mathcal{B}_1} = \begin{bmatrix} c_1\\ c_2\\ \vdots\\ c_n \end{bmatrix}_{\mathcal{B}_1} \qquad [\mathbf{\boldsymbol{v}}]_{\mathcal{B}_2} = \begin{bmatrix} d_1\\ d_2\\ \vdots\\ d_n \end{bmatrix}_{\mathcal{B}_2}\] where we call the bases \(\mathcal{B}_1 = \{e_1, \ldots, e_n\}\) and \(\mathcal{B}_2 = \{f_1, \ldots, f_n\}\). People usually don’t do such a thing, but if there are multiple legitimate bases lying around it might be smart to keep track of the basis dependence.

Remark 1. If there is a slight unease at seeing two different presentations of the same vector, think of the same differential operator, perhaps, or the same tangent arrow stuck to the side of a manifold. It is harder to imagine distinct presentations of a vector if one’s conception of a vector begins and ends with a vertical box of numbers. But it makes perfect sense if vectors are conceived simply and abstractly as elements of some set where you can add the elements and multiply them by real numbers, whatever add and multiply mean, as long as said addition and multiplication satisfy some axioms. In short, vectors are simply elements of a vector space. And we can have many distinct ways of naming them; a vector by any other name would point in the same direction (à la Shakespeare).

Can we calculate how the different column vector presentations (the \(c\)’s and the \(d\)’s) are related? Yes. The idea is to write one set of basis elements in terms of the other. Precisely speaking, write \[f_j \stackrel{\mathcal{B}_1}{\leftrightsquigarrow} [\mathbf{\boldsymbol{f_j}}]_{\mathcal{B}_1} = \begin{bmatrix} a_{1j} \\ a_{2j} \\ \vdots\\ a_{nj} \end{bmatrix}_{\mathcal{B}_1},\] meaning \(f_j = a_{1j} e_1 + \cdots + a_{nj} e_n\) for certain \(a_{ij} \in \mathbb{R}\). Then, creating a matrix (a box of numbers) \(A\) whose \(j\)-th column consists of the numbers coming from \(f_j\) above, we have the relationship \[\label{eq:changeOfBasis} \begin{bmatrix} c_1\\ c_2\\ \vdots \\ c_n \end{bmatrix} = \begin{bmatrix} a_{11} & a_{12} & \cdots & a_{1n} \\ a_{21} & a_{22} & \cdots & a_{2n}\\ \vdots & \vdots & \ddots & \vdots \\ a_{n1} & a_{n2} & \cdots & a_{nn} \end{bmatrix} \begin{bmatrix} d_1 \\ d_2\\ \vdots \\d_n \end{bmatrix}\] Or in other words \([\mathbf{\boldsymbol{v}}]_{\mathcal{B}_1} = A [\mathbf{\boldsymbol{v}}]_{\mathcal{B}_2}\). This is how you connect the two presentations. How to remember which side is which? I always forget. One mnemonic is the following: in basis \(\mathcal{B}_2\) the vector \(v=f_1\) is represented by the standard unit vector \(\begin{bmatrix} 1 & 0 & 0 & \ldots & 0 \end{bmatrix}^\top \in \mathbb{R}^n\), and if this is taken as the \(d\)’s on the right hand side, the multiplication picks out the first column of the matrix, which is indeed the representation of \(f_1\) in the basis \(\mathcal{B}_1\). Similarly for all the other standard unit vectors in \(\mathbb{R}^n\).
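
To make the bookkeeping concrete, here is a minimal numpy sketch (the bases and numbers are invented for illustration, nothing here is canonical) checking that \([\mathbf{\boldsymbol{v}}]_{\mathcal{B}_1} = A [\mathbf{\boldsymbol{v}}]_{\mathcal{B}_2}\) when the columns of \(A\) hold the \(\mathcal{B}_1\)-coordinates of the \(f_j\)’s.

```python
import numpy as np

# Two bases of R^3, stored as columns of matrices (illustrative numbers only):
# E is the standard basis, F is some other basis.
E = np.eye(3)
F = np.array([[1.0, 1.0, 0.0],
              [0.0, 1.0, 1.0],
              [0.0, 0.0, 1.0]])

# Change-of-basis matrix A whose j-th column is [f_j]_{B_1}.
# Since B_1 is the standard basis here, that is just F itself.
A = F

# A vector given by its coordinates d in the basis B_2 = {f_1, f_2, f_3} ...
d = np.array([2.0, -1.0, 3.0])

# ... has coordinates c = A d in the basis B_1, as in the change-of-basis formula.
c = A @ d

# Sanity check: both coordinate columns name the same underlying vector.
v_from_B2 = F @ d   # d_1 f_1 + d_2 f_2 + d_3 f_3
v_from_B1 = E @ c   # c_1 e_1 + c_2 e_2 + c_3 e_3
assert np.allclose(v_from_B1, v_from_B2)
```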

Linear Maps

A linear map \(A:V \to W\) is a function which respects the additive and the scalar multiplication structures on the vector spaces. The image of a vector \(v\in V\) under the function \(A\) is usually denoted by \(Av\) unlike the customary \(A(v)\) for functions, though both are acceptable.

Linear functions (equivalently maps, or transformations2) satisfy \[\begin{aligned} A(v + w) &= Av + Aw \qquad \text{ for all } v, w \in V\\ A(cv) &= c (Av) \qquad \text{ for all } v \in V \text{ and } c \in \mathbb{R}. \end{aligned}\] Note that in the first line the addition on the left hand side is the addition in \(V\) and on the right hand side it is the addition in \(W\) as \(Av, Aw \in W\). Same for the scalar multiplication.

If one chooses a basis \(\mathcal{B}_1 = \{e_1, \ldots, e_n\}\subset V\) in the domain and another basis \(\mathcal{B}_2 = \{f_1, \ldots, f_m\} \subset W\) in the range (I still use the \(e\)’s and \(f\)’s but they have different meaning now, they’re bases of distinct vector spaces) then we can express the relationship between \(v \in V\) written as a column vector \([\mathbf{\boldsymbol{v}}]_{\mathcal{B}_1}\) and the vector \(w = Av \in W\) written as a column vector \([\mathbf{\boldsymbol{w}}]_{\mathcal{B}_2}\) using matrices.

The coefficients of this matrix are given as \[a_{ij} = \langle [\mathbf{\boldsymbol{f_i}}]_{\mathcal{B}_2}, [\mathbf{\boldsymbol{A e_j}}]_{\mathcal{B}_2} \rangle.\] Then constructing the box of numbers \(A = [a_{ij}]_{i = 1, \ldots, m;\, j = 1, \ldots, n}\), i.e. a matrix, we get \(\mathbf{\boldsymbol{w}} = A \mathbf{\boldsymbol{v}}\). Using the same letter \(A\) for the linear map and the box of numbers is an abuse of notation, but it is customary. What one needs to remember is that writing a linear map as a matrix assumes a choice of basis for \(V\) and a choice of basis for \(W\).

Here what we mean by the inner product is the dot product of column vectors you know and love (written with respect to the \(\mathcal{B}_2\) basis). This may seem obvious (as many other things in this note).
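
Here is a small numpy sketch of the recipe (the map and the bases are made up for illustration): column \(j\) of the matrix is the \(\mathcal{B}_2\)-coordinate vector of \(A e_j\), and with that matrix in hand coordinates simply get multiplied.

```python
import numpy as np

# A linear map A: R^3 -> R^2, given in standard coordinates (illustrative numbers).
A_std = np.array([[1.0, 2.0, 0.0],
                  [0.0, 1.0, 3.0]])

# A basis B1 of the domain and a basis B2 of the range, stored as columns.
B1 = np.array([[1.0, 1.0, 0.0],
               [0.0, 1.0, 1.0],
               [0.0, 0.0, 1.0]])
B2 = np.array([[2.0, 1.0],
               [0.0, 1.0]])

# Column j of the matrix of A w.r.t. (B1, B2) is [A e_j]_{B2},
# i.e. the B2-coordinates of the image of the j-th B1 basis vector.
A_matrix = np.linalg.solve(B2, A_std @ B1)

# Check: for a random v, the coordinates transform consistently.
rng = np.random.default_rng(0)
v_coords_B1 = rng.normal(size=3)          # [v]_{B1}
v_std = B1 @ v_coords_B1                  # v in standard coordinates
w_std = A_std @ v_std                     # w = A v in standard coordinates
w_coords_B2 = np.linalg.solve(B2, w_std)  # [w]_{B2}
assert np.allclose(A_matrix @ v_coords_B1, w_coords_B2)
```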

Linear Functionals

Linear functionals are linear maps from a vector space \(V\) to the one dimensional vector space \(\mathbb{R}\). The space of linear maps between two vector spaces forms a vector space, where addition is defined pointwise. Therefore the linear functionals on a vector space \(V\) also form a vector space, denoted by one of \(V^*\), \(V'\), or my personal favorite, \(V^\vee\).

Linear functionals are also called covectors; in a sense they are companions to vectors. Given a \(\lambda \in V^\vee\) (it is common to use small Greek letters for linear functionals) and \(v \in V\) \[\label{eq:pairing} \lambda(v) \in \mathbb{R}\text{ is also denoted by } \langle \lambda, v \rangle.\] This is not an inner product; it is simply the pairing given by the evaluation of the linear functional \(\lambda\) at the vector \(v\). But it uses the same notation as an inner product. The reason one would prefer this abuse of notation is that if one had an inner product \(\langle \cdot, \cdot \rangle\) then every vector \(w\in V\) would define a linear functional by \(\lambda_w : v \mapsto \langle w, v \rangle\). In a finite dimensional space, if one simply chooses a non-degenerate inner product, then in fact every \(\lambda \in V^\vee\) is of the form \(\lambda = \lambda_w\) defined above, for some vector \(w \in V\). This is called the Riesz representation theorem, and it actually holds in infinite dimensional Hilbert spaces too3.

The bracket simply evaluates the linear functional at the vector, and is called a pairing. Although equivalent, using a pairing \(\langle \cdot, \cdot \rangle : V^\vee \times V \to \mathbb{R}\) is preferable to using an inner product \(\langle\cdot, \cdot \rangle : V \times V \to \mathbb{R}\) (imho) because a pairing doesn’t make an implicit choice of an inner product. It lets us be very explicit when the time comes to make that choice.

Given a vector space \(V\) and a basis \(\mathcal{B} = \{e_1, \ldots, e_n\}\) we have a dual basis \(\mathcal{B}^\vee = \{\delta_1, \ldots, \delta_n\}\) of \(V^\vee\) where the dual basis vectors satisfy \[\langle \delta_j, e_i \rangle = \begin{cases} 0 & \text{ if } i \neq j, \\ 1 &\text{ if } i = j. \end{cases}\]

In matrix notation covectors are represented by \(1\times n\) matrices, i.e. row vectors, so we have \[\lambda \leftrightsquigarrow \begin{bmatrix} y_1 & y_2 & \cdots & y_n \end{bmatrix} \quad \text{ if } \quad \lambda = y_1 \delta_1 + y_2 \delta_2 + \cdots + y_n \delta_n \in V^\vee.\] But of course if we think of \(V^\vee\) as a vector space itself, with basis \(\{\delta_1, \delta_2, \ldots, \delta_n\}\), then we should have represented \(\lambda\) in vector notation as a column vector \[[\lambda]_{\mathcal{B}^\vee} = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{bmatrix}.\]
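
In coordinates the pairing is literally “row times column”. A tiny numpy sketch with made-up numbers:

```python
import numpy as np

# Coordinates of a covector lambda in the dual basis (a row) and of a vector v
# in the corresponding basis (a column); the numbers are illustrative.
lam_row = np.array([[1.0, -2.0, 0.5]])   # 1 x n
v_col = np.array([[3.0], [1.0], [4.0]])  # n x 1

# The pairing <lambda, v> is just the 1x1 matrix product row @ column.
pairing = (lam_row @ v_col).item()

# The dual basis covector delta_j, as a row, is the j-th standard unit row vector:
# pairing it with v picks out the j-th coordinate of v.
delta_2 = np.array([[0.0, 1.0, 0.0]])
assert np.isclose((delta_2 @ v_col).item(), v_col[1, 0])
```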

If things are confusing to you at this point, they are only confusing because they are simple. Many things overlap, and one has to carefully peel off the different layers of meaning that sit on top of one another like the pile of coats on top of your bed at a large family gathering.

Also insisting on presenting a covector horizontally vs. vertically is a personal choice, and it is impossible to insist on this consistently since covectors are also vectors in the end.

Derivatives as linear maps

Let \(M\) and \(N\) be two manifolds. Then consider a differentiable map \(f: M\to N\). The definition of the derivative (equivalently, the differential) of \(f\) at a point is the best linear approximation to the function at that point. What do we mean by that? If \(M\) is an open subset of \(\mathbb{R}^n\) and the manifold \(N\) is an open subset of \(\mathbb{R}^m\) then we look at \[\label{eq:derivativeDifference} \operatorname{Diff}(\mathbf{\boldsymbol{h}},t) := f(\mathbf{\boldsymbol{x}} + t \mathbf{\boldsymbol{h}}) - f(\mathbf{\boldsymbol{x}}) - tA\mathbf{\boldsymbol{h}}\] for all \(\mathbf{\boldsymbol{h}}\in \mathbb{R}^n\) and small enough \(t\in \mathbb{R}\) such that \(\mathbf{\boldsymbol{x}} + t\mathbf{\boldsymbol{h}}\in M\) so \(f\)-ing it still makes sense. Here \(A: \mathbb{R}^n \to \mathbb{R}^m\) is a linear map.

If \(f\) is differentiable at \(\mathbf{\boldsymbol{x}}\) then \(\operatorname{Diff}(\mathbf{\boldsymbol{h}},t) = o(t)\) as \(t\to 0\) for all \(\mathbf{\boldsymbol{h}} \in \mathbb{R}^n\) for one linear map \(A\) and for one \(A\) only. In other words there is only one linear map which captures the first Taylor approximation of \(f\).

Such an \(A\), if it exists, is called the first derivative (or differential) of \(f\) at \(\mathbf{\boldsymbol{x}}\). Writing \(p\) for the point, it is denoted by one of the many symbols \[\,\mathrm{d}f_p, \,\mathrm{d}f\big|_p, \,\mathrm{d}f(p), Df(p), Df\big|_p, f'(p) .\]
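
A quick numerical sanity check of the \(o(t)\) condition, with a function, point, and direction invented for illustration: for the true Jacobian the ratio \(\|\operatorname{Diff}(\mathbf{\boldsymbol{h}},t)\|/t\) goes to zero as \(t \to 0\), while for a perturbed matrix it does not.

```python
import numpy as np

# f: R^2 -> R^2 with its Jacobian at x computed by hand (example chosen for illustration).
def f(x):
    return np.array([x[0] * x[1], np.sin(x[0]) + x[1] ** 2])

def jacobian_f(x):
    return np.array([[x[1], x[0]],
                     [np.cos(x[0]), 2 * x[1]]])

x = np.array([0.3, -1.2])
h = np.array([1.0, 2.0])
A = jacobian_f(x)
A_wrong = A + np.array([[0.5, 0.0], [0.0, 0.0]])

for t in [1e-1, 1e-2, 1e-3, 1e-4]:
    diff_good = f(x + t * h) - f(x) - t * (A @ h)
    diff_bad = f(x + t * h) - f(x) - t * (A_wrong @ h)
    # ||Diff(h,t)|| / t -> 0 only for the true derivative A.
    print(t, np.linalg.norm(diff_good) / t, np.linalg.norm(diff_bad) / t)
```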

For general manifolds the derivative at a point \(p \in M\) is a linear map \(\,\mathrm{d}f_p : T_p M \to T_{f(p)} N\) between the corresponding tangent spaces. The story is essentially the same, since the notions of derivative and tangent vector on manifolds are defined by pulling everything back to the \(\mathbb{R}^n, \mathbb{R}^m\) case using charts. Everything is defined through coordinate charts, which live on open subsets of Euclidean space, so no real generality was lost: by considering open subsets of \(\mathbb{R}^n, \mathbb{R}^m\) we understood things just as we would in the general case.

There is a historical reason for calling this map a differential vs. a derivative. But that doesn’t matter. They are basically the same thing.

The gradient, however, is another beast. The gradient is a vector in \(T_pM\); it is not a covector. In order to get a vector from a covector (for functions \(f: M\to N\) with \(N \subset \mathbb{R}\) the derivative is a cotangent vector) we need an inner product. That is for the next section.

Let us instead cap off this section with what these covectors look like on a manifold if one were to choose a basis. First we choose a basis of vectors in the tangent space \(T_p M\). Let us call it \(\mathcal{B} = \{h_1, \ldots, h_m\} \subset T_pM\). Given an \(f : M \to \mathbb{R}\) the differential is a linear map \(T_pM \to \mathbb{R}\) (since \(T_{f(p)}\mathbb{R}\cong \mathbb{R}\)) and with respect to this basis this linear map is given by the matrix \[\,\mathrm{d}f_p \leftrightsquigarrow \begin{bmatrix} \,\mathrm{d}f_p[h_1] & \,\mathrm{d}f_p[h_2]& \cdots & \,\mathrm{d}f_p[h_m] \end{bmatrix}.\] The quantities \(\,\mathrm{d}f_p[h_i]\) are numbers which give the directional derivative of \(f\) in the direction of \(h_i\). The directional derivative of \(f\) at \(p\) in the direction of \(h\in T_pM\) can be computed using any smooth path \(\gamma\) passing through \(p\) in the direction of \(h\), that is \(\gamma: (-1,1) \to M\), \(\gamma(0) = p\) and \(\gamma'(0) =h \in T_pM\). Then \(f \circ \gamma: (-1,1) \to \mathbb{R}\) and we can take the standard 1-dimensional derivative \((f \circ \gamma)'(0) \in \mathbb{R}\). That derivative, that number, is the directional derivative \(\,\mathrm{d}f_p [h] \in \mathbb{R}\) (which is independent of the choice of \(\gamma\) as long as \(\gamma(0) = p\) and \(\gamma'(0) = h\)).

One intuitive way to define tangent vectors on the manifold \(M\) at \(p\) is via smooth paths \(\gamma\) on \(M\) that pass through \(p\). One intuitively thinks of \(\gamma'(0)\) as the tangent vector.

So to reiterate, if \([\mathbf{\boldsymbol{h}}]_{\mathcal{B}} = \begin{bmatrix} c_1 & c_2 & \cdots & c_m \end{bmatrix}^\top\) is the vector representation of the tangent vector \(h \in T_pM\) in the basis \(\mathcal{B}\) (to reiterate once again this means \(h = c_1 h_1 + c_2 h_2 + \cdots + c_m h_m\)) then the directional derivative of a function \(f : M \to \mathbb{R}\) in the direction of \(h\) can be computed as \[\begin{aligned} \,\mathrm{d}f_p [h] &= [\,\mathrm{d}f_p]_{\mathcal{B}^\vee} [\mathbf{\boldsymbol{h}}]_{\mathcal{B}} = \begin{bmatrix}\,\mathrm{d}f_p[h_1] & \,\mathrm{d}f_p[h_2]& \cdots & \,\mathrm{d}f_p[h_m] \end{bmatrix} \begin{bmatrix} c_1 \\ c_2 \\ \vdots \\ c_m \end{bmatrix} \\ &= c_1 \,\mathrm{d}f_p[h_1] + c_2 \,\mathrm{d}f_p[h_2] + \cdots + c_m \,\mathrm{d}f_p[h_m] . \end{aligned}\]
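
A minimal sketch of the path computation in the flat case \(M \subset \mathbb{R}^2\) (the function, point, and path are invented for illustration): the quadratic term in \(\gamma\) changes nothing, since only \(\gamma(0)\) and \(\gamma'(0)\) matter, and the answer agrees with pairing the row of partials with \(\mathbf{\boldsymbol{h}}\).

```python
import numpy as np

# f: R^2 -> R (the "manifold" is just an open subset of R^2 here).
def f(x):
    return x[0] ** 2 * x[1] + np.exp(x[1])

p = np.array([1.0, 0.5])
h = np.array([2.0, -1.0])

# A path gamma with gamma(0) = p and gamma'(0) = h; the quadratic term is there
# to emphasize that only the first-order data of the path matters.
def gamma(t):
    return p + t * h + 0.5 * t ** 2 * np.array([1.0, 1.0])

# Directional derivative as (f o gamma)'(0), via a symmetric finite difference.
eps = 1e-6
df_p_h = (f(gamma(eps)) - f(gamma(-eps))) / (2 * eps)

# Same number from the row vector of partials paired with h.
partials = np.array([2 * p[0] * p[1], p[0] ** 2 + np.exp(p[1])])  # [df_p] as a row
assert np.isclose(df_p_h, partials @ h, atol=1e-5)
```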

Bilinear forms, Inner Products (i.e. metrics)

A bilinear form on a vector space \(V\) is a function \[\omega : V \times V \to \mathbb{R}\] satisfying linearity in both variables, staying true to its name, i.e. \[\begin{aligned} && \omega(v_1 + v_2, w) = \omega(v_1, w) + \omega(v_2, w) && \omega(cv, w) = c\omega(v,w)\\ && \omega(v, w_1 + w_2) = \omega(v, w_1) + \omega(v, w_2) && \omega(v, cw) = c\omega(v,w) \end{aligned}\] for all \(v,v_1, v_2, w, w_1, w_2 \in V\) and for all \(c \in \mathbb{R}\).

A bilinear form is symmetric if \(\omega(v,w) = \omega(w,v)\), and is called non-degenerate if \(\omega(v,w) = 0\) for all \(w \in V\) implies \(v = 0\).

An inner product is a symmetric bilinear form satisfying the positive definiteness property, which is stronger than nondegeneracy: \(\omega(v,v)>0\) for every nonzero vector \(v \in V\). An inner product is also sometimes called a metric, but usually the term (Riemannian) metric is reserved for a manifold. A Riemannian metric on a manifold \(M\) is a choice of inner product \(\omega_p\) on the vector space \(T_p M\) for every \(p \in M\), and this choice needs to vary smoothly with the point \(p\).

Remark 2. Just the word metric, without the Riemannian qualifier, is something quite different. It refers to an abstract distance function \(d\) on any set \(X\), satisfying a handful of properties like the triangle inequality.

A bilinear form on a finite dimensional space can be represented by a matrix as follows. Given a basis \(\mathcal{B} = \{e_1, e_2, \ldots, e_n\}\) of \(V\) we look at the Gram matrix \(F = F_\omega\) whose \(ij\)-th entry is \(F_{ij} = \omega(e_i, e_j)\); then \[\omega(v,w) = \mathbf{\boldsymbol{v}}^\top F \mathbf{\boldsymbol{w}}\] (recall \(\mathbf{\boldsymbol{v}} = [\mathbf{\boldsymbol{v}}]_{\mathcal{B}}\) and \(\mathbf{\boldsymbol{w}} = [\mathbf{\boldsymbol{w}}]_{\mathcal{B}}\)).
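
A small numpy sketch (with an illustrative Gram matrix, not one coming from anywhere in particular) of \(\omega(v,w) = \mathbf{\boldsymbol{v}}^\top F \mathbf{\boldsymbol{w}}\) and its bilinearity and symmetry:

```python
import numpy as np

# A symmetric, positive definite Gram matrix F (illustrative numbers), so that
# omega(v, w) = v^T F w defines an inner product on R^3 in the chosen basis.
F = np.array([[2.0, 1.0, 0.0],
              [1.0, 3.0, 0.0],
              [0.0, 0.0, 1.0]])

def omega(v, w):
    return v @ F @ w

v = np.array([1.0, 0.0, 2.0])
w = np.array([0.0, 1.0, -1.0])

# Bilinearity and symmetry in action.
assert np.isclose(omega(2 * v, w), 2 * omega(v, w))
assert np.isclose(omega(v + w, w), omega(v, w) + omega(w, w))
assert np.isclose(omega(v, w), omega(w, v))
```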

If \(\omega\) is symmetric, meaning \(\omega(v,w) = \omega(w,v)\), then \(F\) is symmetric as a matrix. If \(\omega\) is nondegenerate then \(F\) has full rank, i.e. it is invertible.

Remark 3. There is a quite important distinction between this matrix \(F\) representing a bilinear form, and considering a matrix as a linear map. It is not just a philosophical distinction.

If you change the basis of \(V\) to \(\mathcal{B}' = \{f_1, \ldots, f_n\}\) then the bilinear form in the new basis can be represented by the matrix \[P^\top F P\] where \(P\) is the change of basis matrix (with columns \([\mathbf{\boldsymbol{f_i}}]_\mathcal{B}\)). Indeed \(\omega(v,w) = [\mathbf{\boldsymbol{v}}]_{\mathcal{B}}^\top F [\mathbf{\boldsymbol{w}}]_{\mathcal{B} } = (P[\mathbf{\boldsymbol{v}}]_{\mathcal{B}'})^\top F (P[\mathbf{\boldsymbol{w}}]_{\mathcal{B}'}) = [\mathbf{\boldsymbol{v}}]_{\mathcal{B}'}^\top (P^\top F P) [\mathbf{\boldsymbol{w}}]_{\mathcal{B}' }\). So the same bilinear form, when considered with respect to the basis \(\mathcal{B}'\), would be given by the matrix \(P^\top FP\).

However if \(F\) is considered as a linear map, the same linear map after a change of basis to \(\mathcal{B}'\) would be written with the matrix \[P^{-1}F P.\] We would only have \(P^{-1}= P^\top\) for orthogonal change of bases.
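
The two transformation rules are easy to confuse and easy to check numerically. A minimal sketch with made-up matrices:

```python
import numpy as np

rng = np.random.default_rng(1)

F = np.array([[2.0, 1.0],    # Gram matrix of a bilinear form in basis B
              [1.0, 3.0]])
P = np.array([[1.0, 1.0],    # change of basis: columns are [f_i]_B
              [0.0, 1.0]])

v_new = rng.normal(size=2)          # [v]_{B'}
w_new = rng.normal(size=2)          # [w]_{B'}
v_old, w_old = P @ v_new, P @ w_new # the same vectors in the old basis

# Bilinear form: the new Gram matrix is P^T F P, not P^{-1} F P.
assert np.isclose(v_old @ F @ w_old, v_new @ (P.T @ F @ P) @ w_new)

# Linear map: the same matrix, now read as a map, transforms by conjugation.
F_as_map_new = np.linalg.inv(P) @ F @ P
assert np.allclose(P @ (F_as_map_new @ v_new), F @ v_old)
```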

The musical isomorphisms.

Even though the matrix of a bilinear form is not a linear map, one can naturally create a linear map \(\flat : V \to V^\vee\) out of a bilinear form, making use of the fact that for every \(v \in V\) the function \(\omega(\cdot, v)\) belongs to \(V^\vee\) and that this correspondence is linear in \(v\). In coordinates: \[\begin{aligned} \flat : V & \longrightarrow V^\vee\\ \mathbf{\boldsymbol{v}}& \longmapsto (F\mathbf{\boldsymbol{v}})^\top. \end{aligned}\] Its inverse (which exists if \(\omega\) is nondegenerate) is given in coordinates by \[\begin{aligned} \sharp : V^\vee &\longrightarrow V\\ \mathbf{\boldsymbol{\xi}} &\longmapsto F^{-1}\mathbf{\boldsymbol{\xi}}^\top. \end{aligned}\] Here we took \(\xi\) as a row vector to begin with.
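
In coordinates \(\flat\) and \(\sharp\) are one matrix multiplication (and one transpose) each. A small sketch with an illustrative symmetric \(F\):

```python
import numpy as np

F = np.array([[2.0, 1.0],   # Gram matrix of a nondegenerate bilinear form (illustrative)
              [1.0, 3.0]])

def flat(v):
    """v in V  ->  the covector omega(., v), as a row vector."""
    return (F @ v).reshape(1, -1)

def sharp(xi_row):
    """A covector (row vector)  ->  the vector representing it via omega."""
    return np.linalg.solve(F, xi_row.reshape(-1))

v = np.array([1.0, -2.0])
xi = flat(v)
assert np.allclose(sharp(xi), v)    # sharp inverts flat

# flat(v) really is omega(., v): pairing it with any u gives u^T F v.
u = np.array([0.5, 4.0])
assert np.isclose((xi @ u).item(), u @ F @ v)
```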

So let us keep in mind that \(V\) and \(V^\vee\) can be identified, but there are many ways to connect them. And identifying \(V\) with \(V^\vee\) as vector spaces using a linear map \(A : V \to V^\vee\) is equivalent to the musical isomorphism with respect to a choice of metric whose Gram matrix is \(A\) (assuming you choose the obvious bases). This all depends on a choice of inner product, and is not canonical4.

Gradients... Finally!

So a loss function \(\ell\) is a function from the parameter manifold \(\Theta\) to the reals. Specialize to the case \(\Theta = \mathbb{R}^P\) for convenience. The tangent space at a point \(\mathbf{\boldsymbol{\theta}}\) is then also identified with \(\mathbb{R}^P\) and we can choose the standard coordinate basis. The derivative of \(\ell\) at \(\mathbf{\boldsymbol{\theta}} \in \mathbb{R}^P\) is a cotangent vector \[\,\mathrm{d}\ell_{\mathbf{\boldsymbol{\theta}}} \in T_{\mathbf{\boldsymbol{\theta}}}^*\Theta\] i.e. a covector of the tangent space \(T_{\mathbf{\boldsymbol{\theta}}}\Theta\) of the parameter manifold at the point \(\mathbf{\boldsymbol{\theta}}\). In the standard dual basis it is given as the row vector \[[\,\mathrm{d}\ell_{\mathbf{\boldsymbol{\theta}}}] = \begin{bmatrix} \partial_1 \ell (\mathbf{\boldsymbol{\theta}}) & \cdots & \partial_P \ell(\mathbf{\boldsymbol{\theta}}) \end{bmatrix}\] where \(\partial_i= \frac{\partial}{\partial \theta_i}\) is shorthand for the partial derivative with respect to the \(i\)th coordinate.

The parameter manifold is a Riemannian manifold: each tangent space is identified with \(\mathbb{R}^P\) carrying the standard Euclidean inner product (i.e. the dot product). The Gram matrix \(F\) of this metric, with respect to the standard coordinate basis, is simply the identity matrix. In other words, the musical isomorphism \(\sharp\) giving us vectors from covectors is simply the transpose.

The gradient is defined as \(\nabla\ell(\mathbf{\boldsymbol{\theta}}) := \sharp (\,\mathrm{d}\ell_{\mathbf{\boldsymbol{\theta}}})\) and so, in standard coordinate basis it is given as \[[\nabla \ell (\mathbf{\boldsymbol{\theta}}) ] = [\,\mathrm{d}\ell_{\mathbf{\boldsymbol{\theta}}}]^\top = \begin{bmatrix} \partial_1 \ell(\mathbf{\boldsymbol{\theta}}) \\ \vdots\\ \partial_P \ell(\mathbf{\boldsymbol{\theta}}) \end{bmatrix}.\] This is the gradient. When the basis is understood, we will drop the brackets.
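
To close the loop numerically, here is a minimal sketch (the loss and the alternative metric are invented for illustration): the differential is a row of partials, the Euclidean gradient is its transpose, and under a different Riemannian metric the same differential sharps to a different gradient vector.

```python
import numpy as np

# An illustrative loss on R^2 (not any particular model's loss).
def loss(theta):
    return theta[0] ** 2 + 3.0 * theta[0] * theta[1]

theta = np.array([1.0, 2.0])

# The differential as a row vector of partials, via central differences.
eps = 1e-6
partials = np.array([
    (loss(theta + eps * e) - loss(theta - eps * e)) / (2 * eps)
    for e in np.eye(2)
])
d_loss_row = partials.reshape(1, -1)

# Euclidean metric (Gram matrix = identity): the gradient is just the transpose.
grad_euclidean = d_loss_row.T.reshape(-1)

# Under a different metric with Gram matrix F, sharp is F^{-1} applied to the
# transposed row, so the "gradient" changes even though the differential does not.
F = np.array([[2.0, 0.0],
              [0.0, 0.5]])
grad_F = np.linalg.solve(F, d_loss_row.reshape(-1))

print(grad_euclidean)   # approx [2*1 + 3*2, 3*1] = [8, 3]
print(grad_F)           # approx [4, 6]
```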

After all is said and done, we’re back to what we knew.

Much ado about nothing!

Just kidding, I think we can now stand taller after our journey to abstractmathland and back again.


  1. This is rather funny and such a mathematician move; one defines a vector as an element of a vector space, and not the other way around. Normal people define things by their constituents. A vector by itself, not belonging to any vector space, is no vector.↩︎

  2. Functions are the most fundamental concept in mathematics and fittingly have many names that reflect the nuances of the ways they may appear to us: we can call them maps, or transformations, operators, and morphisms as well. Ultimately they are all functions, i.e. \(f : X \to Y\) is a function that connects every element \(x \in X\) to a unique element \(y \in Y\).

    If we think of \(x\) and \(y\) as quantities that relate to one another through some formula, we simply use the word function; if there is some geometric intuition we can call them maps or mappings (or charts, as in the case of patches of a manifold); if the domain and range of the function are the same and the context is geometric, then the word transformation is apt to give us the intuition of transforming (stretching, skewing, rotating, etc.) the input. Operators operate linearly on inputs.

    Ultimately they are all functions, and nothing more than functions, so the namings are just psychology. The word morphism is a bit more than a function; it also implies that there is some structure in both the domain and the range that is being preserved. So linear maps (thinking of vectors as geometric, thus we use the word map, and linear is an adjective specifying a property of the map) are also called vector space homomorphisms, and if the map goes from a vector space to itself then it is a vector space endomorphism.

    Some of these distinctions are personal, and not written in stone. But I’m just giving you the vibes around this plethora of words meaning essentially the same thing. And if one were to suck all the life out of mathematical writing, one would just call all of them functions.↩︎

  3. But only for the continuous dual. In finite dimensions linear is continuous.↩︎

  4. On the other hand we have a canonical matching \((V^\vee)^\vee \cong V\), without any choice of metric. \((V^\vee)^\vee\) is called the double dual. One simply associates a vector \(v \in V\) to the linear functional \(\operatorname{ev}_v: V^\vee \to \mathbb{R}\) eating linear functionals on \(V\). The value of \(\operatorname{ev}_v\) on a linear functional \(\lambda\) is given by evaluating \(\lambda\) at \(v\), in other words \(\operatorname{ev}_v(\lambda) = \langle \lambda, v\rangle\). Thus \(v \mapsto \operatorname{ev}_v\) goes from \(V\) to the double dual \((V^\vee)^\vee\). And it is a linear bijection for finite dimensional vector spaces.↩︎