HW6

Microeconometrics

Posted by ELVIS on May 19, 2019

Back Propagation

The basic idea of backpropagation is to compute the error between the network's output and the expected value and use it to adjust the network parameters so that this error decreases. In fact, the whole procedure rests on just four equations, which are introduced below.

Prerequisite knowledge - linear algebra

Gradient matrix

Assume that a function \(f:R^{m\times n}\rightarrow R\) maps an input matrix (shape: \(m\times n\)) to a real number. The gradient of \(f\) is defined as:

\[\bigtriangledown_Af\left ( A \right )=\begin{bmatrix} \frac{\partial f\left ( A \right )}{\partial A_{11}} & \frac{\partial f\left ( A \right )}{\partial A_{12}} & \cdots & \frac{\partial f\left ( A \right )}{\partial A_{1n}} \\ \frac{\partial f\left ( A \right )}{\partial A_{21}} & \frac{\partial f\left ( A \right )}{\partial A_{22}} & \cdots & \frac{\partial f\left ( A \right )}{\partial A_{2n}} \\ \vdots & \vdots & \ddots & \vdots \\ \frac{\partial f\left ( A \right )}{\partial A_{m1}} & \frac{\partial f\left ( A \right )}{\partial A_{m2}} & \cdots & \frac{\partial f\left ( A \right )}{\partial A_{mn}} \end{bmatrix}\]

In other words:

\[\left ( \bigtriangledown_Af\left ( A \right ) \right )_{ij}=\frac{\partial f\left ( A \right )}{\partial A_{ij}}\]

Similarly, for a function \(f:R^{n\times 1}\rightarrow R\) whose input is a vector (here a vector refers to a column vector by default unless stated otherwise), we have:

\[\bigtriangledown_xf\left ( x \right ) = \begin{bmatrix} \frac{\partial f\left ( x \right )}{\partial x_1}\\ \frac{\partial f\left ( x \right )}{\partial x_2}\\ \vdots \\ \frac{\partial f\left ( x \right )}{\partial x_n} \end{bmatrix}\]

The premise of the gradient computation here is that the function \(f\) returns a real number; if the function returns a matrix or a vector, the gradient defined above does not exist. For example, for \(f\left ( A \right )=\sum_{i=1}^{m}\sum_{j=1}^{n}A_{ij}^{2}\), which returns a real number, we can compute the gradient matrix. However, for \(f\left ( x \right )=Ax\) with \(A\in R^{m\times n},\ x\in R^{n\times 1}\), the function returns a vector with \(m\) rows and \(1\) column, so the gradient matrix cannot be obtained for \(f\).

By definition, it is easy to get the following properties:

\(\bigtriangledown _{x}\left ( f\left ( x \right ) + g\left ( x \right )\right) =\bigtriangledown_{x}f\left(x\right)+\bigtriangledown_{x}g\left ( x \right )\)
\(\bigtriangledown_{x}\left ( tf\left ( x \right ) \right )=t \bigtriangledown_{x} f\left ( x \right ),\ t\in R\)

With the above knowledge, we can work through an example:

Define the function \(f:R^{m}\rightarrow R,\ f\left ( z \right )=z^{T}z\); then \(\bigtriangledown_{z}f\left ( z \right )=2z\).
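As a quick sanity check, this gradient can be verified numerically with central finite differences. The snippet below is a minimal numpy sketch (the helper name `numerical_gradient` and the test vector are illustrative, not from the text):

```python
import numpy as np

def f(z):
    # f(z) = z^T z maps an m-dimensional vector to a real number
    return z @ z

def numerical_gradient(f, z, eps=1e-6):
    # central differences: (f(z + eps*e_i) - f(z - eps*e_i)) / (2*eps)
    grad = np.zeros_like(z)
    for i in range(z.size):
        e = np.zeros_like(z)
        e[i] = eps
        grad[i] = (f(z + e) - f(z - e)) / (2 * eps)
    return grad

z = np.random.randn(5)
print(np.allclose(numerical_gradient(f, z), 2 * z, atol=1e-4))  # True: the gradient is 2z
```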

Hessian Matrix

Define a function \(f: R^{n}\rightarrow R\) whose input is an n-dimensional vector and whose output is a real number. The Hessian matrix is then defined as the square matrix of second-order partial derivatives of the multivariate function \(f\):

\[\bigtriangledown _{x}^{2}f\left ( x \right )=\begin{bmatrix} \frac{\partial^2 f\left ( x \right )}{\partial x_1^2} & \frac{\partial^2 f\left ( x \right )}{\partial x_1\partial x_2} & \cdots & \frac{\partial^2 f\left ( x \right )}{\partial x_1\partial x_n} \\ \frac{\partial^2 f\left ( x \right )}{\partial x_2\partial x_1} & \frac{\partial^2 f\left ( x \right )}{\partial x_2^2} & \cdots & \frac{\partial^2 f\left ( x \right )}{\partial x_2\partial x_n} \\ \vdots & \vdots & \ddots & \vdots \\ \frac{\partial^2 f\left ( x \right )}{\partial x_n\partial x_1} & \frac{\partial^2 f\left ( x \right )}{\partial x_n\partial x_2} & \cdots & \frac{\partial^2 f\left ( x \right )}{\partial x_n^2} \end{bmatrix}\]

It can be seen from the definition above that the Hessian matrix is symmetric, provided the second-order partial derivatives are continuous (so that the mixed partials are equal).
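As a small illustration (the function is an arbitrary choice, not from the text), take \(f\left ( x \right )=x_1^2+x_1x_2+x_2^2\). The mixed second-order partial derivatives agree, \(\frac{\partial^2 f}{\partial x_1\partial x_2}=\frac{\partial^2 f}{\partial x_2\partial x_1}=1\), and the Hessian is symmetric:

\[\bigtriangledown _{x}^{2}f\left ( x \right )=\begin{bmatrix} 2 & 1\\ 1 & 2 \end{bmatrix}\]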

Short Summary

Let \(b\in R^{n}\), \(x\in R^{n}\), and \(A\in R^{n\times n}\), where \(A\) is a symmetric matrix and \(b,x\) are column vectors. Then,

\[\bigtriangledown _{x}b^{T}x=b\] \[\bigtriangledown _{x}x^{T}Ax=2Ax\] \[\bigtriangledown _{x}^{2}x^{T}Ax=2A\]
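As a concrete check of these identities (with an arbitrarily chosen \(b\) and symmetric \(A\), not taken from the text), let \(b=\begin{bmatrix} 1\\ 2 \end{bmatrix}\) and \(A=\begin{bmatrix} 2 & 1\\ 1 & 3 \end{bmatrix}\). Then \(b^{T}x=x_1+2x_2\) and \(x^{T}Ax=2x_1^2+2x_1x_2+3x_2^2\), so

\[\bigtriangledown _{x}b^{T}x=\begin{bmatrix} 1\\ 2 \end{bmatrix}=b,\quad \bigtriangledown _{x}x^{T}Ax=\begin{bmatrix} 4x_1+2x_2\\ 2x_1+6x_2 \end{bmatrix}=2Ax,\quad \bigtriangledown _{x}^{2}x^{T}Ax=\begin{bmatrix} 4 & 2\\ 2 & 6 \end{bmatrix}=2A\]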

Matrix product and multiplication of corresponding elements

Assume that: \(A=\begin{bmatrix} 1 & 2\\ 3 & 4 \end{bmatrix}\) and \(B=\begin{bmatrix} -1 & -2\\ -3 & -4 \end{bmatrix}\)
Matrix multiplication:

\[AB=\begin{bmatrix} 1\times -1+2\times -3 & 1\times -2+2\times -4\\ 3\times -1+4\times -3 & 3\times -2+4\times -4 \end{bmatrix}=\begin{bmatrix} -7 & -10\\ -15 & -22 \end{bmatrix}\]
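For comparison, the multiplication of corresponding elements (the Hadamard product, written \(\odot\) and used in the backpropagation equations below) multiplies the entries in the same positions:

\[A\odot B=\begin{bmatrix} 1\times -1 & 2\times -2\\ 3\times -3 & 4\times -4 \end{bmatrix}=\begin{bmatrix} -1 & -4\\ -9 & -16 \end{bmatrix}\]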

Principle of gradient descent

What is the meaning of the gradient matrix (vector)? Geometrically, the gradient points in the direction in which the function increases fastest, so moving in the opposite direction decreases the function fastest and leads to a (local) minimum more quickly.
Backpropagation uses this gradient descent principle to iteratively drive the cost function towards its minimum and thereby obtain the final model parameters.
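To make this concrete, here is a minimal gradient-descent sketch in numpy (the quadratic objective, starting point, and learning rate are illustrative assumptions): it repeatedly steps against the gradient of \(f\left ( x \right )=x^{T}Ax\) and converges to the minimum at \(x=0\).

```python
import numpy as np

A = np.array([[3.0, 1.0],
              [1.0, 2.0]])        # symmetric positive-definite, so f has a unique minimum at 0
f = lambda x: x @ A @ x           # objective f(x) = x^T A x
grad_f = lambda x: 2 * A @ x      # its gradient, 2Ax (see the identities above)

x = np.array([4.0, -3.0])         # arbitrary starting point
alpha = 0.1                       # learning rate
for step in range(100):
    x = x - alpha * grad_f(x)     # step against the gradient

print(x, f(x))                    # both approach 0
```

If the learning rate is chosen too large, the iterates overshoot and diverge instead of converging, which is why \(\alpha\) has to be picked with some care.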

Backpropagation principle (four basic equations)

Backpropagation tells us how to change the weights \(\omega\) and biases \(b\) in the network in order to change the value of the cost function. Ultimately this means computing the partial derivatives \(\frac{\partial L\left ( a^{[l]},y \right )}{\partial \omega _{jk}^{[l]}}\) and \(\frac{\partial L\left ( a^{[l]},y \right )}{\partial b _{j}^{[l]}}\).

In order to calculate these two derivatives, we first introduce an intermediate variable \(\delta _{j}^{[l]}\) and call it the error of the \(j^{th}\) neuron in the \(l^{th}\) layer. Backpropagation first computes \(\delta _{j}^{[l]}\) and then relates it back to \(\frac{\partial L\left ( a^{[l]},y \right )}{\partial \omega _{jk}^{[l]}}\) and \(\frac{\partial L\left ( a^{[l]},y \right )}{\partial b _{j}^{[l]}}\).

The next question is how to define each layer's error. If we add a perturbation \(\Delta z_{j}^{[l]}\) to the \(j^{th}\) neuron of the \(l^{th}\) layer and it reduces the loss function, it is a useful perturbation. By choosing \(\Delta z_{j}^{[l]}\) with the opposite sign of \(\delta _{j}^{[l]}=\frac{\partial L\left ( a^{[l]},y \right )}{\partial z _{j}^{[l]}}\), we can always obtain such an improving perturbation.

Based on the above, we define the error of the \(j^{th}\) neuron of the \(l^{th}\) layer as \(\delta _{j}^{[l]}\):

\[\delta _{j}^{[l]}=\frac{\partial L\left ( a^{[l]},y \right )}{\partial z _{j}^{[l]}}\]

Thus, the error vector of each layer can be described as:

\[\delta ^{[l]}=\begin{bmatrix} \delta _{1}^{[l]}\\ \delta _{2}^{[l]}\\ \vdots \\ \delta _{n}^{[l]} \end{bmatrix}\]

Now let us focus on the four basic equations mentioned above:

Equation 1:

\(\delta _{j}^{[L]}=\frac{\partial L}{\partial a_{j}^{[L]}} {\sigma}'\left ( z_{j}^{[L]} \right )\), where \(L\) denotes the output layer. If we use \(\bigtriangledown _{a}L\) to represent the vector of partial derivatives \(\frac{\partial L\left ( a^{[L]},y \right )}{\partial a_{j}^{[L]}}\), the matrix form of the equation is

\[\delta ^{[L]}=\bigtriangledown _{a}L\odot {\sigma}'\left ( z^{[L]} \right )\]

Equation 2:

\(\delta _{j}^{[l]}=\sum_{k}\omega _{kj}^{[l+1]}\delta _{k}^{[l+1]} {\sigma }'\left ( z_{j}^{[l]} \right )\) The matrix form is:
\(\delta ^{[l]}=\left [ \left ( \omega ^{[l+1]} \right )^{T}\delta ^{[l+1]} \right ]\odot {\sigma }'\left ( z^{[l]} \right )\)

Equation 3:

\(\frac{\partial L}{\partial b_{j}^{[l]}}=\delta _{j}^{[l]}\) and \(\frac{\partial L}{\partial \omega _{jk}^{[l]}}=a_{k}^{[l-1]}\delta _{j}^{[l]}\)
Or the matrix form:

\(\frac{\partial L}{\partial b^{[l]}}=\delta ^{[l]}\) and \(\frac{\partial L}{\partial \omega ^{[l]}}=\delta ^{[l]}\left ( a^{[l-1]} \right )^{T}\)

Equation 4:

\(b_{j}^{[l]}\leftarrow b_{j}^{[l]}-\alpha \frac{\partial L}{\partial b_{j}^{[l]}}\) and \(\omega _{jk}^{[l]}\leftarrow \omega _{jk}^{[l]}-\alpha \frac{\partial L}{\partial \omega _{jk}^{[l]}}\)

Or the matrix form:

\(b^{[l]}\leftarrow b^{[l]}-\alpha \frac{\partial L}{\partial b^{[l]}}\) and \(\omega ^{[l]}\leftarrow \omega ^{[l]}-\alpha \frac{\partial L}{\partial \omega ^{[l]}}\), where \(\alpha\) is the learning rate.
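To tie the four equations together, here is a minimal numpy sketch of a single training step for a small fully connected network with sigmoid activations and a quadratic loss \(L=\frac{1}{2}\left \| a^{[2]}-y \right \|^{2}\). The layer sizes, loss, and variable names are illustrative assumptions rather than anything specified above; each block of lines is labeled with the equation it implements.

```python
import numpy as np

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
sigmoid_prime = lambda z: sigmoid(z) * (1.0 - sigmoid(z))

# A tiny 2-layer network: 3 inputs -> 4 hidden -> 2 outputs (column vectors)
rng = np.random.default_rng(0)
W1, b1 = rng.standard_normal((4, 3)), rng.standard_normal((4, 1))
W2, b2 = rng.standard_normal((2, 4)), rng.standard_normal((2, 1))
x, y = rng.standard_normal((3, 1)), rng.standard_normal((2, 1))
alpha = 0.5                                   # learning rate

# Forward pass
z1 = W1 @ x + b1;  a1 = sigmoid(z1)
z2 = W2 @ a1 + b2; a2 = sigmoid(z2)

# Equation 1: output-layer error, delta^[L] = grad_a L ⊙ sigma'(z^[L])
# (for the quadratic loss assumed here, grad_a L = a^[2] - y)
delta2 = (a2 - y) * sigmoid_prime(z2)

# Equation 2: propagate the error backwards, delta^[l] = [(W^[l+1])^T delta^[l+1]] ⊙ sigma'(z^[l])
delta1 = (W2.T @ delta2) * sigmoid_prime(z1)

# Equation 3: gradients of the loss, dL/db^[l] = delta^[l], dL/dW^[l] = delta^[l] (a^[l-1])^T
dW2, db2 = delta2 @ a1.T, delta2
dW1, db1 = delta1 @ x.T, delta1

# Equation 4: gradient-descent update of weights and biases
W2, b2 = W2 - alpha * dW2, b2 - alpha * db2
W1, b1 = W1 - alpha * dW1, b1 - alpha * db1
```

A finite-difference check, as in the earlier gradient snippet, is a simple way to confirm that `dW1`, `db1`, `dW2`, `db2` match the numerical derivatives of the loss.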