From Optimization to Deep Learning

DL in 1 equation:*N*— a deep model,

*θ*— its parameters, D — data, T — targets for supervision which can be external labels (supervised learning) or data themselves (unsupervised learning)

### 1. Introduction

(Source)Nowadays neural networks boast size: hundreds of layers, trillions of parameters and complex micro-structures. Back in January of this year, Google’s Switch Transformer set a new record for AI language models with 1.6T parameters (8 times larger than the 175B parameters found in OpenAI’s GPT-3 model released last year). Recently, with the release of BAAI’s WuDao 2.0 model, the record was set again to 1.7T. Sounds interesting but a little intimidating, right? You are not alone. If you are one of those AI lovers who are eagerly to become acquainted with this otherwise imperative technology, this blog is for you.

The goal is to demystify the unintentional dilemma by linking Deep Learning (DL) to some basic concepts of calculus so that it appears to you more appealing and less daunting. At the end, all that you need to take home about DL are summarized into three cheat sheets. Never will you be afraid of those gigantic AI models because you know that regardless of their complexities, they all boil down to the three tables that are in your possession.

### 2. Optimization

First thing to know is that deep learning (machine learning in general) is just another use case of parameter estimation via cost function optimization. If you still have doubts, take a look at the equation on the top — that’s what DL is all about. What it says is that a deep model can be thought of as a function *N, *parameterized by *θ,* that given the input data *d*, generates the output *N*(*d*). Training means finding *θ* by comparing *N*(*d*) to the known target *t* using many data-target pairs (*d*, *t*). The differences, *a.k.a. *errors *N*(*d*)-*t*, are aggregated across the whole dataset *D* and the target set *T. *The optimal solution *θ** corresponds to the one that minimizes the total error. Thanks to the two magic symbols *N* and *θ*, all the scary stuff around DL is fenced off*.*

In the remainder of this section, I will discuss about function optimization in more details, from simple types to more complex ones that are suitable for expressing Deep Neural Networks (DNN).

**2.1 Uni-variate Function Optimization**

Let’s start with the simplest case — functions with single variables — *y*=*f*(*x*). Assume the derivative of *f* is non-zero, i.e. *f’*(*x*)≠0 (we’ll see why this condition is required in a minute). Our goal is to find its root (where *f*(*x*)=0).

An example is *f*(*x*)=*x* whose answer is trivial: *x*=0. As a preparation for understanding the later sections, let’s solve it instead as an optimization problem. We take the following steps:

- Construct a
**cost function***C*(*f*)=*f*(*x*)²/2 which is the simplest convex function and guarantees a single minimum. - Perform
**Gradient Descent**(GD) on the newly constructed cost function. When*f*(*x*) is a complex function, directly optimizing*C*(*f*) is hard. In such cases, iterative gradient descent is the approach. The trick is to approximate*f*(*x*) by its 1st order Taylor expansion:*f*(*x*)=*f*(*x*₀)+*f’*(*x*₀)(*x*-*x*₀), i.e. at a local vicinity at*x*₀, replacing*f*(*x*) by a line with tangent*f’*(*x*₀). Then instead of solving*f*(*x*)=0, we solve*f*(*x*₀)+*f’*(*x*₀)(*x*-*x*₀)=0. The found*x*becomes the*x*₀ of the next iteration and the process keeps repeating until the minimum is reached. Below is the algorithm,

When the iteration stops, the last *xₜ* is the sought root. Pay attention to three details here:

- ▵
*θ i*s used in lieu of*-*▵*x*, to be consistent with the typical parameter representations of neural networks. - The derivative
*f’*(*x*) unnecessarily appears both in numerator and denominator. The reason will become clear soon when dealing with more complex function types where the concept of derivative changes to gradient. - Because
*f’*(*x*) appears in the denominator, it cannot be zero and that’s what the condition*f’*(*x*)≠0 is for.

**Question**: How do we know that the minimum of C is the same as the root of f*(*x*)*?

**Answer**: It is known that as a convex function, C’s minimum is at its stationary point — the place where C’s derivative vanishes. Following the derivative calculation rule: C’*(*f*)*=f*(*x*)*f’*(*x*). Thus *C’*(*f*)*=*0 leads to *f*(*x*)=0 s*ince f’*(*x*)*≠*0*. In other words, the newly constructed cost function is pointing us to the same solution to the original problem.

Fig. 2.2 shows the gradient descent when minimizing the above cost function. In this case, ▵*θ=f*(*x*)/*f’*(*x*)*=x*/2 which means each iteration halves the previous root found so far. Starting from *x*=2, the next values are *x*=1, 0.5, 0.25, … etc. In 10 iterations, *x* reaches to 1/1⁰²~1e-3 and the cost *C* reduces to 7.63e-6. Notice that the slope of each red dashed line, equivalent to the derivative or gradient of the cost function at the root *x* of the moment, keeps descending during the process (hence the name Gradient Descent).

**2.2 Multivariate Function Optimization**

As its name suggests, multivariate functions involve more than one variables. They look like *f*(*x*₀, *x*₁, …, *xₙ*). Sometimes they are written in a vector form *f*(*X*) where *X*=[*x*₀, *x*₁, …, *xₙ*]. Since *X* is multi-dimensional, a single equation *f*(*X*)=0 does not suffice a solution. Rather, *m*≥*n* equations are required, i.e. *f*₀(*X*)=0, *f*₁(*X*)=0, …, *fₘ*(*X*)=0. Define *F*=[*f*₀, *f*₁, …, *fₘ*], a vector function, and [0]*ₘ=*[0, 0, …, 0], a vector of *m* zeros*.* Similar to the univariate case, the multivariate root finding problem *F*(*X*)=[0]*ₘ* is converted into an optimization one via a cost function *C*(*F*(*X*))=(*F・F*)/2 where the symbol*・*denotes vector dot product.

One is wondering what the multivariate equivalent to the univariate derivative is. Well, it turns out to be a matrix called **Jacobian matrix **which is defined as:

where ▽*fᵢ*(*X*)=[*fᵢ*’(*x*₀), *fᵢ*’(*x*₁), …, *fᵢ’*(*xₙ*)] is the gradient of the ith function *fᵢ *w.r.t. variable *X *for* i=*0, 1, …,* m*. Stacking all these gradient vectors together forms the Jacobian matrix.

Following the similar manipulation of the univariate case, the multivariate optimization process is given below:

Multivariate function optimization by GD. Notice the resemblance to the univariate case.There are a couple of interesting things to notice from above. First of all, structure wise, the univariate and multivariate processes are exactly the same. Second, just like the requirement *f’*(*x*)≠0 earlier, *J*(*X*)*ᵗ J*(*X*) must be of full rank, i.e., *J*(*X*) cannot vanish. Third, the adaptation can be accomplished following these swaps:

*x → X**f→F**f*’(*x*)*→J*(*X*) or its transpose*J*(*X*)*ᵗ**1/(f*’(*x*)*f*’(*x*))*→*inverse of*J*(*X*)*ᵗ J*(*X*)

Fig. 2.3 demonstrates the gradient descent of minimizing a bivariate function *f*(*X*)=*x*₀²+*x*₁², starting from the top position and ending at the bottom center.

*x*₀²+

*x*₁², starting from (-2,2), ending at (0,0), following the red trajectory. (Source)

*Structure-from-motion as an example*

A classical computer vision problem is Structure-From-Motion , i.e. finding the 3D coordinates of a point (structure) from multiple images of a moving camera (motion). The left figure illustrates the recovering of the 3D coordinates (*x*, *y*, *z*) of a point *X *(i.e. *n*=3) on the cup handle from a connection of its three 2D images.

Knowing the camera projection matrices *Pᵢ* and the point’s 2D image pixels (*uᵢ*, *vᵢ*), *i*=1,2,3, the equation is *PᵢX*=(*uᵢ*, *vᵢ*), or equivalently *PᵢX-*(*uᵢ*, *vᵢ*)=[0, 0]. Each image provides 2 equations. Three images totally provide 6 equations, therefore *m*=6. The optimization problem is:

In general, with *a* unknown cameras and *b* unknown points, assume all points are observed in all cameras, there are 2*ab* equations and 6*a*+3*b* unknowns. When and *m* and *n* are large enough, there are more equations than the unknowns. Hence gradient descent can find the solution. One can imagine taking many photos of a building and recovering its full 3D structure.

In this project, we consider the problem of reconstructing entire cities from images harvested from the web. Our aim is to build a parallel distributed system that downloads all the images associated with a city, say Rome, from Flickr.com. After downloading, it matches these images to find common points and uses this information to compute the three dimensional structure of the city and the pose of the cameras that captured these images. All this to be done in a day.<a href="https://medium.com/media/e8f88c36956159e698bd755b559eea1f/href">https://medium.com/media/e8f88c36956159e698bd755b559eea1f/href</a>

**2.3 Stochastic Gradient Descent (SGD)**

There are two shortcomings with Gradient Descent:

- It requires non-vanishing gradients (of the
*m*functions) which are hard to guarantee in cases of complex functions. Otherwise numerical instability surfaces during computation. - Inverting
*J*(*X*)*ᵗ J*(*X*) is time consuming (complexity O(*mn*²+*n*³)), esp. when*m*,*n*are huge numbers as in contemporary neural networks.

While there are computational tricks to handle the first one, the second one requires new techniques. Among them, Stochastic Gradient Descent (SGD) is most widely used. The idea is to take out the matrix inverse and replace it by a non-zero scalar called **learning rate **(normally written as *⍺*). Since now the update vector ▵*X *is not optimal anymore, small *⍺ *(<0.01) is chosen to limit the damages of poor updates in the hope that they can be corrected by later updates. Fig. 2.4 is a demonstration of SGD where the convergence trajectory shows a zig-zag pattern. In practice, there are ways to alleviate this sub-optimality which will be given shortly.

**2.4 Mini-batch SGD**

Even with the elimination of the inverse, the complexity of computing the matrix-vector multiplication *J*(*X*)*ᵗ F*(*X*) — O(*mn*) — can still be high because both *m, n *can be huge. This is addressed by **Mini-batch SGD** which uses only a small portion — called *mini-batches*, denoted by *B* — of the total *m* equations. Theoretically *B* can be as small as 1, i.e. a single equation at a time. In practice, it is chosen to fit the resource availability (such as memories and GPU cores). Mini-batches are selected in random order without repetition and run one-by-one until all equations are consumed. Each full run is called an **epoch**. An outer loop is added to carry out multiple epochs with decreasing learning rates as getting closer to the root. Below is the Mini-batch SGD:

The success of the Mini-batch SGD comes from the much smaller size of *B* which affords running many epochs of partial optimizations to reach the ultimate solution. Hardware permitting, multiple batches can also executed in parallel, resulting in even more speedup.

### 3. Deep Learning

With function optimization introduced, from this section, I start to describe how it is used in deep learning. Keep in mind that nowadays deep learning pretty much means neural network training. Also one rarely finds any useful neural networks that are not deep. Hereto, the two phrases — *deep learning* and *neural network — *are used interchangeably. Neither will supervised or unsupervised learnings be distinguished because essentially both are thought of as neural network training guided by a cost function.

**3.1 Neural Network As Composite Function**

Fig. 3.1 at left is a simple neural network with 3 inputs (red), two hidden layers of neurons (blue) and 1 output layer (green). Circles represent neurons. Each has an activation function in it. Arrows between two neighbor layers represent synapses which have weights and are encoded into a matrix that transforms values from the input neurons to the output. Below is a functional view of the same network.

Fig. 3.2 Functional picture of the same neural network aboveSpecifically, data input *X* is a 3-vector. *W*₁ is a 4✕3 matrix (note columns for input and rows for output), *W*₂ and *W*₃ are respectively 4✕4 and 1✕4. *F*₁ and *F*₂ are both 4-element vector functions, called activation functions. *F*₃ is 1-element. The cost function *C* is also 1-element.

*Generalization of the representation*

One may raise the question: *Where’s the CNN* (Convolutional Neural Network)?

**Answer**: Actually CNN is just a special case of the above structure where the transformation matrices *W* are in diagonally banded forms. The non-zero elements of each row consist the convolution kernel.

A second question is: *How about CNN for images*?

**Answer**: Again it is a special case of the above. One can think of a 2D image as a flatten 1D vector. Then the image convolution kernel can be encoded into the rows of *W*.

Finally, there is the questions: *Where’re the bias terms*?

**Answer**: That can be handled by using the concept of homogeneous coordinate which is very common in computer graphics — appending a constant activation function of value 1 to the end of each vector function.

Now comes the **first** interesting point — writing a neural network as a composite function. Here are some examples:

- Single layer —
*C*(*F*(*WX*)) - Two layers —
*C*(*F*₂(*W*₂*F*₁(*W*₁*X*))) - Three layers —
*C*(*F*₃(*W*₃*F*₂(*W*₂*F*₁(*W*₁*X*)))) *L*layers —*C*(*Fₗ*(*WₗFₗ*-₁(…(*W*₂*F*₁(*W*₁*X*)))))

Evaluation of such a composite function happens from inside to outside. For example, *C*(*F*₂(*W*₂*F*₁(*W*₁*X*))) means *X* is first multiplied with *W*₁, then fed to the activation function *F*₁. The result is multiplied with *W*₂ and fed to *F*₂. Finally the latter is fed to the cost function *C*.

Now that a neural network is written as a composite function, all that needs to derive is a method of calculating its gradient w.r.t the network parameters, or matrix weights, before applying SGD (or Mini-batch SGD) to solve for them. But before that, some necessary tools need to be put in place, namely, vector outer product and element product.

*Vector outer product*

Let me use an example to illustrate what an outer product is. Given two vectors *A*=[*a*₁, *a*₂, *a*₃] and *B*=[*b*₁, *b*₂], their outer product ⊗ is constructed like this: taking the first vector as a column and the second one as a row, and combining them into a matrix with per element multiplication.

*A*|=

*m,*|

*B*|=

*n. O*uter product A⊗B is a matrix of size

*m*x

*n*

*Vector element product*

It is defined between two vectors of the same length. The output is also of the same length.

|*A*| = |

*B*|. Element product A⨀B is a vector of the same length

**3.2 Gradient of Neural Network — Chain Rule**

Now comes the **second** interesting and one of the most important point of this blog — chain rule — which allows the computation of the partial gradients of the cost function w.r.t. any of the parameters. For example, in Fig. 3.3, the parameters are the two matrices *V* and *W*. The cost function is *C*. Chain rule allows computing ▽*Cᵥ, *▽*C*, the partial gradients of *C* against *V* and *W*.

Without diving too much into the technical details, let’s see the result first. Notice how the element product and outer product are used here. A second thing to notice is the recursive nature of the formula which is the basis of the back-propagation to be presented later.

When understanding chain rule in the context of neural network, a key concept is *error propagation*. During network training, the output is first compared to the target and their difference, ▽*C=*output-target, is determined then propagated towards the input to calculate the parameter gradients as they are encountered on the way of reversal.

Reference to Fig. 3.3 and imagine▽*C*’s first backstop right before *F. *There, it changes to *ϵ*₁=▽*F*☉▽*C*. The partial gradient is ▽*C=ϵ*₁⊗*Gᵗ. *At the second backstop which is right before *G*, the error becomes *ϵ*₂=▽*G*☉[*Wᵗϵ*₁] which gives the partial gradient ▽*Cᵥ=ϵ*₂⊗*Xᵗ.*

The chain rule of a general *N-*layer network is summarized below. More details on the derivation can be found in this excellent tutorial.

**3.3 Back-propagation**

Given the general chain rule, it is not hard to attain the back-propagation method for network gradient calculation which is the **third** interesting thing here. During the forward pass, all layer output and activation gradients are cached. At a step of back-propagation, errors are first transformed by the transpose of the weight matrix, then scaled by the gradient of the activation function of the previous layer. After that, the partial parameter gradient is obtained as the outer product between the updated error vector and the cached output.

**3.4 Training of Neural Network Using Mini-batch SGD**

We now have all the tools at hand to train a neural network:

Neural network training using Mini-batch SGDFig. 3.4 is a typical SGD cost reduction curve. The fluctuations are consistent with the zig-zag behavior in the trajectory shown earlier in Fig. 2.4. There is a reason for this to happen. How to handle this defect leads to the next topic in terms of improving the performance of SGD based training.

Fig. 3.4 SGD convergence curve (source) — fluctuation as a result of gradient step zip-zag### 4. Optimizers

Fig. 4.1 Wiggling appears in a uneven error surface (source)The wiggle behavior of SGD is because the error surface of the extremely high dimensional cost function is anything but uniform. The images in Fig. 4.1 exhibit this phenomena where the gradient non-uniformity results in the step volatility. Obviously modification is needed for the update step *⍺▽⍬* on either the learning rate or the direction, or both. That’s the job of the different optimizers introduced in this section. A thorough online tutorial can found here.

**4.1 Momentum**

As the previous analysis shows, there are some challenges facing SGD:

- If using a small learning rate, convergence is slow
- If using a large learning rate, convergence is fluctuate or even divergence
- Additionally one has to deal with saddle points where the gradients are close to zero which implies the iteration updates stop at local minima.

The idea is actually straightforward — replacing the current gradient by the average of all historical gradients. The average acts like low-pass filtering that smooths out the fluctuations. At the same time, saddle points are surpassed by the **momentum** (hence the name of the method). Accumulation of the momentum is done by *Vₜ*+₁*=βV*ₜ*+*(1*-β*)▽*C *where *β* is a decay factor. *β *is usually set to 0.9 or a larger value which means at each step, 90% or more of the update comes from the momentum. Current gradient only contributes 10% or less. At a saddle point, even though the gradient is 0, update continues at 90% of the previous momentum. Once the saddle point is passed through, new gradients becomes available and the training gets a fresh speed.

**4.2 AdaGrad**

Momentum only changes the directional component*▽⍬ *but not the learning rate. Sometimes, it is desirable to adopt per-parameter rates such that sparser parameters (flatter error region) move faster than less sparser ones. That is achieved by AdaGrad (for **Ada**ptive **Grad**ient) by adding per-parameter scaling inversely proportional to the accumulated length of the corresponding component of each gradient.

*ϵ is a small number to avoid division by zero,*usually on the order of 1e−8.

In the above equation, the accumulation at the denominator tries to achieve a similar effect of Momentum’s historical average.

One of AdaGrad’s benefits is that it eliminates the need to periodically adjust the learning rate *⍺*. Most implementations use a default value of 0.01 and leave it unchanged throughout the epochs.

**4.3 RMSProp**

AdaGrad’s main weakness is its accumulation of the squared gradients in the denominator — it keeps growing during training — causing the learning rate to shrink and eventually become infinitesimally small, at which point training can no longer progress. RMSProp (**R**oot **M**ean **S**quare **Prop**agation) resolves this flaw by restricting a window size of past gradients.

It achieves this effect in a clever way without explicitly storing multiple gradients by using a decay factor, just as in Momentum. Since *β<*1, after many steps, the accumulated weights of the much earlier momenta become so small that their contributions are in effect negligent, sort of like a soft window. RMSProp inherits historical averaging from Momentum and per-parameter scaling from AdaGrad therefore possesses the benefits of both.

**4.4 Adam**

Finally, **Ada**ptive **M**oment Estimation combines the advantages of all the previous methods, namely, decay, momentum and per-parameter scaling.

As *V* and *E* are initialized to 0 vectors, the initial update steps can be too small, especially when the learning rates are small (i.e. *β*₁ and *β*₂ are close to 1). The two middle equations above are designed to counteract this effect. At early stages when *t* is small, 1/(1-*β*₁*ᵗ*) and 1/(1-*β*₂*ᵗ*) are relatively large. Thus both *V* and *E *are enlarged. At later stages, both scales approach to 1, reducing the correction to almost none. The proposed default values are 0.9 for *β*₁, 0.999 for *β*₂, and 1e−8 for *ϵ*.

Fig 4.3 on the right are a comparison of some of the optimizers just introduced. It shows that Adam has the best performance. In particular, the fluctuations seen in the native SGD are largely relieved here.

### 5. All Maths About DL Are In Three Tables

This blog intends to peel off the convoluted and sometimes intimidating perception on deep learning. It builds a linkage from the basic calculus concepts, namely, derivative and gradient to the latest DL algorithms. It explains the reasons why Mini-batch SGD is used in place of the naive gradient descent as well as the motivations behind the designs of the optimizers that improve the training performance, addressing one issue at a time. Finally, to simply things even more, all maths about DL involved here are packed into the three tables below for you to take away.

**5.1 Functional Optimization**

**5.2 Back-propagation**

**5.3 Optimizers**

Deep Learning Demystified was originally published in Walmart Global Tech Blog on Medium, where people are continuing the conversation by highlighting and responding to this story.

Article Link: Deep Learning Demystified. From Optimization to Deep Learning | by George Chen | Walmart Global Tech Blog | Medium