Deep Learning Demystified

From Optimization to Deep Learning

DL in 1 equation: N — a deep model, θ — its parameters, D — data, T — targets for supervision which can be external labels (supervised learning) or data themselves (unsupervised learning)

1. Introduction

(Source)

Nowadays neural networks boast size: hundreds of layers, trillions of parameters and complex micro-structures. Back in January of this year, Google’s Switch Transformer set a new record for AI language models with 1.6T parameters (8 times larger than the 175B parameters found in OpenAI’s GPT-3 model released last year). Recently, with the release of BAAI’s WuDao 2.0 model, the record was set again to 1.7T. Sounds interesting but a little intimidating, right? You are not alone. If you are one of those AI lovers who are eagerly to become acquainted with this otherwise imperative technology, this blog is for you.

The goal is to demystify the unintentional dilemma by linking Deep Learning (DL) to some basic concepts of calculus so that it appears to you more appealing and less daunting. At the end, all that you need to take home about DL are summarized into three cheat sheets. Never will you be afraid of those gigantic AI models because you know that regardless of their complexities, they all boil down to the three tables that are in your possession.

2. Optimization

First thing to know is that deep learning (machine learning in general) is just another use case of parameter estimation via cost function optimization. If you still have doubts, take a look at the equation on the top — that’s what DL is all about. What it says is that a deep model can be thought of as a function N, parameterized by θ, that given the input data d, generates the output N(d). Training means finding θ by comparing N(d) to the known target t using many data-target pairs (d, t). The differences, a.k.a. errors N(d)-t, are aggregated across the whole dataset D and the target set T. The optimal solution θ* corresponds to the one that minimizes the total error. Thanks to the two magic symbols N and θ, all the scary stuff around DL is fenced off.

In the remainder of this section, I will discuss about function optimization in more details, from simple types to more complex ones that are suitable for expressing Deep Neural Networks (DNN).

2.1 Uni-variate Function Optimization

Let’s start with the simplest case — functions with single variables — y=f(x). Assume the derivative of f is non-zero, i.e. f’(x)≠0 (we’ll see why this condition is required in a minute). Our goal is to find its root (where f(x)=0).

An example is f(x)=x whose answer is trivial: x=0. As a preparation for understanding the later sections, let’s solve it instead as an optimization problem. We take the following steps:

Fig. 2.1 Shape of the cost function C(f(x))=x²/2
  1. Construct a cost function C(f)=f(x)²/2 which is the simplest convex function and guarantees a single minimum.
  2. Perform Gradient Descent (GD) on the newly constructed cost function. When f(x) is a complex function, directly optimizing C(f) is hard. In such cases, iterative gradient descent is the approach. The trick is to approximate f(x) by its 1st order Taylor expansion: f(x)=f(x₀)+f’(x₀)(x-x₀), i.e. at a local vicinity at x₀, replacing f(x) by a line with tangent f’(x₀). Then instead of solving f(x)=0, we solve f(x₀)+f’(x₀)(x-x₀)=0. The found x becomes the x₀ of the next iteration and the process keeps repeating until the minimum is reached. Below is the algorithm,
Univariate function optimization by Gradient Descent. Start from a initial guess x0. Iteratively update it with the calculated ▵x until convergence.

When the iteration stops, the last xₜ is the sought root. Pay attention to three details here:

  • θ is used in lieu of -x, to be consistent with the typical parameter representations of neural networks.
  • The derivative f’(x) unnecessarily appears both in numerator and denominator. The reason will become clear soon when dealing with more complex function types where the concept of derivative changes to gradient.
  • Because f’(x) appears in the denominator, it cannot be zero and that’s what the condition f’(x)≠0 is for.

Question: How do we know that the minimum of C is the same as the root of f(x)?

Answer: It is known that as a convex function, C’s minimum is at its stationary point — the place where C’s derivative vanishes. Following the derivative calculation rule: C’(f)=f(x)f’(x). Thus C’(f)=0 leads to f(x)=0 since f’(x)0. In other words, the newly constructed cost function is pointing us to the same solution to the original problem.

Fig. 2.2 GD example. Green triangles: x sequence; red circles: f(x) sequence; dashed red lines: tangent lines at the data points.

Fig. 2.2 shows the gradient descent when minimizing the above cost function. In this case, ▵θ=f(x)/f’(x)=x/2 which means each iteration halves the previous root found so far. Starting from x=2, the next values are x=1, 0.5, 0.25, … etc. In 10 iterations, x reaches to 1/1⁰²~1e-3 and the cost C reduces to 7.63e-6. Notice that the slope of each red dashed line, equivalent to the derivative or gradient of the cost function at the root x of the moment, keeps descending during the process (hence the name Gradient Descent).

2.2 Multivariate Function Optimization

As its name suggests, multivariate functions involve more than one variables. They look like f(x₀, x₁, …, xₙ). Sometimes they are written in a vector form f(X) where X=[x₀, x₁, …, xₙ]. Since X is multi-dimensional, a single equation f(X)=0 does not suffice a solution. Rather, mn equations are required, i.e. f₀(X)=0, f₁(X)=0, …, fₘ(X)=0. Define F=[f₀, f₁, …, fₘ], a vector function, and [0]ₘ=[0, 0, …, 0], a vector of m zeros. Similar to the univariate case, the multivariate root finding problem F(X)=[0] is converted into an optimization one via a cost function C(F(X))=(F・F)/2 where the symboldenotes vector dot product.

One is wondering what the multivariate equivalent to the univariate derivative is. Well, it turns out to be a matrix called Jacobian matrix which is defined as:

where ▽fᵢ(X)=[fᵢ’(x₀), fᵢ’(x₁), …, fᵢ’(xₙ)] is the gradient of the ith function fᵢ w.r.t. variable X for i=0, 1, …, m. Stacking all these gradient vectors together forms the Jacobian matrix.

Following the similar manipulation of the univariate case, the multivariate optimization process is given below:

Multivariate function optimization by GD. Notice the resemblance to the univariate case.

There are a couple of interesting things to notice from above. First of all, structure wise, the univariate and multivariate processes are exactly the same. Second, just like the requirement f’(x)≠0 earlier, J(X)ᵗ J(X) must be of full rank, i.e., J(X) cannot vanish. Third, the adaptation can be accomplished following these swaps:

  • x → X
  • f→F
  • f’(x)→J(X) or its transpose J(X)
  • 1/(f’(x)f’(x))inverse of J(X)ᵗ J(X)

Fig. 2.3 demonstrates the gradient descent of minimizing a bivariate function f(X)=x₀²+x₁², starting from the top position and ending at the bottom center.

GD of function x₀²+x₁², starting from (-2,2), ending at (0,0), following the red trajectory. (Source)Fig. 2.3 GD of function x₀²+x₁², starting from (-2,2), ending at (0,0), following the red trajectory. (Source)

Structure-from-motion as an example

A 3D point can be recovered from a connection — multiple 2D image pixels connected to that point. (Source)

A classical computer vision problem is Structure-From-Motion , i.e. finding the 3D coordinates of a point (structure) from multiple images of a moving camera (motion). The left figure illustrates the recovering of the 3D coordinates (x, y, z) of a point X (i.e. n=3) on the cup handle from a connection of its three 2D images.

Knowing the camera projection matrices Pᵢ and the point’s 2D image pixels (uᵢ, vᵢ), i=1,2,3, the equation is PᵢX=(uᵢ, vᵢ), or equivalently PᵢX-(uᵢ, vᵢ)=[0, 0]. Each image provides 2 equations. Three images totally provide 6 equations, therefore m=6. The optimization problem is:

Cost function of an example SFM problem: 6 equations, 3 unknowns.

In general, with a unknown cameras and b unknown points, assume all points are observed in all cameras, there are 2ab equations and 6a+3b unknowns. When and m and n are large enough, there are more equations than the unknowns. Hence gradient descent can find the solution. One can imagine taking many photos of a building and recovering its full 3D structure.

In this project, we consider the problem of reconstructing entire cities from images harvested from the web. Our aim is to build a parallel distributed system that downloads all the images associated with a city, say Rome, from Flickr.com. After downloading, it matches these images to find common points and uses this information to compute the three dimensional structure of the city and the pose of the cameras that captured these images. All this to be done in a day.
<a href="https://medium.com/media/e8f88c36956159e698bd755b559eea1f/href">https://medium.com/media/e8f88c36956159e698bd755b559eea1f/href</a>

2.3 Stochastic Gradient Descent (SGD)

There are two shortcomings with Gradient Descent:

  1. It requires non-vanishing gradients (of the m functions) which are hard to guarantee in cases of complex functions. Otherwise numerical instability surfaces during computation.
  2. Inverting J(X)ᵗ J(X) is time consuming (complexity O(mn²+n³)), esp. when m, n are huge numbers as in contemporary neural networks.
Fig. 2.4 Comparison between GD and SGD (source): Colored curves indicate contour lines of the cost function. SGD shows a zig-zag pattern and often settles at a suboptimal solution. But it is computationally much more efficient.

While there are computational tricks to handle the first one, the second one requires new techniques. Among them, Stochastic Gradient Descent (SGD) is most widely used. The idea is to take out the matrix inverse and replace it by a non-zero scalar called learning rate (normally written as ). Since now the update vector ▵X is not optimal anymore, small (<0.01) is chosen to limit the damages of poor updates in the hope that they can be corrected by later updates. Fig. 2.4 is a demonstration of SGD where the convergence trajectory shows a zig-zag pattern. In practice, there are ways to alleviate this sub-optimality which will be given shortly.

SGD: matrix inversion and multiplication replaced by a scalar learning rate

2.4 Mini-batch SGD

Even with the elimination of the inverse, the complexity of computing the matrix-vector multiplication J(X)ᵗ F(X) — O(mn) — can still be high because both m, n can be huge. This is addressed by Mini-batch SGD which uses only a small portion — called mini-batches, denoted by B — of the total m equations. Theoretically B can be as small as 1, i.e. a single equation at a time. In practice, it is chosen to fit the resource availability (such as memories and GPU cores). Mini-batches are selected in random order without repetition and run one-by-one until all equations are consumed. Each full run is called an epoch. An outer loop is added to carry out multiple epochs with decreasing learning rates as getting closer to the root. Below is the Mini-batch SGD:

Mini-batch SGD: small batch size allows for many epochs of partial optimizations

The success of the Mini-batch SGD comes from the much smaller size of B which affords running many epochs of partial optimizations to reach the ultimate solution. Hardware permitting, multiple batches can also executed in parallel, resulting in even more speedup.

3. Deep Learning

With function optimization introduced, from this section, I start to describe how it is used in deep learning. Keep in mind that nowadays deep learning pretty much means neural network training. Also one rarely finds any useful neural networks that are not deep. Hereto, the two phrases — deep learning and neural network — are used interchangeably. Neither will supervised or unsupervised learnings be distinguished because essentially both are thought of as neural network training guided by a cost function.

3.1 Neural Network As Composite Function

Fig. 3.1 A two-layer neural network: 3 input, 2 hidden layers, 1 output

Fig. 3.1 at left is a simple neural network with 3 inputs (red), two hidden layers of neurons (blue) and 1 output layer (green). Circles represent neurons. Each has an activation function in it. Arrows between two neighbor layers represent synapses which have weights and are encoded into a matrix that transforms values from the input neurons to the output. Below is a functional view of the same network.

Fig. 3.2 Functional picture of the same neural network above

Specifically, data input X is a 3-vector. W₁ is a 4✕3 matrix (note columns for input and rows for output), W₂ and W₃ are respectively 4✕4 and 1✕4. F₁ and F₂ are both 4-element vector functions, called activation functions. F₃ is 1-element. The cost function C is also 1-element.

Generalization of the representation

One may raise the question: Where’s the CNN (Convolutional Neural Network)?

Answer: Actually CNN is just a special case of the above structure where the transformation matrices W are in diagonally banded forms. The non-zero elements of each row consist the convolution kernel.

A second question is: How about CNN for images?

Answer: Again it is a special case of the above. One can think of a 2D image as a flatten 1D vector. Then the image convolution kernel can be encoded into the rows of W.

Finally, there is the questions: Where’re the bias terms?

Answer: That can be handled by using the concept of homogeneous coordinate which is very common in computer graphics — appending a constant activation function of value 1 to the end of each vector function.

Now comes the first interesting point — writing a neural network as a composite function. Here are some examples:

  • Single layer — C(F(WX))
  • Two layers — C(F₂(WF₁(WX)))
  • Three layers — C(F₃(WF₂(WF₁(WX))))
  • L layers — C(Fₗ(WₗFₗ-₁(…(WF₁(WX)))))

Evaluation of such a composite function happens from inside to outside. For example, C(F₂(WF₁(WX))) means X is first multiplied with W₁, then fed to the activation function F₁. The result is multiplied with W₂ and fed to F₂. Finally the latter is fed to the cost function C.

Now that a neural network is written as a composite function, all that needs to derive is a method of calculating its gradient w.r.t the network parameters, or matrix weights, before applying SGD (or Mini-batch SGD) to solve for them. But before that, some necessary tools need to be put in place, namely, vector outer product and element product.

Vector outer product

Let me use an example to illustrate what an outer product is. Given two vectors A=[a₁, a₂, a₃] and B=[b₁, b₂], their outer product ⊗ is constructed like this: taking the first vector as a column and the second one as a row, and combining them into a matrix with per element multiplication.

|A|=m, |B|=n. Outer product A⊗B is a matrix of size mxn

Vector element product

It is defined between two vectors of the same length. The output is also of the same length.

|A| = |B|. Element product A⨀B is a vector of the same length

3.2 Gradient of Neural Network — Chain Rule

Fig. 3.3 A double layer network

Now comes the second interesting and one of the most important point of this blog — chain rule — which allows the computation of the partial gradients of the cost function w.r.t. any of the parameters. For example, in Fig. 3.3, the parameters are the two matrices V and W. The cost function is C. Chain rule allows computing ▽Cᵥ, C, the partial gradients of C against V and W.

Without diving too much into the technical details, let’s see the result first. Notice how the element product and outer product are used here. A second thing to notice is the recursive nature of the formula which is the basis of the back-propagation to be presented later.

When understanding chain rule in the context of neural network, a key concept is error propagation. During network training, the output is first compared to the target and their difference, ▽C=output-target, is determined then propagated towards the input to calculate the parameter gradients as they are encountered on the way of reversal.

Reference to Fig. 3.3 and imagine▽C’s first backstop right before F. There, it changes to ϵ₁=▽F☉▽C. The partial gradient is ▽C=ϵ₁⊗Gᵗ. At the second backstop which is right before G, the error becomes ϵ₂=▽G☉[Wᵗϵ₁] which gives the partial gradient ▽Cᵥ=ϵ₂⊗Xᵗ.

The chain rule of a general N-layer network is summarized below. More details on the derivation can be found in this excellent tutorial.

Chain rule of a general L-layer neural network

3.3 Back-propagation

Given the general chain rule, it is not hard to attain the back-propagation method for network gradient calculation which is the third interesting thing here. During the forward pass, all layer output and activation gradients are cached. At a step of back-propagation, errors are first transformed by the transpose of the weight matrix, then scaled by the gradient of the activation function of the previous layer. After that, the partial parameter gradient is obtained as the outer product between the updated error vector and the cached output.

Forward and backward propagation based on chain rule

3.4 Training of Neural Network Using Mini-batch SGD

We now have all the tools at hand to train a neural network:

Neural network training using Mini-batch SGD

Fig. 3.4 is a typical SGD cost reduction curve. The fluctuations are consistent with the zig-zag behavior in the trajectory shown earlier in Fig. 2.4. There is a reason for this to happen. How to handle this defect leads to the next topic in terms of improving the performance of SGD based training.

Fig. 3.4 SGD convergence curve (source) — fluctuation as a result of gradient step zip-zag

4. Optimizers

Fig. 4.1 Wiggling appears in a uneven error surface (source)

The wiggle behavior of SGD is because the error surface of the extremely high dimensional cost function is anything but uniform. The images in Fig. 4.1 exhibit this phenomena where the gradient non-uniformity results in the step volatility. Obviously modification is needed for the update step ⍺▽⍬ on either the learning rate or the direction, or both. That’s the job of the different optimizers introduced in this section. A thorough online tutorial can found here.

4.1 Momentum

Fig. 4.2 Momentum uses historical average of gradients as update direction as shown by the smooth blue trajectory (source)

As the previous analysis shows, there are some challenges facing SGD:

  • If using a small learning rate, convergence is slow
  • If using a large learning rate, convergence is fluctuate or even divergence
  • Additionally one has to deal with saddle points where the gradients are close to zero which implies the iteration updates stop at local minima.

The idea is actually straightforward — replacing the current gradient by the average of all historical gradients. The average acts like low-pass filtering that smooths out the fluctuations. At the same time, saddle points are surpassed by the momentum (hence the name of the method). Accumulation of the momentum is done by Vₜ+₁=βV+(1)▽C where β is a decay factor. β is usually set to 0.9 or a larger value which means at each step, 90% or more of the update comes from the momentum. Current gradient only contributes 10% or less. At a saddle point, even though the gradient is 0, update continues at 90% of the previous momentum. Once the saddle point is passed through, new gradients becomes available and the training gets a fresh speed.

Update formulae of Momentum via historical gradient average

4.2 AdaGrad

Momentum only changes the directional component▽⍬ but not the learning rate. Sometimes, it is desirable to adopt per-parameter rates such that sparser parameters (flatter error region) move faster than less sparser ones. That is achieved by AdaGrad (for Adaptive Gradient) by adding per-parameter scaling inversely proportional to the accumulated length of the corresponding component of each gradient.

Update formulae of AdaGrad — index i loops through each component of W. ϵ is a small number to avoid division by zero, usually on the order of 1e−8.

In the above equation, the accumulation at the denominator tries to achieve a similar effect of Momentum’s historical average.

One of AdaGrad’s benefits is that it eliminates the need to periodically adjust the learning rate . Most implementations use a default value of 0.01 and leave it unchanged throughout the epochs.

4.3 RMSProp

AdaGrad’s main weakness is its accumulation of the squared gradients in the denominator — it keeps growing during training — causing the learning rate to shrink and eventually become infinitesimally small, at which point training can no longer progress. RMSProp (Root Mean Square Propagation) resolves this flaw by restricting a window size of past gradients.

Update formulae of RMSProp via the introduction of a decay factor, similar to that of Momentum

It achieves this effect in a clever way without explicitly storing multiple gradients by using a decay factor, just as in Momentum. Since β<1, after many steps, the accumulated weights of the much earlier momenta become so small that their contributions are in effect negligent, sort of like a soft window. RMSProp inherits historical averaging from Momentum and per-parameter scaling from AdaGrad therefore possesses the benefits of both.

4.4 Adam

Finally, Adaptive Moment Estimation combines the advantages of all the previous methods, namely, decay, momentum and per-parameter scaling.

Update formulae of Adam

As V and E are initialized to 0 vectors, the initial update steps can be too small, especially when the learning rates are small (i.e. β₁ and β₂ are close to 1). The two middle equations above are designed to counteract this effect. At early stages when t is small, 1/(1-β) and 1/(1-β) are relatively large. Thus both V and E are enlarged. At later stages, both scales approach to 1, reducing the correction to almost none. The proposed default values are 0.9 for β₁, 0.999 for β₂, and 1e−8 for ϵ.

Fig. 4.3 Comparison of multiple optimizer’s performances (Source)

Fig 4.3 on the right are a comparison of some of the optimizers just introduced. It shows that Adam has the best performance. In particular, the fluctuations seen in the native SGD are largely relieved here.

5. All Maths About DL Are In Three Tables

This blog intends to peel off the convoluted and sometimes intimidating perception on deep learning. It builds a linkage from the basic calculus concepts, namely, derivative and gradient to the latest DL algorithms. It explains the reasons why Mini-batch SGD is used in place of the naive gradient descent as well as the motivations behind the designs of the optimizers that improve the training performance, addressing one issue at a time. Finally, to simply things even more, all maths about DL involved here are packed into the three tables below for you to take away.

5.1 Functional Optimization

Notice the change from derivative to Jacobian, and appearances of mini-batch and learning rate.

5.2 Back-propagation

Notice the use of caching in the forward pass for the later back-propagation, and the elegant structure.

5.3 Optimizers

Notice the migration of the formulae (left to right) and recall what issue each aims to address.

Deep Learning Demystified was originally published in Walmart Global Tech Blog on Medium, where people are continuing the conversation by highlighting and responding to this story.

Article Link: Deep Learning Demystified. From Optimization to Deep Learning | by George Chen | Walmart Global Tech Blog | Medium