With many layers, even a small change in the weights can change the result exponentially. Let’s look at some examples.

Let’s assume a very deep neural network with a linear activation function, and let’s say it has $100$ layers.

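Let’s say that every weight matrix $W$ is $2I$, where $I$ is the identity matrix. Then our prediction will be $2^{100}\times X$, which is an enormous value. A similar thing happens during backpropagation: the gradient is multiplied by the same factor at every layer. This is called ‘Exploding Gradients’.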

Let’s say that every weight matrix $W$ is $0.5I$, where $I$ is the identity matrix. Then our prediction will be $0.5^{100}\times X$, a matrix that is very close to 0. A similar thing happens during backpropagation: the gradient shrinks by the same factor at every layer. This is called ‘Vanishing Gradients’.
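
The two cases above can be checked numerically. This is a small sketch, assuming a 3-dimensional input of ones and no biases; the helper name `forward_linear` is made up for illustration.

```python
import numpy as np

def forward_linear(x, scale, num_layers=100):
    """Pass x through num_layers linear layers, each with W = scale * I."""
    W = scale * np.eye(x.shape[0])
    a = x
    for _ in range(num_layers):
        a = W @ a  # linear activation, so each layer is just a = W a
    return a

x = np.ones(3)
exploded = forward_linear(x, 2.0)   # every entry equals 2**100
vanished = forward_linear(x, 0.5)   # every entry equals 0.5**100

print(exploded[0])  # on the order of 1e30
print(vanished[0])  # on the order of 1e-30
```

Changing the scale only slightly, from 0.5 to 2, moves the output by about sixty orders of magnitude once it is compounded over 100 layers.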

If gradients explode, the weight updates become huge and gradient descent overshoots. If gradients vanish, the weights are updated very slowly and learning eventually stalls.

# How to solve it?

## Xavier Initialization

With each passing layer, we want the variance of the signal to remain the same. This keeps the signal from exploding to a high value or vanishing to zero. In other words, we need to initialize the weights so that a layer’s output $Y$ has the same variance as its input $X$.
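
To see where the usual scaling comes from, here is a short derivation under simplifying assumptions that the post does not spell out: a single linear unit $Y = \sum_{i=1}^{n_{in}} w_i X_i$, with weights and inputs independent, zero-mean, and identically distributed.

$$
\mathrm{Var}(Y) = \sum_{i=1}^{n_{in}} \mathrm{Var}(w_i X_i) = n_{in}\,\mathrm{Var}(w)\,\mathrm{Var}(X)
$$

Requiring $\mathrm{Var}(Y) = \mathrm{Var}(X)$ then gives $\mathrm{Var}(w) = 1/n_{in}$, where $n_{in}$ is the number of inputs to the unit.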

Xavier initialization takes only a few lines of Python with NumPy.
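
A minimal sketch of one way to write it, assuming the common variant that draws Gaussian weights with variance $1/n_{in}$ (the original Glorot and Bengio formulation uses $2/(n_{in}+n_{out})$ instead). The function name `xavier_init` and the layer sizes are illustrative and not taken from the post.

```python
import numpy as np

def xavier_init(n_in, n_out, rng=None):
    """Sample an (n_out, n_in) weight matrix with Var(W) = 1 / n_in."""
    rng = np.random.default_rng() if rng is None else rng
    # Standard normal samples have variance 1; scaling by sqrt(1 / n_in)
    # gives each weight a variance of 1 / n_in.
    return rng.standard_normal((n_out, n_in)) * np.sqrt(1.0 / n_in)

rng = np.random.default_rng(0)
W = xavier_init(512, 256, rng)
x = rng.standard_normal(512)   # unit-variance input
y = W @ x                      # output variance should stay close to 1

print(x.var(), y.var())
```

With this scaling the output variance stays close to the input variance, which is the property the section asks for.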

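‘He’ initialization is similar to Xavier initialization, but it uses a larger variance (typically $2/n_{in}$, where $n_{in}$ is the number of inputs to the layer), so the weights are drawn from a wider range. It is usually paired with ReLU activations.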
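
For comparison, here is a sketch of He initialization, assuming Gaussian weights with variance $2/n_{in}$. The name `he_init` and the layer sizes are hypothetical, chosen only for this example.

```python
import numpy as np

def he_init(n_in, n_out, rng=None):
    """Sample an (n_out, n_in) weight matrix with Var(W) = 2 / n_in."""
    rng = np.random.default_rng() if rng is None else rng
    return rng.standard_normal((n_out, n_in)) * np.sqrt(2.0 / n_in)

n_in, n_out = 512, 256
W_he = he_init(n_in, n_out, np.random.default_rng(0))

# Standard deviation a 1 / n_in Xavier-style scheme would use, for reference.
xavier_std = np.sqrt(1.0 / n_in)
ratio = W_he.std() / xavier_std   # expected to be close to sqrt(2)
print(ratio)
```

The extra factor of 2 in the variance is usually motivated by ReLU zeroing out roughly half of the units.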

## Batch Normalization

Normalizing the units in each layer keeps their values in a reasonable range, which helps prevent exploding/vanishing gradients.
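
The normalization step itself is small. Below is a sketch of a training-time forward pass only, assuming inputs of shape `(batch, features)`; the running statistics used at inference time and the backward pass are left out, and the name `batch_norm_forward` is illustrative.

```python
import numpy as np

def batch_norm_forward(z, gamma, beta, eps=1e-5):
    """Normalize each feature of z over the batch axis, then scale and shift."""
    mean = z.mean(axis=0)
    var = z.var(axis=0)
    z_hat = (z - mean) / np.sqrt(var + eps)  # zero mean, unit variance per feature
    return gamma * z_hat + beta              # learnable scale and shift

rng = np.random.default_rng(0)
z = 50.0 * rng.standard_normal((64, 4)) + 10.0   # badly scaled pre-activations
out = batch_norm_forward(z, gamma=np.ones(4), beta=np.zeros(4))

print(z.std(axis=0))    # large spread before normalization
print(out.std(axis=0))  # close to 1 after normalization
```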

## Using ReLU activation function

Weights are not the only cause of vanishing gradients; the activation function can cause them too. For example, if the absolute value of $z$ is large, the gradient of the sigmoid or tanh function is close to zero, which makes the overall gradient vanish.
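
A quick numerical check of this saturation, using the standard derivative formulas $\sigma'(z) = \sigma(z)(1-\sigma(z))$ and $\tanh'(z) = 1 - \tanh^2(z)$. The sample values of $z$ are arbitrary.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_grad(z):
    s = sigmoid(z)
    return s * (1.0 - s)

def tanh_grad(z):
    return 1.0 - np.tanh(z) ** 2

z = np.array([0.0, 2.0, 5.0, 10.0])
sg = sigmoid_grad(z)
tg = tanh_grad(z)

print(sg)  # starts at 0.25 for z = 0 and shrinks quickly as z grows
print(tg)  # starts at 1.0 for z = 0 and shrinks even faster
```

Because both derivatives are even functions of $z$, large negative values saturate in the same way.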

For a fuller explanation, I’ll quote from https://stats.stackexchange.com/a/240491:

> Each gradient update in backprop consists of a number of multiplied factors.
>
> The further you get towards the start of the network, the more of these factors are multiplied together to get the gradient update.
>
> Many of these factors are derivatives of the activation function of the neurons - the rest are weights, biases etc.
>
> If you multiply a bunch of terms which are less than 1, they will tend towards zero. Hence vanishing gradient as you get further from the output layer.
>
> If you multiply a bunch of terms which are greater than 1, they will tend towards infinity, hence exploding gradient as you get further from the output layer.
>
> The ideal then might be to somehow, magically, get these terms contributed by the derivative of the activation functions to be 1. This intuitively means that all the contributions come from the input to the problem and the model - the weights, inputs, biases - rather than some artefact of the activation function chosen.
>
> ReLU has gradient 1 when output > 0, and zero otherwise. Hence multiplying a bunch of RELU derivatives together in the backprop equations has the nice property of being either 1 or zero - the update is either nothing, or takes contributions entirely from the other weights and biases.
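
The product of activation derivatives described in the quote can be illustrated with a toy calculation. This sketch assumes a depth of 50 and looks only at the activation-derivative factors, ignoring the weight terms.

```python
import numpy as np

depth = 50

# Sigmoid's derivative is at most 0.25 (attained at z = 0), so even in the
# best case the product of its derivatives shrinks geometrically with depth.
sigmoid_best_case = 0.25 ** depth

# ReLU's derivative is exactly 1 for an active unit and 0 for an inactive one.
relu_active = np.prod(np.ones(depth))                    # every unit active
relu_one_dead = np.prod(np.r_[np.ones(depth - 1), 0.0])  # one inactive unit

print(sigmoid_best_case)  # vanishingly small
print(relu_active)        # 1.0
print(relu_one_dead)      # 0.0
```

On an active path the ReLU factors leave the gradient untouched, so its size is determined by the weights alone.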
