# Initialization of Weights


We need initial weights to start gradient descent, just like we need to be somewhere on a mountain to descend from it.

# First Problem

Consider one layer of a neural network: let $W$ be the weight matrix of that layer and $n$ the number of units in it.

Suppose we initialize the weights so that all $W_i$ (the $i$th column of $W$), for $i = 1, \dots, n$, are the same.

Then by forward propagation $Z^{[l]} = A^{[l-1]}W^{[l]}+b^{[l]}$, all $Z^{[l]}_i$ (the $i$th column of $Z^{[l]}$) will be the same.

All $A^{[l]}_i$ will be the same too, since the activation is applied element-wise.

With backpropagation, the columns of $dW^{[l]}$ will be duplicates as well, since $dW^{[l]} = \frac{1}{m} A^{[l-1]T} dZ^{[l]}$ and the columns of $dZ^{[l]}$ inherit the same symmetry.

Thus even after gradient descent steps, the $W_i$s stay the same, and so do the $A_i$s: every unit in the layer computes the same function, which is a waste of units.

### Example

Let’s assume a binary classification neural network with one hidden layer: number of examples $m=3$, number of features $n=2$, number of units in the hidden layer $n^{[1]}=3$.

Now let’s initialize $W$ so that its columns are duplicates.

As the activation function $g$ is applied element-wise, it doesn’t change the symmetry of a matrix, so let’s just assume $g$ is the identity function.

We forward propagate.

We backpropagate. We know that $\hat{Y}$ is composed of different values, so $dZ^{[2]}$ is composed of different values.

We can see that the columns of $dW^{[1]}$ are duplicates, and the entries of $dW^{[2]}$ are all equal.
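The example above can be checked numerically. Here is a minimal NumPy sketch with the stated shapes ($m=3$, $n=2$, $n^{[1]}=3$), assuming an identity activation in the hidden layer and a sigmoid output with cross-entropy loss; the particular numbers for $X$, $Y$, and the initial weights are made up for illustration.

```python
import numpy as np

np.random.seed(0)

m, n, n1 = 3, 2, 3                        # examples, features, hidden units
X = np.random.randn(m, n)                 # hypothetical input data
Y = np.array([[1.0], [0.0], [1.0]])       # hypothetical binary labels

# Symmetric initialization: every column of W1 is the same vector,
# and every entry of W2 is the same value.
W1 = np.tile(np.array([[0.5], [0.3]]), (1, n1))   # shape (n, n1)
b1 = np.zeros((1, n1))
W2 = np.full((n1, 1), 0.7)
b2 = np.zeros((1, 1))

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Forward propagation (g = identity in the hidden layer)
Z1 = X @ W1 + b1          # columns of Z1 are identical
A1 = Z1                   # so columns of A1 are identical
Z2 = A1 @ W2 + b2
Yhat = sigmoid(Z2)

# Backpropagation (cross-entropy loss with sigmoid output)
dZ2 = Yhat - Y            # entries differ across examples
dW2 = A1.T @ dZ2 / m      # all entries equal (rows of A1.T are identical)
dZ1 = (dZ2 @ W2.T) * 1.0  # g' = 1 for the identity activation
dW1 = X.T @ dZ1 / m       # columns are duplicates

print(np.allclose(dW1[:, 0], dW1[:, 1]) and np.allclose(dW1[:, 1], dW1[:, 2]))  # True
```

Since the gradient columns are duplicates, a gradient-descent update keeps the columns of $W^{[1]}$ identical, and the symmetry never breaks.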

# Second Problem

If we initialize the weights to large values, $Z$ will be large, so the gradient of the sigmoid or tanh activation will be close to 0. When the gradients of the activation functions are close to 0, gradient descent makes very small updates and learning slows down.

# Conclusion

To prevent these two problems from happening, we initialize the weights randomly (to break the symmetry) to small values close to 0 (to avoid saturating the activations).
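A minimal sketch of such an initialization in NumPy; the scale factor 0.01 is a common small default, not a prescription from this document:

```python
import numpy as np

rng = np.random.default_rng(42)

def init_layer(n_in, n_out, scale=0.01):
    """Random small weights break symmetry without saturating activations."""
    W = rng.standard_normal((n_in, n_out)) * scale
    b = np.zeros((1, n_out))  # bias can stay zero; randomness in W suffices
    return W, b

W1, b1 = init_layer(2, 3)   # e.g. the layer shapes from the example above
```

In practice the scale is often chosen from the layer's fan-in (e.g. Xavier or He initialization), but any small random initialization already avoids both problems described here.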
