Importance of Identity Skip Connections
First, let's look at why identity skip connections matter. To do that, we consider a variant architecture whose skip connections are not identity mappings.
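We modify our residual block as follows:

$$x_{l+1} = \lambda_l x_l + \mathcal{F}(x_l, \mathcal{W}_l)$$

where $\lambda_l$ is a modulating scalar applied to the shortcut, while $f$ is still the identity function.
Applying this recursively from block $l$ up to a deeper block $L$, we obtain the generalized form:

$$x_L = \left(\prod_{i=l}^{L-1} \lambda_i\right) x_l + \sum_{i=l}^{L-1} \hat{\mathcal{F}}(x_i, \mathcal{W}_i)$$

where $\hat{\mathcal{F}}$ absorbs the scalars into the residual functions. The gradient of the loss $\mathcal{E}$ then follows from backpropagation:

$$\frac{\partial \mathcal{E}}{\partial x_l} = \frac{\partial \mathcal{E}}{\partial x_L} \left( \prod_{i=l}^{L-1} \lambda_i + \frac{\partial}{\partial x_l} \sum_{i=l}^{L-1} \hat{\mathcal{F}}(x_i, \mathcal{W}_i) \right)$$

The product term $\prod_{i=l}^{L-1} \lambda_i$ can be very small or very large in a deep network: it grows exponentially with depth if every $\lambda_i > 1$ and shrinks exponentially if every $\lambda_i < 1$. This can cause the gradient through the shortcut to explode or vanish.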
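To get a feel for how quickly this happens, here is a minimal sketch that multiplies the modulating factors of the shortcuts across a stack of blocks. The function name `shortcut_gradient_factor`, the choice of 50 blocks, and the assumption that every block uses the same factor are illustrative only and are not taken from the experiments.

```python
import math


def shortcut_gradient_factor(lam, num_blocks):
    """Product of the shortcut scalars over num_blocks blocks.

    Assumption for illustration: every block uses the same scalar lam.
    """
    return math.prod([lam] * num_blocks)


if __name__ == "__main__":
    # Compare a shrinking, an identity, and a growing shortcut scalar.
    for lam in (0.5, 1.0, 1.5):
        factor = shortcut_gradient_factor(lam, 50)
        print(f"lambda = {lam}: factor over 50 blocks = {factor:.3e}")
```

With a factor of 0.5 the product over 50 blocks is roughly 9e-16, and with 1.5 it is roughly 6e8. Only a factor of exactly 1, the identity shortcut, leaves the gradient passed along the shortcut unscaled.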
To better understand the importance of identity skip connections, the following networks were built and their performance compared on the CIFAR-10 dataset:
Results are shown below:
We can tell from the result table that some of the models did not converge. Among the models that did converge, the original model performs best. This indicates that the shortcut connections are the most direct paths for information to propagate. Multiplicative manipulations on the shortcuts (scaling, gating, 1 × 1 convolutions, and dropout) can hamper information propagation and lead to optimization problems.