Analysis of Deep Residual Networks
Recall that the residual block is defined as:

$$y_l = h(x_l) + \mathcal{F}(x_l, \mathcal{W}_l)$$

$$x_{l+1} = f(y_l)$$

Here $x_l$ is the input feature to the $l$-th Residual Unit. $\mathcal{W}_l = \{W_{l,k} \mid 1 \le k \le K\}$ is a set of weights (and biases) associated with the $l$-th Residual Unit, and $K$ is the number of layers in a Residual Unit. $\mathcal{F}$ denotes the residual function, e.g., a stack of two 3×3 convolutional layers. The function $f$ is the operation after element-wise addition, and in the original ResNet $f$ is ReLU. The function $h$ is set as an identity mapping: $h(x_l) = x_l$.
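To make this concrete, here is a minimal PyTorch sketch of one such Residual Unit, with $h$ as the identity skip connection, $\mathcal{F}$ as a stack of two 3×3 convolutions, and $f$ as the after-addition ReLU. The class name `ResidualUnit`, the use of BatchNorm, and the channel count in the shape check are illustrative assumptions, not details taken from the text.

```python
import torch
import torch.nn as nn

class ResidualUnit(nn.Module):
    """One residual unit: y_l = h(x_l) + F(x_l, W_l), x_{l+1} = f(y_l).

    Sketch only: h is the identity mapping, F is a stack of two 3x3
    convolutions (with BatchNorm, an assumption here), and f is the
    after-addition ReLU.
    """

    def __init__(self, channels: int):
        super().__init__()
        # F(x_l, W_l): two 3x3 conv layers; padding=1 keeps the spatial size
        # so the element-wise addition with the identity branch is valid.
        self.residual_fn = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False),
            nn.BatchNorm2d(channels),
        )
        self.after_addition = nn.ReLU(inplace=True)  # f

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        y = x + self.residual_fn(x)      # h(x_l) = x_l  (identity skip)
        return self.after_addition(y)    # x_{l+1} = f(y_l)


# quick shape check with an assumed channel count and input size
if __name__ == "__main__":
    unit = ResidualUnit(channels=64)
    x = torch.randn(2, 64, 32, 32)
    print(unit(x).shape)  # torch.Size([2, 64, 32, 32])
```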
If $f$ is also an identity mapping, i.e., $x_{l+1} \equiv y_l$, then we have:

$$x_{l+1} = x_l + \mathcal{F}(x_l, \mathcal{W}_l)$$
and if we apply this recursively, then for any deeper unit $L$ and any shallower unit $l$ we have:

$$x_L = x_l + \sum_{i=l}^{L-1} \mathcal{F}(x_i, \mathcal{W}_i)$$
This generalized equation has the following properties:
- The feature $x_L$ of any deeper unit $L$ can be represented as the feature $x_l$ of any shallower unit $l$ plus a residual function in the form of $\sum_{i=l}^{L-1} \mathcal{F}$.
- The feature $x_L = x_0 + \sum_{i=0}^{L-1} \mathcal{F}(x_i, \mathcal{W}_i)$ of any deep unit $L$ is the summation of the outputs of all preceding residual functions (plus the input $x_0$).
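The summation property can be checked numerically. The sketch below is a toy setup, not the paper's network: small `nn.Linear` layers stand in for the residual functions $\mathcal{F}_i$, and $f$ is taken to be the identity so that the recursion $x_{i+1} = x_i + \mathcal{F}_i(x_i)$ holds exactly.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Each F_i is a small linear residual function; f is taken as the identity
# so that the recursion x_{i+1} = x_i + F_i(x_i) holds exactly.
num_units, dim = 5, 8
residual_fns = nn.ModuleList(nn.Linear(dim, dim) for _ in range(num_units))

x0 = torch.randn(1, dim)

# Forward pass through the stacked residual units, recording each F_i(x_i).
x = x0
residual_outputs = []
for F_i in residual_fns:
    out = F_i(x)
    residual_outputs.append(out)
    x = x + out
x_L = x

# The same feature, written as x_0 plus the sum of all residual outputs.
x_L_unrolled = x0 + torch.stack(residual_outputs).sum(dim=0)

print(torch.allclose(x_L, x_L_unrolled, atol=1e-6))  # True
```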
With this equation, we also get a nice backward propagation property. Denoting the loss function as $\mathcal{E}$, the chain rule gives:

$$\frac{\partial \mathcal{E}}{\partial x_l} = \frac{\partial \mathcal{E}}{\partial x_L} \frac{\partial x_L}{\partial x_l} = \frac{\partial \mathcal{E}}{\partial x_L} \left( 1 + \frac{\partial}{\partial x_l} \sum_{i=l}^{L-1} \mathcal{F}(x_i, \mathcal{W}_i) \right)$$
This suggests that the gradient can be decomposed into two parts:
- a term $\frac{\partial \mathcal{E}}{\partial x_L}$ that propagates information directly without concerning any weight layers, which ensures that information is directly propagated back to any shallower unit $l$;
- a term $\frac{\partial \mathcal{E}}{\partial x_L} \left( \frac{\partial}{\partial x_l} \sum_{i=l}^{L-1} \mathcal{F} \right)$ that propagates through the weight layers.
Note that the gradient $\frac{\partial \mathcal{E}}{\partial x_l}$ is unlikely to vanish, even when the weights are arbitrarily small, because the term $\frac{\partial}{\partial x_l} \sum_{i=l}^{L-1} \mathcal{F}$ is not likely to always be $-1$ for all samples in a mini-batch. This is a very desirable property.
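This decomposition can be verified with autograd. In the toy sketch below (again with `nn.Linear` layers standing in for $\mathcal{F}_i$, $f$ as the identity, and an assumed loss $\mathcal{E} = \tfrac{1}{2}\lVert x_L \rVert^2$), the full gradient $\partial\mathcal{E}/\partial x_l$ is compared against the sum of the direct term and the weight-path term, the latter computed as a vector-Jacobian product:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

num_units, dim = 4, 6
residual_fns = nn.ModuleList(nn.Linear(dim, dim) for _ in range(num_units))

x_l = torch.randn(1, dim, requires_grad=True)

# Forward: x_L = x_l + sum_i F_i(x_i), with f taken as the identity.
x, residual_sum = x_l, 0.0
for F_i in residual_fns:
    out = F_i(x)
    residual_sum = residual_sum + out
    x = x + out
x_L = x

# An assumed loss E = 0.5 * ||x_L||^2, so dE/dx_L = x_L.
loss = 0.5 * x_L.pow(2).sum()

# Full gradient dE/dx_l, keeping the graph so we can reuse it below.
(grad_x_l,) = torch.autograd.grad(loss, x_l, retain_graph=True)

# Decomposition: dE/dx_l = dE/dx_L * (1 + d(sum F)/dx_l)
grad_x_L = x_L.detach()                    # direct term, skips all weight layers
(weight_path,) = torch.autograd.grad(      # dE/dx_L * d(sum F)/dx_l  (a VJP)
    residual_sum, x_l, grad_outputs=grad_x_L
)

print(torch.allclose(grad_x_l, grad_x_L + weight_path, atol=1e-5))  # True
```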
These equations suggest that the signal can be directly propagated from any unit to any other, both forward and backward. However, two important conditions are required to ensure the above properties:
- the skip connection $h$ is an identity mapping: $h(x_l) = x_l$;
- the after-addition function $f$ is an identity mapping (note that in the original ResNet, $f$ is ReLU, so this condition does not hold there).
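As a tiny illustration of why the second condition matters: when $f$ is ReLU, negative entries of $y_l$ are clipped, so the clean recursion $x_{l+1} = x_l + \mathcal{F}(x_l, \mathcal{W}_l)$ used above no longer holds exactly. The `nn.Linear` residual function below is only a stand-in for illustration.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# A toy residual function and a random input with positive and negative values.
F_l = nn.Linear(4, 4)
x_l = torch.randn(1, 4)

y_l = x_l + F_l(x_l)           # identity skip connection: h(x_l) = x_l

x_next_identity = y_l          # f = identity -> x_{l+1} = x_l + F(x_l) holds
x_next_relu = torch.relu(y_l)  # f = ReLU     -> negative entries are clipped

print(x_next_identity)
print(x_next_relu)
print(torch.equal(x_next_identity, x_next_relu))  # generally False
```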
In the next chapters, we will dive deeper into these two conditions.