Analysis of Deep Residual Networks

Recall that the residual block is defined as:

$$y_l = h(x_l) + \mathcal{F}(x_l, W_l) \\ x_{l+1} = f(y_l)$$

Here $x_l$ is the input feature to the $l$-th Residual Unit. $W_l = \{W_{l,k}\,|\,1 \le k \le K\}$ is the set of weights (and biases) associated with the $l$-th Residual Unit, and $K$ is the number of layers in a Residual Unit. $\mathcal{F}$ denotes the residual function, e.g., a stack of two 3×3 convolutional layers in the original ResNet. The function $f$ is the operation applied after the element-wise addition; in the original ResNet, $f$ is ReLU. The function $h$ is set as an identity mapping: $h(x_l) = x_l$.
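To make the notation concrete, here is a minimal PyTorch sketch of one Residual Unit with identity $h$ and ReLU as $f$. The choice of two 3×3 convolutions and the omission of batch normalization are simplifications for illustration, not the exact architecture from the paper.

```python
import torch.nn as nn

class ResidualUnit(nn.Module):
    """One Residual Unit: x_{l+1} = f(h(x_l) + F(x_l, W_l)), with h = identity, f = ReLU."""

    def __init__(self, channels: int):
        super().__init__()
        # Residual function F(x_l, W_l): a stack of two 3x3 conv layers
        # (batch normalization omitted here for brevity)
        self.residual = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
        )
        self.f = nn.ReLU(inplace=True)  # the after-addition function f

    def forward(self, x):
        y = x + self.residual(x)  # y_l = h(x_l) + F(x_l, W_l), with h(x_l) = x_l
        return self.f(y)          # x_{l+1} = f(y_l)
```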

If $f$ is also an identity mapping, i.e., $x_{l+1} = y_l$, then we have:

$$x_{l+1} = x_l + \mathcal{F}(x_l, W_l)$$

and if we apply this recursively from any shallower unit $l$ up to any deeper unit $L$, we have:

$$x_L = x_l + \sum_{i=l}^{L-1} \mathcal{F}(x_i, W_i)$$

This generalized equation has the following properties:

  • The feature $x_L$ of any deeper unit $L$ can be represented as the feature $x_l$ of any shallower unit $l$ plus a residual function of the form $\sum_{i=l}^{L-1}\mathcal{F}$.
  • The feature $x_L = x_0 + \sum_{i=0}^{L-1}\mathcal{F}(x_i, W_i)$ of any deep unit $L$ is the summation of the outputs of all preceding residual functions (plus the input $x_0$); the sketch after this list checks this numerically.
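The following is a toy numerical check of the unrolled form, assuming identity $h$ and identity $f$. The scalar functions used for $\mathcal{F}$ here are arbitrary stand-ins for the convolutional residual branches of a real network.

```python
import math

# Arbitrary scalar residual functions F_0 .. F_3, standing in for conv stacks
residuals = [math.sin, math.tanh, lambda v: 0.1 * v ** 2, math.cos]

def forward(x0):
    """Run the chain x_{i+1} = x_i + F_i(x_i) and record every intermediate x_i."""
    xs = [x0]
    for F in residuals:
        xs.append(xs[-1] + F(xs[-1]))
    return xs

xs = forward(0.5)
l, L = 1, len(residuals)  # any shallower unit l and the deepest unit L

# x_L equals x_l plus the sum of all residual outputs between l and L
unrolled = xs[l] + sum(residuals[i](xs[i]) for i in range(l, L))
assert math.isclose(xs[L], unrolled)
```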

This equation also gives us a nice backward propagation property. Denoting the loss function as $\varepsilon$ and applying the chain rule, we have:

$$\frac{\partial \varepsilon}{\partial x_l} = \frac{\partial \varepsilon}{\partial x_L} \frac{\partial x_L}{\partial x_l} = \frac{\partial \varepsilon}{\partial x_L} \left(1 + \frac{\partial}{\partial x_l} \sum_{i=l}^{L-1}\mathcal{F}(x_i, W_i)\right)$$

This suggests that the gradient can be decomposed into two parts:

  • a term $\frac{\partial \varepsilon}{\partial x_L}$ that propagates information directly, without going through any weight layers, and thus ensures that information is propagated directly back to any shallower unit $l$;
  • a term $\frac{\partial \varepsilon}{\partial x_L}\left(\frac{\partial}{\partial x_l} \sum_{i=l}^{L-1}\mathcal{F}(x_i, W_i)\right)$ that propagates through the weight layers.

Note that the gradient $\frac{\partial \varepsilon}{\partial x_l}$ is unlikely to vanish, because the term $\frac{\partial}{\partial x_l} \sum_{i=l}^{L-1}\mathcal{F}(x_i, W_i)$ cannot always be $-1$ for all samples in a mini-batch. This is a very desirable property.
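The decomposition can be checked numerically with autograd. Below is a small sketch, again with scalar toy residual functions and identity $h$ and $f$; the loss and weights are chosen arbitrarily for illustration.

```python
import torch

def F(x, w):
    """Toy scalar residual function standing in for a stack of weight layers."""
    return torch.tanh(w * x)

weights = [torch.tensor(0.7), torch.tensor(-1.3), torch.tensor(0.4)]
x_l = torch.tensor(0.5, requires_grad=True)

# Forward through the chain x_{i+1} = x_i + F(x_i, w_i),
# so x_L = x_l + sum_{i=l}^{L-1} F(x_i, w_i)
x = x_l
for w in weights:
    x = x + F(x, w)
x_L = x

eps = (x_L - 1.0) ** 2  # an arbitrary loss eps(x_L)

# Left-hand side: d eps / d x_l, computed directly by autograd
(lhs,) = torch.autograd.grad(eps, x_l, retain_graph=True)

# Right-hand side: (d eps / d x_L) * (1 + d/dx_l sum F),
# using the fact that d x_L / d x_l = 1 + d/dx_l sum F
(d_xL_d_xl,) = torch.autograd.grad(x_L, x_l)
d_eps_d_xL = 2 * (x_L.detach() - 1.0)
rhs = d_eps_d_xL * d_xL_d_xl

assert torch.allclose(lhs, rhs)  # the gradient decomposition holds
```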

These equations suggest that the signal can be propagated directly from any unit to any other, both forward and backward. However, two important conditions are required for the above properties to hold:

  • the skip connection $h$ is an identity mapping: $h(x_l) = x_l$;
  • the after-addition function $f$ is an identity mapping (in the original ResNet, $f$ is ReLU, so this condition is only approximately satisfied).

In the next chapters, we will dive deeper into these two conditions.