Back-propagation Algorithm

The goal is to minimize the following loss function with the stochastic gradient descent (SGD) algorithm:

\mathcal{L}\left(y, f_{L}\right)=\left(y-f_{L}\right)^{2}

Let \delta_{i}=\frac{\partial \mathcal{L}}{\partial z_{i}}. With tanh units, \frac{\partial f_{i}}{\partial z_{i}}=1-f_{i}^{2}, so unrolling the chain rule from the output down to w_{1} gives

\frac{\partial \mathcal{L}}{\partial w_{1}}=x\left(1-f_{1}^{2}\right)\left(1-f_{2}^{2}\right) \cdots\left(1-f_{L}^{2}\right) w_{2} w_{3} \cdots w_{L}\left(2\left(f_{L}-y\right)\right)
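The same recursion can be written in code. The sketch below assumes a scalar chain of tanh units with z_{i}=w_{i} f_{i-1}, f_{i}=\tanh(z_{i}), f_{0}=x (this structure is implicit in the formula above but not spelled out here); the helper names forward and backward are illustrative.

import numpy as np

def forward(ws, x):
    """Forward pass through the chain; returns the activations f_1, ..., f_L."""
    fs, f = [], x
    for w in ws:
        f = np.tanh(w * f)                      # f_i = tanh(z_i), z_i = w_i * f_{i-1}
        fs.append(f)
    return fs

def backward(ws, x, y, fs):
    """Returns dL/dw_i for every layer via delta_i = dL/dz_i."""
    grads = [0.0] * len(ws)
    dL_df = 2.0 * (fs[-1] - y)                  # dL/df_L for L(y, f_L) = (y - f_L)^2
    for i in reversed(range(len(ws))):
        delta = dL_df * (1.0 - fs[i] ** 2)      # delta_i = dL/dz_i, since tanh'(z_i) = 1 - f_i^2
        grads[i] = delta * (fs[i - 1] if i > 0 else x)   # dL/dw_i = delta_i * f_{i-1}
        dL_df = delta * ws[i]                   # propagate dL/df_{i-1} to the layer below
    return grads

For i = 1 this recursion telescopes into exactly the product above: grads[0] = x (1-f_1^2) ... (1-f_L^2) w_2 ... w_L * 2(f_L - y).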

SGD Convergence Guarantees

  • For multi-layer neural networks, SGD is not guaranteed to reach a global optimum (an illustrative SGD loop is sketched after this list)
  • Larger models tend to be easier to train because their many units only need to be adjusted so that they are, collectively, sufficient to solve the task
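As an illustration of the stochastic updates, here is a minimal SGD loop built on the forward/backward sketch above; the toy data, chain depth, and learning rate are placeholder assumptions, and the loop only reaches a stationary point, with no global-optimality guarantee.

rng = np.random.default_rng(0)
ws = list(rng.normal(scale=0.5, size=4))         # a 4-unit tanh chain (arbitrary depth)
data = [(0.5, 0.3), (-0.2, -0.1), (1.0, 0.8)]    # toy (x, y) pairs, placeholder values
lr = 0.05                                        # learning rate (placeholder)

for step in range(1000):
    x, y = data[rng.integers(len(data))]         # sample one example: the stochastic step
    fs = forward(ws, x)
    grads = backward(ws, x, y, fs)
    ws = [w - lr * g for w, g in zip(ws, grads)] # SGD update: w_i <- w_i - lr * dL/dw_i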