The goal is to minimize the following loss function using the stochastic gradient descent (SGD) algorithm:

Let $\{(x^{(i)}, y^{(i)})\}_{i=1}^{m}$ denote the training set and $f(x; \theta)$ the model's output. The loss is the average per-example loss over the training set,

$$J(\theta) = \frac{1}{m} \sum_{i=1}^{m} L\big(f(x^{(i)}; \theta),\, y^{(i)}\big),$$

and each SGD step updates $\theta \leftarrow \theta - \epsilon\, \hat{g}$, where $\hat{g}$ is the gradient of the loss averaged over a sampled minibatch and $\epsilon$ is the learning rate.
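To make the update concrete, here is a minimal sketch of minibatch SGD, assuming a linear model $f(x; \theta) = x^\top \theta$ with squared-error loss; the synthetic data and the hyperparameters (learning rate 0.1, batch size 32) are illustrative choices, not taken from the notes.

```python
# Minimal minibatch SGD sketch: linear model, squared error, synthetic data.
import numpy as np

rng = np.random.default_rng(0)
m, d = 1000, 5
X = rng.normal(size=(m, d))
theta_true = rng.normal(size=d)
y = X @ theta_true + 0.1 * rng.normal(size=m)   # noisy targets

theta = np.zeros(d)           # parameters to learn
lr, batch_size = 0.1, 32      # illustrative step size and minibatch size

for step in range(500):
    idx = rng.integers(0, m, size=batch_size)           # sample a minibatch
    Xb, yb = X[idx], y[idx]
    grad = 2.0 / batch_size * Xb.T @ (Xb @ theta - yb)  # gradient of minibatch loss
    theta -= lr * grad                                  # SGD update

print("parameter error:", np.linalg.norm(theta - theta_true))
```

Because each step uses only a minibatch, the gradient $\hat{g}$ is a noisy but unbiased estimate of the full-batch gradient, which is what makes the per-step cost independent of the training set size.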
SGD convergence guarantees
- For multi-layer neural networks, SGD is not guaranteed to reach a global optimum: the loss is non-convex in the parameters, so gradient steps can converge to a local minimum that depends on the initialization (see the sketch after this list)
- Larger models tend to be easier to train because each unit only needs to be adjusted so that the units are, collectively, sufficient to solve the task
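As a toy illustration of the first point (not from the original notes), plain gradient descent on the one-dimensional non-convex function $f(t) = t^4 - 3t^2 + t$ converges to a different minimum depending on where it starts:

```python
# Non-convexity demo: the same gradient-descent procedure lands in
# different minima of f(t) = t^4 - 3 t^2 + t depending on the start.
def f(t):
    return t**4 - 3 * t**2 + t

def grad_f(t):
    return 4 * t**3 - 6 * t + 1

for t0 in (2.0, -2.0):            # two starting points, two different basins
    t = t0
    for _ in range(1000):
        t -= 0.01 * grad_f(t)     # plain gradient step
    print(f"start {t0:+.1f} -> t = {t:+.4f}, f(t) = {f(t):+.4f}")

# start +2.0 ends near the local minimum  (t ~ +1.13, f ~ -1.07);
# start -2.0 ends near the global minimum (t ~ -1.30, f ~ -3.51)
```

The same phenomenon in high-dimensional neural-network losses is why SGD carries no global-optimum guarantee, even though it often finds solutions of low loss in practice.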