Gradient Descent

Our goal is to find \theta that minimizes:

J\left(\theta,\theta_0\right)=\frac{1}{n}\sum_{i=1}^{n}{\rm Loss}_h\left(y^{(i)}\left(\theta\cdot x^{(i)}+\theta_0\right)\right)+\frac{\lambda}{2}\left\|\theta\right\|^2
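
As a concrete reference, here is a minimal NumPy sketch of this objective, assuming the hinge loss {\rm Loss}_h(z)=\max(0,1-z); the function names and the arrays X (an n-by-d feature matrix) and y (labels in \{-1,+1\}) are illustrative assumptions, not something fixed by these notes.

```python
import numpy as np

def hinge_loss(z):
    """Hinge loss Loss_h(z) = max(0, 1 - z), applied elementwise."""
    return np.maximum(0.0, 1.0 - z)

def objective(theta, theta_0, X, y, lam):
    """Regularized average hinge loss J(theta, theta_0).

    X: (n, d) feature matrix, y: (n,) labels in {-1, +1}, lam: lambda.
    """
    margins = y * (X @ theta + theta_0)   # y^(i) (theta . x^(i) + theta_0)
    return hinge_loss(margins).mean() + 0.5 * lam * np.dot(theta, theta)
```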

Through gradient descent we will:
Start at an arbitrary location, \theta\gets\theta_{\mathrm{start}}, then repeatedly update \theta\gets\theta-\eta\frac{\partial J\left(\theta,\theta_0\right)}{\partial\theta} until the change becomes insignificant.

\theta moves toward the origin. If we increase the step size \eta, the magnitude of the change in each update gets larger.
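
A minimal sketch of this update loop, assuming grad_J returns \nabla_\theta J\left(\theta,\theta_0\right) (for example, built from the objective above); the stopping tolerance tol and the cap max_iters are illustrative choices rather than part of the notes.

```python
import numpy as np

def gradient_descent(grad_J, theta_start, eta=0.1, tol=1e-6, max_iters=10_000):
    """Repeat theta <- theta - eta * grad_J(theta) until the change is insignificant."""
    theta = theta_start
    for _ in range(max_iters):
        step = eta * grad_J(theta)       # magnitude of the change scales with eta
        theta = theta - step
        if np.linalg.norm(step) < tol:   # change became insignificant: stop
            break
    return theta
```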

Stochastic Gradient Descent

SGD with hinge loss:

\begin{aligned} J\left(\theta,\theta_0\right)&=\frac{1}{n}\sum_{i=1}^{n}{\rm Loss}_h\left(y^{(i)}\left(\theta\cdot x^{(i)}+\theta_0\right)\right)+\frac{\lambda}{2}\left\|\theta\right\|^2 \\ &=\frac{1}{n}\sum_{i=1}^{n}\left[{\rm Loss}_h\left(y^{(i)}\left(\theta\cdot x^{(i)}+\theta_0\right)\right)+\frac{\lambda}{2}\left\|\theta\right\|^2\right] \end{aligned}

With SGD, we randomly choose i\in\{1,\ldots,n\} and update \theta such that:

\theta\gets\theta-\eta\nabla_\theta\left[{\rm Loss}_h\left(y^{\left(i\right)}\left(\theta\cdot x^{\left(i\right)}+\theta_0\right)\right)+\frac{\lambda}{2}\left\|\theta\right\|^2\right]

If {\rm Loss}_h\left(y^{(i)}\left(\theta\cdot x^{(i)}+\theta_0\right)\right)>0, then: \nabla_\theta{\rm Loss}_h\left(y^{\left(i\right)}\left(\theta\cdot x^{\left(i\right)}+\theta_0\right)\right)=-y^{\left(i\right)}x^{\left(i\right)}
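
Combining this subgradient with the gradient of the regularization term, a hedged sketch of one SGD step might look as follows; the function name sgd_step and the choice to leave \theta_0 unchanged are assumptions for illustration.

```python
import numpy as np

def sgd_step(theta, theta_0, x_i, y_i, eta, lam):
    """One SGD update on a randomly chosen example (x_i, y_i), with y_i in {-1, +1}."""
    margin = y_i * (np.dot(theta, x_i) + theta_0)
    # Hinge term: subgradient is -y^(i) x^(i) when the loss is positive, else 0.
    grad_loss = -y_i * x_i if margin < 1 else np.zeros_like(theta)
    grad = grad_loss + lam * theta            # plus gradient of (lambda/2) ||theta||^2
    return theta - eta * grad
```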
