Linear Regression

By definition, in regression the observed value y is a real number (continuous), whereas in classification y is discrete. The predictor f, which tries to emulate/predict y, is defined as f(x)=\sum_{i=1}^{d}\theta_i x_i+\theta_0
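As a minimal sketch of this predictor (the function name predict, the use of NumPy, and the example numbers are my own choices, not from the text):

```python
import numpy as np

# f(x) = sum_i theta_i * x_i + theta_0
def predict(x, theta, theta_0):
    return float(np.dot(theta, x)) + theta_0

# Hypothetical values: theta = [1, 2], theta_0 = 0.5, x = [3, -1]
print(predict(np.array([3.0, -1.0]), np.array([1.0, 2.0]), 0.5))  # 1*3 + 2*(-1) + 0.5 = 1.5
```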

Compute Hinge Loss

The empirical risk is defined as:

R_n\left(\theta\right)=\frac{1}{n}\sum_{t=1}^{n}\mathrm{Loss}\left(y^{\left(t\right)}-\theta\cdot x^{\left(t\right)}\right)

Where \left(x^{(t)},y^{(t)}\right) is the t-th training example (and there are n in total), and Loss is some loss function, such as the hinge loss. The hinge loss is defined as:

\operatorname{Loss}_{h}(z)=\begin{cases}0 & \text{if } z \geq 1 \\ 1-z & \text{otherwise}\end{cases}
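A minimal sketch of this definition in Python (the function name hinge_loss and the sample inputs are my own choices):

```python
# Hinge loss: 0 if z >= 1, otherwise 1 - z
def hinge_loss(z):
    return 0.0 if z >= 1 else 1.0 - z

print(hinge_loss(1.5))   # 0.0  (z >= 1)
print(hinge_loss(0.3))   # 0.7
print(hinge_loss(-1.0))  # 2.0
```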

Example:

Given \theta=[0,1,2]^T
Compute: R_n\left(\theta\right)=\frac{1}{4}\sum_{t=1}^{4}{\rm Loss}_h\left(y^{\left(t\right)}-\theta\cdot x^{\left(t\right)}\right)
Evaluating the hinge loss on each of the four training examples and averaging the results:

Hence: R_n(\theta)=\frac{1}{4}\sum_{t=1}^{4}{\rm Loss}_h\left(y^{(t)}-\theta\cdot x^{(t)}\right)=1.25
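The individual training points are not reproduced above; the sketch below uses a hypothetical set of four examples, chosen only so that the computation lands on the stated value of 1.25, to illustrate how the average hinge loss is evaluated:

```python
import numpy as np

def hinge_loss(z):
    # 0 if z >= 1, otherwise 1 - z (elementwise)
    return np.where(z >= 1, 0.0, 1.0 - z)

# Hypothetical training set (not taken from the original text)
X = np.array([[ 1.0, 0.0,  1.0],
              [ 1.0, 1.0,  1.0],
              [ 1.0, 1.0, -1.0],
              [-1.0, 1.0,  1.0]])
y = np.array([2.0, 2.7, -0.7, 2.0])
theta = np.array([0.0, 1.0, 2.0])

z = y - X @ theta              # y^(t) - theta . x^(t) for each example
print(hinge_loss(z).mean())    # 1.25
```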

Geometrically Identifying Error

In this example, the structural error occurs because the true underlying relationship is non-linear while the regression function is linear.

The larger the training set is, the smaller the estimation error will be. The structural error arises when the true underlying relationship is highly non-linear, so it cannot be reduced by increasing n.

Obtaining zero empirical risk on a large amount of data suggests that the model may be overfitting.

Necessary and Sufficient Condition for a Solution

Computing the gradient of:

R_n\left(\theta\right)=\frac{1}{n}\sum_{t=1}^{n}\frac{\left(y^{\left(t\right)}-\theta\cdot x^{\left(t\right)}\right)^2}{2}

We get:
\nabla R_n\left(\theta\right)=A\theta-b\ (=0),\ \mathrm{where}\ A=\frac{1}{n}\sum_{t=1}^{n}x^{\left(t\right)}\left(x^{\left(t\right)}\right)^T,\quad b=\frac{1}{n}\sum_{t=1}^{n}y^{\left(t\right)}x^{\left(t\right)}

For any square matrix A, A\theta-b=0 has a unique solution \theta=A^{-1}b if and only if A is invertible.
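As a sketch of how this plays out numerically (the synthetic data, variable names, and dimensions below are my own choices), we can build A and b from a training set and solve A\theta=b directly; np.linalg.solve raises an error if A is singular:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 100, 3
X = rng.normal(size=(n, d))                   # row t is x^(t)
true_theta = np.array([1.0, -2.0, 0.5])
y = X @ true_theta + 0.1 * rng.normal(size=n)

A = X.T @ X / n                               # (1/n) sum_t x^(t) (x^(t))^T
b = X.T @ y / n                               # (1/n) sum_t y^(t) x^(t)

theta_hat = np.linalg.solve(A, b)             # unique solution when A is invertible
print(theta_hat)                              # approximately [1, -2, 0.5]
```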

Regularization: extreme case 1

If we define the regularized objective:

J_{n,\lambda}\left(\theta,\theta_0\right)=\frac{1}{n}\sum_{t=1}^{n}\frac{\left(y^{\left(t\right)}-\theta\cdot x^{\left(t\right)}-\theta_0\right)^2}{2}+\frac{\lambda}{2}\left\|\theta\right\|^2

where \lambda is the regularization factor.

As \lambda increases to infinity, the regularization term dominates, so minimizing J becomes equivalent to minimizing \|\theta\|. Thus \theta is driven to the zero vector, and f(x)=\theta\cdot x+\theta_0 reduces to f(x)=\theta_0, a horizontal line. Thus f converges to line 1.
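A minimal sketch of this limit, using the closed-form minimizer of J_{n,\lambda} on mean-centered data (a standard derivation; the function ridge_fit, the synthetic data, and the \lambda values are my own choices): as \lambda grows, \theta shrinks toward the zero vector and the prediction flattens to the constant \theta_0.

```python
import numpy as np

def ridge_fit(X, y, lam):
    # theta_0 is not penalized, so center the data, solve for theta,
    # then recover theta_0 = y_bar - theta . x_bar.
    x_bar, y_bar = X.mean(axis=0), y.mean()
    Xc, yc = X - x_bar, y - y_bar
    n, d = X.shape
    theta = np.linalg.solve(Xc.T @ Xc / n + lam * np.eye(d), Xc.T @ yc / n)
    return theta, y_bar - theta @ x_bar

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2))
y = X @ np.array([3.0, -1.0]) + 2.0 + 0.1 * rng.normal(size=50)

for lam in [0.0, 1.0, 100.0, 1e6]:
    theta, theta_0 = ridge_fit(X, y, lam)
    print(lam, np.round(theta, 4), round(theta_0, 4))
# As lam grows, theta -> [0, 0] and f(x) -> theta_0.
```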
