Linear Regression

By definition, in regression the observed value y is a real (continuous) number, whereas in classification y is discrete. The predictor f, which tries to predict y, is defined as $f(x)=\sum_{i=1}^{d}\theta_i x_i+\theta_0$

Compute Hinge Loss

The empirical risk is defined as:

$R_n\left(\theta\right)=\frac{1}{n}\sum_{t=1}^{n}\operatorname{Loss}\left(y^{\left(t\right)}-\theta\cdot x^{\left(t\right)}\right)$

where $\left(x^{(t)},y^{(t)}\right)$ is the $t$-th training example (there are $n$ in total) and Loss is some loss function, such as the hinge loss. The hinge loss is defined as:

$\operatorname{Loss}_{h}(z)=\left\{\begin{array}{ll}0, & \text{if } z \geq 1 \\ 1-z, & \text{otherwise}\end{array}\right.$
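The two definitions above can be sketched in a few lines of NumPy. This is a minimal illustration, not the notes' own worked example: the function names `hinge_loss` and `empirical_risk` and the small data set are assumptions introduced here.

```python
import numpy as np

def hinge_loss(z):
    """Hinge loss: 0 if z >= 1, otherwise 1 - z (applied elementwise)."""
    return np.maximum(0.0, 1.0 - np.asarray(z, dtype=float))

def empirical_risk(theta, X, y):
    """Average hinge loss of the residuals y^(t) - theta . x^(t)."""
    residuals = y - X @ theta
    return float(np.mean(hinge_loss(residuals)))

# Hypothetical data (not from the text): three examples in two dimensions.
X = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [1.0, 1.0]])
y = np.array([2.0, 0.5, 1.0])
theta = np.array([1.0, 1.0])

# Residuals are [1, -0.5, -1], so the losses are [0, 1.5, 2].
risk = empirical_risk(theta, X, y)
print(risk)
```

Writing the loss as `max(0, 1 - z)` is equivalent to the piecewise definition, because `1 - z` is non-positive exactly when `z >= 1`.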

Example:

Given $\theta=[0,1,2]^T$
Compute: $R_n\left(\theta\right)=\frac{1}{4}\sum_{t=1}^{4}\operatorname{Loss}_h\left(y^{\left(t\right)}-\theta\cdot x^{\left(t\right)}\right)$
We can calculate:

Hence: $R_n(\theta)=\frac{1}{4}\sum_{t=1}^{4}\operatorname{Loss}_h\left(y^{(t)}-\theta\cdot x^{(t)}\right)=1.25$

Geometrically Identifying Error

Here, the structural error occurs because the true underlying relationship is non-linear but the regression function is linear.

The larger the training set, the smaller the estimation error. The structural error, by contrast, arises because the true underlying relationship is non-linear and cannot be captured by a linear predictor, so increasing $n$ does not reduce it.
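A small simulation can make this distinction concrete. The sketch below is an assumption-laden illustration, not part of the original exercise: the target $y=x^2$, the noise level, the seed, and the helper name `linear_fit_test_error` are all chosen here for demonstration. It fits a line by least squares and measures mean squared error against the noise-free target.

```python
import numpy as np

def linear_fit_test_error(n, rng):
    """Fit y ~ theta * x + theta_0 on n noisy samples of y = x^2,
    then return the mean squared error against the noise-free target
    on a large fresh test set."""
    x = rng.uniform(-1.0, 1.0, size=n)
    y = x**2 + 0.1 * rng.standard_normal(n)
    design = np.column_stack([x, np.ones(n)])  # columns: x, constant
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    x_test = rng.uniform(-1.0, 1.0, size=20000)
    pred = coef[0] * x_test + coef[1]
    return float(np.mean((pred - x_test**2) ** 2))

rng = np.random.default_rng(0)  # fixed seed, chosen arbitrarily
errors = {n: linear_fit_test_error(n, rng) for n in (5, 50, 500, 10000)}
for n, err in errors.items():
    print(n, err)
```

For this particular setup the best possible line is $f(x)=1/3$, whose squared error is $4/45\approx 0.089$. The error for large $n$ settles near that floor instead of going to zero, which is the structural error; the extra error at small $n$ is the estimation error.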

Obtaining zero empirical risk on a large amount of data suggests that the model may be overfitted.

Necessary and Sufficient Condition for a Solution

Computing the gradient of:

$R_n\left(\theta\right)=\frac{1}{n}\sum_{t=1}^{n}\frac{\left(y^{\left(t\right)}-\theta\cdot x^{\left(t\right)}\right)^2}{2}$

We get:
$\nabla R_n\left(\theta\right)=A\theta-b\ \left(=0\right)\quad\text{where}\quad A=\frac{1}{n}\sum_{t=1}^{n}x^{\left(t\right)}\left(x^{\left(t\right)}\right)^T,\quad b=\frac{1}{n}\sum_{t=1}^{n}y^{\left(t\right)}x^{\left(t\right)}$
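
The intermediate steps, written out as a sketch: differentiating a single term with the chain rule gives

$\nabla_\theta\frac{\left(y^{(t)}-\theta\cdot x^{(t)}\right)^2}{2}=-\left(y^{(t)}-\theta\cdot x^{(t)}\right)x^{(t)}$

Averaging over the $n$ examples and using $\left(\theta\cdot x^{(t)}\right)x^{(t)}=x^{(t)}\left(x^{(t)}\right)^T\theta$:

$\nabla R_n(\theta)=-\frac{1}{n}\sum_{t=1}^{n}y^{(t)}x^{(t)}+\frac{1}{n}\sum_{t=1}^{n}x^{(t)}\left(x^{(t)}\right)^T\theta=-b+A\theta$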

For any square matrix $A$, the equation $A\theta-b=0$ has a unique solution $\theta=A^{-1}b$ if and only if $A$ is invertible.
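
As a rough numerical illustration of this condition, the sketch below builds $A$ and $b$ from a data matrix and solves $A\theta=b$ only when $A$ has full rank. The function name `solve_normal_equations` and both data sets are hypothetical, introduced here for demonstration only.

```python
import numpy as np

def solve_normal_equations(X, y):
    """Return the unique theta solving A theta = b, or None when A is singular.

    A = (1/n) sum x x^T and b = (1/n) sum y x, with the examples as rows of X.
    """
    n, d = X.shape
    A = X.T @ X / n
    b = X.T @ y / n
    if np.linalg.matrix_rank(A) < d:
        return None  # A is not invertible, so there is no unique solution
    return np.linalg.solve(A, b)

# Hypothetical data where y = 1 * x_1 + 2 * x_2 exactly.
X = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [1.0, 1.0]])
y = np.array([1.0, 2.0, 3.0])
theta = solve_normal_equations(X, y)

# Second feature is twice the first, so A has rank 1 and is not invertible.
X_dup = np.array([[1.0, 2.0],
                  [2.0, 4.0],
                  [3.0, 6.0]])
theta_dup = solve_normal_equations(X_dup, y)
print(theta, theta_dup)
```

Using `np.linalg.solve` instead of forming $A^{-1}$ explicitly is a common numerical choice; the two agree mathematically whenever $A$ is invertible.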

Regularization: extreme case 1

If we define the loss function:

$J_{n,\lambda}\left(\theta,\theta_0\right)=\frac{1}{n}\sum_{t=1}^{n}\frac{\left(y^{\left(t\right)}-\theta\cdot x^{\left(t\right)}-\theta_0\right)^2}{2}+\frac{\lambda}{2}\|\theta\|^2$

where $\lambda$ is the regularization factor.

If we increase $\lambda$ to infinity, minimizing $J$ becomes equivalent to minimizing $\|\theta\|$, so $\theta$ must be the zero vector. Then $f(x)=\theta\cdot x+\theta_0$ becomes $f(x)=\theta_0$, a horizontal line. Thus $f$ converges to line 1.
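
This limiting behaviour can be checked numerically. The following is a minimal sketch under stated assumptions: `ridge_fit` is a hypothetical helper that minimizes $J_{n,\lambda}$ in closed form by centering the data (setting the derivative with respect to $\theta_0$ to zero gives $\theta_0=\bar{y}-\theta\cdot\bar{x}$), and the four data points are invented for illustration.

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Minimize (1/n) sum (y - theta.x - theta_0)^2 / 2 + (lam/2) ||theta||^2.

    theta_0 is not penalized, so it is eliminated by centering X and y.
    """
    n, d = X.shape
    x_mean = X.mean(axis=0)
    y_mean = y.mean()
    Xc = X - x_mean
    yc = y - y_mean
    A = Xc.T @ Xc / n + lam * np.eye(d)
    b = Xc.T @ yc / n
    theta = np.linalg.solve(A, b)
    theta_0 = y_mean - theta @ x_mean
    return theta, theta_0

# Hypothetical data lying exactly on y = 2x + 1.
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([1.0, 3.0, 5.0, 7.0])

results = {lam: ridge_fit(X, y, lam) for lam in (0.0, 1.0, 100.0, 1e6)}
for lam, (theta, theta_0) in results.items():
    print(lam, theta, theta_0)
```

With $\lambda=0$ the fit recovers the slope and offset of the data; as $\lambda$ grows, the slope shrinks toward zero and $\theta_0$ approaches the mean of the $y$ values, so the fitted line flattens into a horizontal one.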