Linear Classifiers

Inner Product and Orthogonal Vectors

A linear classifier h is defined by h\left(x;\theta\right)=sign\left(\theta\cdot x\right), where sign returns -1 or +1 and \theta\cdot x is the inner product of \theta and x.

Linear classifier with an offset (general case): h\left(x;\theta,\theta_0\right)=sign\left(\theta\cdot x+\theta_0\right).

For the i-th training example \left(x^{(i)},y^{(i)}\right), where x^{(i)} is a feature vector and y^{(i)}\in\left\{-1,+1\right\} is its label, sign\left(\theta\cdot x^{(i)}\right) is the classifier's output.

If y^{(i)}\left(\theta\cdot x^{(i)}\right)>0, the classifier's output matches the label; otherwise the example is misclassified.
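
A minimal sketch of this check in Python/NumPy (the feature vector, label, and parameter values below are made up for illustration, and the convention that a zero score maps to -1 is my own choice):

```python
import numpy as np

def predict(x, theta, theta_0=0.0):
    """Linear classifier with offset: returns +1 or -1 (ties at 0 go to -1 here)."""
    return 1 if theta @ x + theta_0 > 0 else -1

# Agreement check for a single training example (x_i, y_i):
x_i = np.array([2.0, -1.0])   # illustrative feature vector
y_i = +1                      # illustrative label
theta = np.array([1.0, 1.0])
theta_0 = 0.0

agreement = y_i * (theta @ x_i + theta_0)
print(predict(x_i, theta, theta_0))   # +1
print(agreement > 0)                  # True: the prediction matches the label
```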

Perceptron Algorithm

The perceptron algorithm takes T (the number of passes over the training set) and the training set as input, and aims to learn parameters \theta,\theta_0 that classify every training example correctly (in the offset case, y^{(i)}\left(\theta\cdot x^{(i)}+\theta_0\right)>0 for all i). Whenever an example is misclassified, the algorithm updates \theta\gets\theta+y^{(i)}x^{(i)} and \theta_0\gets\theta_0+y^{(i)}. Comparing the agreement after and before such an update:

y^{(i)}\left(\left(\theta+y^{(i)}x^{(i)}\right)\cdot x^{(i)}+\theta_0+y^{(i)}\right)-y^{(i)}\left(\theta\cdot x^{(i)}+\theta_0\right)=\left(y^{(i)}\right)^2\parallel x^{(i)}\parallel^2+\left(y^{(i)}\right)^2=\left(y^{(i)}\right)^2\left(\parallel x^{(i)}\parallel^2+1\right)>0

so the agreement after the update is always greater than before it. Since an example is classified correctly exactly when this agreement is positive, each update moves the parameters toward correctly classifying the example that triggered it, which is what we want.
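
A sketch of the perceptron with offset in Python/NumPy (the function name, the zero initialization, and the convention of counting agreement \le 0 as a mistake are my own assumptions; T is the number of passes, as above):

```python
import numpy as np

def perceptron(X, y, T=100):
    """Perceptron with offset.

    X: (n, d) array of feature vectors, y: (n,) array of labels in {-1, +1},
    T: number of passes over the training set. Returns (theta, theta_0).
    """
    n, d = X.shape
    theta = np.zeros(d)
    theta_0 = 0.0
    for _ in range(T):
        for i in range(n):
            # Mistake: non-positive agreement y_i * (theta . x_i + theta_0)
            if y[i] * (X[i] @ theta + theta_0) <= 0:
                theta = theta + y[i] * X[i]
                theta_0 = theta_0 + y[i]
    return theta, theta_0
```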

Distance from a Line to a Point

Consider a line L in \mathbb{R}^2 given by \theta\cdot x+\theta_0=0. The distance d from a point P (the point at the tip of the vector x_0) to L is:

d=\frac{\left|\theta\cdot x_0+\theta_0\right|}{\parallel\theta\parallel}

Proof (by similar triangles, writing the line as Ax+By+C=0, so that \theta=\left(A,B\right) and \theta_0=C, and writing the coordinates of P as \left(x_0,y_0\right); R is the foot of the perpendicular from P to the line, and S, T, U, V are the auxiliary points of the similar-triangles construction):

\frac{|PR|}{|PS|}=\frac{|TV|}{|TU|}

which gives

|PR|=\frac{\left|Ax_0+By_0+C\right|}{\sqrt{A^2+B^2}}=\frac{\left|\theta\cdot x_0+\theta_0\right|}{\parallel\theta\parallel}=d
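
A quick numerical sanity check of the formula (the particular line and point below are arbitrary; the brute-force search just scans points along the line):

```python
import numpy as np

# Line {x in R^2 : theta . x + theta_0 = 0}, i.e. Ax + By + C = 0 with (A, B) = theta, C = theta_0
theta = np.array([3.0, 4.0])
theta_0 = -5.0
x0 = np.array([2.0, 1.0])

d_formula = abs(theta @ x0 + theta_0) / np.linalg.norm(theta)

# Brute force: parameterize the line as p + t * v, with v orthogonal to theta,
# and take the minimum distance from x0 over a fine grid of t values.
p = -theta_0 * theta / (theta @ theta)      # one point on the line
v = np.array([-theta[1], theta[0]])         # direction along the line
ts = np.linspace(-10.0, 10.0, 200_001)
points = p + ts[:, None] * v
d_brute = np.linalg.norm(points - x0, axis=1).min()

print(d_formula, d_brute)   # both approximately 1.0
```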

Decision Boundary vs. Margin Boundary

The decision boundary is the set of points x which satisfy:
\theta\cdot x+\theta_0=0

The margin boundary is the set of points x which satisfy:
\theta\cdot x+\theta_0=\pm1

The distance between the decision boundary and the margin boundary is:
\frac{1}{\parallel\theta\parallel}
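
This follows directly from the point-to-line distance formula above; a one-line derivation sketch (x_m below simply denotes any point on the margin boundary):

```latex
% Any point x_m on the margin boundary satisfies |\theta \cdot x_m + \theta_0| = 1, so
d = \frac{\left|\theta\cdot x_m+\theta_0\right|}{\parallel\theta\parallel}
  = \frac{1}{\parallel\theta\parallel}.
```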

Hinge Loss and Objective Function

Hinge loss is a loss function used to train classifiers. It tells us how undesirable a training example is with regard to both the margin and the correctness of its classification: {\rm Loss}_h\left(z\right)=\max\left(0,1-z\right), where z=y^{(i)}\left(\theta\cdot x^{(i)}+\theta_0\right) is the agreement.
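
A small sketch of the hinge loss written on the agreement z (the function name is mine):

```python
import numpy as np

def hinge_loss(z):
    """Hinge loss on the agreement z = y * (theta . x + theta_0): max(0, 1 - z)."""
    return np.maximum(0.0, 1.0 - z)

# z >= 1    : correct, on or beyond the margin boundary -> loss 0
# 0 < z < 1 : correct but inside the margin             -> loss between 0 and 1
# z <= 0    : misclassified                             -> loss >= 1
print(hinge_loss(np.array([2.0, 0.5, 0.0, -1.0])))  # [0.  0.5 1.  2. ]
```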

Linear Classification and Generalization

SVMs

Support vector machines are supervised learning models with associated learning algorithms that analyze data for classification and regression tasks.

The objective function is the average loss (here, the average hinge loss) plus a regularization term, which penalizes large \parallel\theta\parallel and therefore favors a larger margin:

J\left(\theta,\theta_0\right)=\frac{1}{n}\sum_{i=1}^{n}{\rm Loss}_h\left(y^{\left(i\right)}\left(\theta\cdot x^{\left(i\right)}+\theta_0\right)\right)+\frac{\lambda}{2}\parallel\theta\parallel^2
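
A direct transcription of J into Python/NumPy (a sketch; the helper name objective and the toy inputs are mine):

```python
import numpy as np

def objective(theta, theta_0, X, y, lam):
    """J(theta, theta_0) = average hinge loss + (lam / 2) * ||theta||^2."""
    agreements = y * (X @ theta + theta_0)           # y_i * (theta . x_i + theta_0)
    avg_hinge = np.mean(np.maximum(0.0, 1.0 - agreements))
    return avg_hinge + 0.5 * lam * (theta @ theta)

# Toy usage:
X = np.array([[1.0, 2.0], [-1.0, -1.5], [2.0, 0.5]])
y = np.array([+1, -1, +1])
print(objective(np.array([0.5, 0.5]), 0.0, X, y, lam=0.1))
```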

Support vectors are the points that lie exactly on the margin boundary.

If we remove all the support vectors, we will generally get a different \theta,\theta_0.

If we remove a point that is not a support vector, we will get the same \theta,\theta_0.

As we increase \lambda, the regularization term dominates, \parallel\theta\parallel shrinks, and hence the margin d=\frac{1}{\parallel\theta\parallel} increases.
If the training loss is low but the validation loss is high, the model might be overfitting; if both the training and validation losses are high, the model might be underfitting.
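
To illustrate the effect of \lambda, here is a hedged sketch using scikit-learn's LinearSVC, whose C parameter plays roughly the role of 1/(\lambda n), so stronger regularization (larger \lambda) corresponds to a smaller C (the dataset and the particular C values are arbitrary):

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import LinearSVC

# Two well-separated clusters as a toy dataset.
X, y = make_blobs(n_samples=100, centers=2, cluster_std=1.0, random_state=0)

for C in (100.0, 0.01):   # small C ~ strong regularization ~ large lambda
    clf = LinearSVC(C=C, loss="hinge", dual=True, max_iter=100_000).fit(X, y)
    theta = clf.coef_.ravel()
    margin = 1.0 / np.linalg.norm(theta)   # distance from decision to margin boundary
    print(f"C={C}: ||theta||={np.linalg.norm(theta):.3f}, margin={margin:.3f}")
```

With the smaller C (i.e., larger \lambda), \parallel\theta\parallel shrinks and the reported margin grows, matching the statement above.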
