Higher Order Feature Vectors
Linear classifiers can be used to make non-linear predictions: we first map the input $x$ to a higher-dimensional feature vector $\phi(x)$, and then train a linear classifier on the new feature vectors.

For example, the feature map $\phi(x) = [x, x^2]^T$ sends a one-dimensional input $x$ to a two-dimensional feature vector; a linear decision boundary in the feature space then corresponds to a non-linear (quadratic) boundary in the original space.

Another example: for $x \in \mathbb{R}^2$, since a possible decision boundary is an ellipse, we can use the quadratic feature map

$\phi(x) = [x_1,\ x_2,\ x_1^2,\ x_1 x_2,\ x_2^2]^T.$

A linear classifier $\theta \cdot \phi(x) + \theta_0 = 0$ in the feature space then corresponds to a quadratic (e.g. elliptical) decision boundary in the original space.
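As a quick illustration, the following minimal sketch shows a linear classifier on the quadratic features realizing an elliptical boundary; the particular ellipse $x_1^2/4 + x_2^2 = 1$, the parameter values, and the sample points are assumptions made for this example.

```python
# Minimal sketch: with the quadratic feature map phi(x) = [x1, x2, x1^2, x1*x2, x2^2],
# a linear classifier sign(theta . phi(x) + theta_0) realizes an elliptical boundary.
# The ellipse x1^2/4 + x2^2 = 1 and the sample points are illustrative.
import numpy as np

def phi(x):
    return np.array([x[0], x[1], x[0] ** 2, x[0] * x[1], x[1] ** 2])

# Choose theta, theta_0 so that theta . phi(x) + theta_0 = 1 - x1^2/4 - x2^2
theta = np.array([0.0, 0.0, -0.25, 0.0, -1.0])
theta_0 = 1.0

for point in ([0.0, 0.0], [1.0, 0.5], [3.0, 0.0], [0.0, 2.0]):
    x = np.array(point)
    label = np.sign(theta @ phi(x) + theta_0)
    print(point, "->", int(label))  # +1 inside the ellipse, -1 outside
```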
Non-linear Classification
The order-3 polynomial feature vector for $x \in \mathbb{R}^2$ collects all monomials of degree one, two, and three:

$\phi(x) = [x_1,\ x_2,\ x_1^2,\ \sqrt{2}\,x_1 x_2,\ x_2^2,\ x_1^3,\ \sqrt{3}\,x_1^2 x_2,\ \sqrt{3}\,x_1 x_2^2,\ x_2^3]^T,$

where the coefficients are chosen so that $\phi(x) \cdot \phi(x') = (x \cdot x') + (x \cdot x')^2 + (x \cdot x')^3$.
For each of the feature transformations (power 1, power 2, power 3), there are $n$-multichoose-power $= \binom{n + \text{power} - 1}{\text{power}}$ distinct monomials, where $n$ is the dimension of $x$. Thus the order-3 polynomial feature vector has dimension

$n + \binom{n+1}{2} + \binom{n+2}{3} = n + \frac{n(n+1)}{2} + \frac{n(n+1)(n+2)}{6}.$
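To make the counting concrete, here is a minimal sketch that enumerates the monomials of degree 1 to 3 with `itertools` and checks the multichoose counts; the helper name `polynomial_features` is illustrative, not from the notes.

```python
# Minimal sketch: enumerate all monomials of degrees 1-3 for an n-dimensional
# input and confirm the n-multichoose-power counts.
from itertools import combinations_with_replacement
from math import comb

import numpy as np

def polynomial_features(x, max_power=3):
    """Map x in R^n to the vector of all monomials of degree 1..max_power.
    (The sqrt coefficients that make the map kernel-consistent are omitted
    here for simplicity.)"""
    feats = []
    for power in range(1, max_power + 1):
        # combinations_with_replacement yields n-multichoose-power index tuples
        for idx in combinations_with_replacement(range(len(x)), power):
            feats.append(np.prod([x[i] for i in idx]))
    return np.array(feats)

x = np.array([2.0, 3.0])  # n = 2
phi = polynomial_features(x)
n = len(x)
expected = sum(comb(n + p - 1, p) for p in range(1, 4))  # multichoose counts
print(len(phi), expected)  # both 9 for n = 2, max_power = 3
```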

Regression Using Higher Order Polynomial Features
Assume we have $n$ data points in the training set, $\{(x^{(t)}, y^{(t)}),\ t = 1, \ldots, n\}$, where $(x^{(t)}, y^{(t)})$ is the $t$-th training example.

When the relationship between $y$ and $x$ can be roughly described by a cubic function, a polynomial feature vector of order at least 3 is needed to minimize the structural error.
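A minimal sketch of this idea, assuming synthetic data generated from a noisy cubic, is to run ordinary least squares on the features $[1, x, x^2, x^3]$:

```python
# Minimal sketch: least-squares regression on polynomial features of order 3,
# assuming a noisy cubic relationship between x and y (synthetic data).
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(-2, 2, 50)
y = 1.5 * x**3 - 2.0 * x + 0.5 + rng.normal(scale=0.5, size=x.shape)

# Feature matrix with columns [1, x, x^2, x^3]
Phi = np.vander(x, N=4, increasing=True)

# Ordinary least squares: theta = argmin ||Phi theta - y||^2
theta, *_ = np.linalg.lstsq(Phi, y, rcond=None)
print(theta)  # approximately [0.5, -2.0, 0.0, 1.5]
```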
Effect of Regularization on Higher Order Regression
The three figures below show the fitting results of a 9th-order polynomial regression with different values of the regularization parameter $\lambda$ on the same training data. The smallest regularization parameter $\lambda$ corresponds to figure A, and the largest regularization parameter $\lambda$ corresponds to figure B.
The effect of regularization is to restrict the parameters of a model from freely taking on large values. This makes the model function smoother, leveling the ‘hills’ and filling the ‘valleys’. It also makes the model more stable: with a smaller $\|\theta\|$, a small perturbation of $x$ will not change $y$ significantly.
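The following sketch illustrates this with ridge regression (L2-regularized least squares) on 9th-order polynomial features; the data and the $\lambda$ values are made up for illustration.

```python
# Minimal sketch: ridge regression on a 9th-order polynomial fit, showing
# that a larger lambda shrinks ||theta|| and therefore smooths the fit.
import numpy as np

rng = np.random.default_rng(1)
x = np.linspace(0, 1, 15)
y = np.sin(2 * np.pi * x) + rng.normal(scale=0.2, size=x.shape)

Phi = np.vander(x, N=10, increasing=True)  # features 1, x, ..., x^9

for lam in (1e-6, 1e-2, 1.0):
    # Closed-form ridge solution: theta = (Phi^T Phi + lam I)^{-1} Phi^T y
    theta = np.linalg.solve(Phi.T @ Phi + lam * np.eye(10), Phi.T @ y)
    print(f"lambda={lam:g}  ||theta|| = {np.linalg.norm(theta):.2f}")
# The norm of theta decreases as lambda increases, giving a smoother fit.
```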
Kernels as Dot Products 1
Let’s assume the quadratic feature map

$\phi(x) = [x_1,\ x_2,\ x_1^2,\ \sqrt{2}\,x_1 x_2,\ x_2^2]^T.$

Then the kernel function

$K(x, x') = \phi(x) \cdot \phi(x') = (x \cdot x') + (x \cdot x')^2$

can be computed directly from $x$ and $x'$ without ever forming the feature vectors explicitly.
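A quick numerical check of this identity (the point values below are arbitrary):

```python
# Minimal sketch: numerically check that phi(x).phi(x') equals
# (x.x') + (x.x')^2 for the quadratic feature map above.
import numpy as np

def phi(x):
    # Explicit quadratic feature map for x in R^2
    return np.array([x[0], x[1], x[0]**2, np.sqrt(2) * x[0] * x[1], x[1]**2])

def kernel(x, xp):
    # Kernel computed directly in the original space
    dot = x @ xp
    return dot + dot**2

x = np.array([1.0, 2.0])
xp = np.array([-0.5, 3.0])
print(phi(x) @ phi(xp), kernel(x, xp))  # the two values agree (35.75)
```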
The Kernel Perceptron Algorithm
The original Perceptron Algorithm is given as the following:

Given: the training set $\{(x^{(i)}, y^{(i)}),\ i = 1, \ldots, n\}$, a feature map $\phi$, and the number of passes $T$:

initialize $\theta = 0$ (vector)
for $t = 1, \ldots, T$:
    for $i = 1, \ldots, n$:
        if $y^{(i)}\,(\theta \cdot \phi(x^{(i)})) \le 0$: update $\theta$ appropriately

In the kernel perceptron, $\theta$ is never stored explicitly. Instead it is represented as a linear combination of the feature vectors of the training points,

$\theta = \sum_{j=1}^{n} \alpha_j\, y^{(j)} \phi(x^{(j)}),$

where $\alpha_j$ counts the number of mistakes made on the $j$-th example. The equivalent way to initialize $\alpha_1, \ldots, \alpha_n$, if we want the same result as initializing $\theta = 0$, is to set $\alpha_j = 0$ for all $j$.
Now look at the line “update $\theta$ appropriately” in the above algorithm. Assuming that there was a mistake in classifying the $j$-th data point, i.e. $y^{(j)}\,(\theta \cdot \phi(x^{(j)})) \le 0$, the perceptron update

$\theta \leftarrow \theta + y^{(j)} \phi(x^{(j)})$

is equivalent to

$\alpha_j \leftarrow \alpha_j + 1.$
The Mistake Condition
The mistake condition $y^{(i)}\,(\theta \cdot \phi(x^{(i)})) \le 0$ is equivalent to

$y^{(i)} \sum_{j=1}^{n} \alpha_j\, y^{(j)} K(x^{(j)}, x^{(i)}) \le 0,$

where $K(x^{(j)}, x^{(i)}) = \phi(x^{(j)}) \cdot \phi(x^{(i)})$ is the kernel, so the algorithm can be run using only kernel evaluations.
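Putting the pieces together, here is a minimal sketch of the kernel perceptron using the quadratic kernel from the earlier section; the toy XOR-style data and the number of passes are assumptions for illustration.

```python
# Minimal sketch of the kernel perceptron with the update rule above.
import numpy as np

def kernel(x, xp):
    # Quadratic kernel from the earlier section: K(x, x') = x.x' + (x.x')^2
    dot = x @ xp
    return dot + dot ** 2

def kernel_perceptron(X, y, T=10):
    n = len(X)
    alpha = np.zeros(n)  # alpha_j = 0 for all j corresponds to theta = 0
    K = np.array([[kernel(X[j], X[i]) for i in range(n)] for j in range(n)])
    for _ in range(T):
        for i in range(n):
            # Mistake condition: y_i * sum_j alpha_j y_j K(x_j, x_i) <= 0
            if y[i] * np.sum(alpha * y * K[:, i]) <= 0:
                alpha[i] += 1  # equivalent to theta <- theta + y_i phi(x_i)
    return alpha

# XOR-style data: not linearly separable in the original space,
# but separable with the quadratic kernel (via the x1*x2 feature).
X = np.array([[1.0, 1.0], [-1.0, -1.0], [1.0, -1.0], [-1.0, 1.0]])
y = np.array([1, 1, -1, -1])
print(kernel_perceptron(X, y))  # mistake counts alpha_j for each training point
```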
Kernel Composition Rules
If $K(x, x')$ is a kernel, then so is $\tilde{K}(x, x') = f(x)\,K(x, x')\,f(x')$ for any real-valued function $f$.
If $K_1(x, x')$ and $K_2(x, x')$ are kernels, then so are $K_1(x, x') + K_2(x, x')$ and $K_1(x, x')\,K_2(x, x')$.
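These rules can be sanity-checked numerically by verifying that the composed Gram matrices remain positive semidefinite on a random set of points; this is only a spot check, not a proof, and the kernels and the function $f$ below are arbitrary choices.

```python
# Minimal sketch: check positive semidefiniteness of composed Gram matrices.
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(6, 2))

def gram(k):
    return np.array([[k(a, b) for b in X] for a in X])

k1 = lambda x, xp: x @ xp                # linear kernel
k2 = lambda x, xp: (x @ xp + 1) ** 2     # polynomial kernel
f = lambda x: np.exp(x[0])               # arbitrary real-valued function

for name, k in [("f*K1*f", lambda x, xp: f(x) * k1(x, xp) * f(xp)),
                ("K1+K2", lambda x, xp: k1(x, xp) + k2(x, xp)),
                ("K1*K2", lambda x, xp: k1(x, xp) * k2(x, xp))]:
    eigs = np.linalg.eigvalsh(gram(k))
    print(name, "min eigenvalue:", eigs.min())  # >= 0 (up to numerical error)
```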
The Radial Basis Kernel
The radial basis kernel is given by:
$K(x, x') = \exp\!\left(-\tfrac{1}{2}\,\|x - x'\|^2\right).$
If $x = x'$, then $K(x, x') = 1$, its maximum value; as $\|x - x'\|$ grows, $K(x, x')$ decays toward 0, so the kernel can be read as a similarity measure between $x$ and $x'$.
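A minimal sketch of evaluating the radial basis kernel and observing its decay with distance (the points are arbitrary):

```python
# Minimal sketch: the radial basis kernel K(x, x') = exp(-||x - x'||^2 / 2)
# and its behavior as the distance between points grows.
import numpy as np

def rbf_kernel(x, xp):
    diff = x - xp
    return np.exp(-0.5 * diff @ diff)

x = np.array([0.0, 0.0])
for d in (0.0, 1.0, 2.0, 4.0):
    xp = np.array([d, 0.0])
    print(f"distance {d}: K = {rbf_kernel(x, xp):.4f}")
# K equals 1 when x == x' and decays rapidly toward 0 as the distance grows.
```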