Nonlinear classification

Higher Order Feature Vectors

Linear classifiers can be used to make non-linear predictions by first mapping the input into a higher-dimensional feature space and classifying linearly there.

For example, the feature map \phi(x)=\left[\phi_1(x),\phi_2(x)\right]^T=\left[x,x^2\right]^T sends a one-dimensional input x to a two-dimensional feature vector, so a linear decision rule in the feature space corresponds to a quadratic rule in the original input.

Another example:

Since a possible decision boundary is an ellipse,

\theta_1x_1+\theta_2x_2+\theta_3x_1x_2+\theta_4x_1^2+\theta_5x_2^2+\theta_0=\theta\cdot\left[x_1,x_2,x_1x_2,x_1^2,x_2^2\right]^T+\theta_0=0

\phi\left(x\right)=\left[x_1,x_2,x_1x_2,x_1^2,x_2^2\right]^T
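As a quick illustration, here is a minimal sketch (with illustrative parameter values, not learned ones) of a linear classifier acting on this feature vector; the particular \theta below makes the decision boundary in the original space the ellipse x_1^2/4+x_2^2=1:

```python
import numpy as np

def phi(x):
    """Map a 2-D input [x1, x2] to the feature vector [x1, x2, x1*x2, x1^2, x2^2]."""
    x1, x2 = x
    return np.array([x1, x2, x1 * x2, x1**2, x2**2])

# Illustrative parameters: theta . phi(x) + theta_0 = 0 is the ellipse x1^2/4 + x2^2 = 1.
theta = np.array([0.0, 0.0, 0.0, 0.25, 1.0])
theta_0 = -1.0

def classify(x):
    return np.sign(theta @ phi(x) + theta_0)

print(classify(np.array([0.5, 0.5])))  # inside the ellipse -> -1.0
print(classify(np.array([3.0, 0.0])))  # outside the ellipse -> +1.0
```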

Non-linear Classification

\mathrm{Let\ }x\in\mathbb{R}^{150},\mathrm{\ i.e.\ }x=\left[x_1,x_2,\ldots,x_{150}\right]^T

The order 3 polynomial feature vector is given by the following formula:

\phi\left(x\right)=\left[\underbrace{x_1,\ldots,x_i,\ldots,x_{150}}_{\deg 1},\ \underbrace{x_1^2,x_1x_2,\ldots,x_ix_j,\ldots,x_{150}^2}_{\deg 2},\ \underbrace{x_1^3,x_1^2x_2,\ldots,x_ix_jx_k,\ldots,x_{150}^3}_{\deg 3}\right]^T

\mathrm{where\ }1\le i\le j\le k\le150

For each of the feature transformations (degree 1, degree 2, degree 3), there are n-multichoose-degree distinct monomials. Thus, for n = 150, \phi(x) has 150 + 11,325 + 573,800 = 585,275 components.
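A quick sanity check of these counts in code (the helper name multichoose is just for illustration):

```python
from math import comb

n = 150  # input dimension

def multichoose(n, d):
    # Number of monomials of degree exactly d in n variables: C(n + d - 1, d).
    return comb(n + d - 1, d)

dims = {d: multichoose(n, d) for d in (1, 2, 3)}
print(dims)                 # {1: 150, 2: 11325, 3: 573800}
print(sum(dims.values()))   # 585275 components in phi(x)
```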

Regression using Higher Order Polynomial Features

Assume we have n data points in the training set: \left\{\left(x^{(t)},y^{(t)}\right)\right\}_{t=1,\ldots,n},\mathrm{\ where\ }\left(x^{(t)},y^{(t)}\right) is the t-th training example.

The relationship between y and x can be roughly described by a cubic function, so a feature vector \phi\left(x\right) of order at least 3 is needed to minimize the structural error.

Effect of Regularization on Higher Order Regression

The three figures below show the fit of a 9th-order polynomial regression with different regularization parameters \lambda on the same training data.

The smallest regularization parameter \lambda corresponds to figure A.

The largest regularization parameter \lambda corresponds to figure B.

The effect of regularization is to prevent the parameters of a model from freely taking on large values. This makes the model function smoother, leveling the ‘hills’ and filling the ‘valleys’. It also makes the model more stable: with a smaller \|\theta\|, a small perturbation of x will not change y significantly.
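A minimal sketch of this effect, assuming a small synthetic data set and closed-form ridge regression on 9th-order polynomial features (the data and the \lambda values are illustrative, not the ones behind the figures):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: a noisy cubic relationship, assumed here only for illustration.
x = np.linspace(-1, 1, 20)
y = x**3 - 0.5 * x + 0.1 * rng.standard_normal(x.size)

# 9th-order polynomial features x, x^2, ..., x^9 (offset omitted for brevity).
X = np.vstack([x**d for d in range(1, 10)]).T

def ridge_fit(X, y, lam):
    """Closed-form ridge regression: minimizes ||y - X theta||^2 + lam * ||theta||^2."""
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

for lam in (0.0, 0.01, 10.0):
    theta = ridge_fit(X, y, lam)
    print(f"lambda = {lam:>5}:  ||theta|| = {np.linalg.norm(theta):.3f}")
# Larger lambda -> smaller ||theta|| -> smoother, more stable fit.
```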

Kernels as Dot Products 1

Let’s assume:

\begin{aligned} \phi(x) &=\left[x_{1}, x_{2}, x_{1}^{2}, \sqrt{2} x_{1} x_{2}, x_{2}^{2}\right] \\ \phi\left(x^{\prime}\right) &=\left[x_{1}^{\prime}, x_{2}^{\prime}, x_{1}^{\prime 2}, \sqrt{2} x_{1}^{\prime} x_{2}^{\prime}, x_{2}^{\prime 2}\right] \end{aligned}

\phi(x) \cdot \phi\left(x^{\prime}\right)=x \cdot x^{\prime}+\left(x \cdot x^{\prime}\right)^{2}
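A quick numerical check of this identity (the two test points are arbitrary):

```python
import numpy as np

def phi(x):
    # Feature map from above: [x1, x2, x1^2, sqrt(2)*x1*x2, x2^2]
    x1, x2 = x
    return np.array([x1, x2, x1**2, np.sqrt(2) * x1 * x2, x2**2])

def K(x, xp):
    # The same quantity computed from dot products in the original space.
    return x @ xp + (x @ xp) ** 2

x  = np.array([1.0, 2.0])
xp = np.array([0.5, -1.0])

print(phi(x) @ phi(xp))  # 0.75, via the explicit feature vectors
print(K(x, xp))          # 0.75, without ever forming phi
```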

The Kernel Perceptron Algorithm

The original Perceptron Algorithm is given as the following:
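A minimal sketch of that algorithm in code, assuming labels in \{-1,+1\} and omitting the offset \theta_0 for simplicity (the function and variable names are illustrative):

```python
import numpy as np

def perceptron(X, y, phi, T=10):
    """Standard perceptron run in feature space.
    X: (n, d) inputs, y: (n,) labels in {-1, +1}, phi: feature map, T: passes."""
    n = X.shape[0]
    theta = np.zeros_like(phi(X[0]), dtype=float)
    for _ in range(T):
        for i in range(n):
            # Mistake: the current theta does not classify example i correctly.
            if y[i] * (theta @ phi(X[i])) <= 0:
                theta = theta + y[i] * phi(X[i])
    return theta
```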

Since every update adds a multiple of some \phi\left(x^{(j)}\right) to \theta, the parameter vector can always be written as

\theta=\sum_{j=1}^{n}\hairsp\alpha_jy^{\left(j\right)}\phi\left(x^{\left(j\right)}\right)

To get the same result as initializing \theta=0, the equivalent initialization of \alpha_1,\alpha_2,\ldots,\alpha_n is \alpha_1=\ldots=\alpha_n=0.

Now look at the update step of the algorithm:

Assuming that the algorithm makes a mistake in classifying the i-th data point, i.e.

y^{(i)}\left(\theta \cdot \phi\left(x^{(i)}\right)\right) \leq 0,
then the update \alpha_{i}=\alpha_{i}+1 is equivalent to \theta=\theta+y^{(i)} \phi\left(x^{(i)}\right).

The Mistake Condition

y^{(i)} \sum_{j=1}^{n} \alpha_{j} y^{(j)} K\left(x^{(j)}, x^{(i)}\right) \leq 0 is equivalent to y^{(i)}\left(\theta \cdot \phi\left(x^{(i)}\right)\right) \leq 0, where K\left(x, x^{\prime}\right)=\phi(x) \cdot \phi\left(x^{\prime}\right).
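Putting these pieces together, here is a minimal sketch of the kernel perceptron, assuming labels in \{-1,+1\} and any valid kernel function K (the names are illustrative):

```python
import numpy as np

def kernel_perceptron(X, y, K, T=10):
    """Kernel perceptron: learns the coefficients alpha directly,
    never forming phi(x) explicitly."""
    n = X.shape[0]
    alpha = np.zeros(n)
    # Gram matrix of kernel values: G[j, i] = K(x^(j), x^(i)).
    G = np.array([[K(X[j], X[i]) for i in range(n)] for j in range(n)])
    for _ in range(T):
        for i in range(n):
            # Mistake condition written purely in terms of kernel values.
            if y[i] * np.sum(alpha * y * G[:, i]) <= 0:
                alpha[i] += 1  # equivalent to theta += y^(i) * phi(x^(i))
    return alpha

def kernel_predict(x_new, X, y, alpha, K):
    """Sign of theta . phi(x_new), again computed via kernel values only."""
    return np.sign(np.sum(alpha * y * np.array([K(xj, x_new) for xj in X])))
```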

Kernel Composition Rules

If f:\mathbb{R}^d\rightarrow\mathbb{R}\mathrm{\ and\ }K\left(x,x^\prime\right) is a kernel, then so is \widetilde{K}\left(x,x^\prime\right)=f\left(x\right)K\left(x,x^\prime\right)f\left(x^\prime\right)

If K\left(x,x^\prime\right)=\phi(x)\cdot\phi\left(x^\prime\right), then \widetilde{K} corresponds to the feature map \varphi(x)=f(x)\phi(x), since \varphi(x)\cdot\varphi\left(x^\prime\right)=f(x)\,\phi(x)\cdot\phi\left(x^\prime\right)f\left(x^\prime\right)=\widetilde{K}\left(x,x^\prime\right).
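A quick numerical check of this rule, reusing the earlier feature map and an arbitrary choice of f:

```python
import numpy as np

def phi(x):
    x1, x2 = x
    return np.array([x1, x2, x1**2, np.sqrt(2) * x1 * x2, x2**2])

def K(x, xp):
    return x @ xp + (x @ xp) ** 2

def f(x):
    # Any scalar function of x works; this one is chosen only for illustration.
    return 1.0 + np.sum(x ** 2)

x  = np.array([1.0, 2.0])
xp = np.array([0.5, -1.0])

K_tilde = f(x) * K(x, xp) * f(xp)       # composed kernel
varphi = lambda z: f(z) * phi(z)        # corresponding feature map
print(K_tilde, varphi(x) @ varphi(xp))  # both print 10.125
```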

The Radial Basis Kernel

The radial basis kernel is given by:

K\left(x, x^{\prime}\right)=e^{-\frac{1}{2}\left\|x-x^{\prime}\right\|^{2}}

If \begin{aligned} x &=[1,0,0]^{T} \\ x^{\prime} &=[0,1,0]^{T} \end{aligned}, then:

K\left(x, x^{\prime}\right)=e^{-\frac{1}{2}\left\|x-x^{\prime}\right\|^{2}}=e^{-\frac{1}{2}(2)}=e^{-1}
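The same computation in code:

```python
import numpy as np

def rbf_kernel(x, xp):
    """Radial basis kernel K(x, x') = exp(-||x - x'||^2 / 2)."""
    return np.exp(-0.5 * np.sum((x - xp) ** 2))

x  = np.array([1.0, 0.0, 0.0])
xp = np.array([0.0, 1.0, 0.0])

print(rbf_kernel(x, xp))  # 0.3678..., i.e. e^{-1}
print(np.exp(-1.0))
```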
