Machine Learning

Generative vs Discriminative models

  • Generative models model the probability distribution of each class
  • Discriminative models learn the decision boundary between the classes

Simple Multinomial Generative model

\theta_w denotes the probability of model M choosing a word w. Its value must lie between 0 and 1: 0\le\theta_w\le1

Likelihood Function

For simplicity let's consider W=\{0,1\}. We want to estimate a multinomial model to generate a document D="0101".

For this task, we consider two multinomial models M_1 and M_2 with parameters \theta^{(1)} and \theta^{(2)}, where:

\theta_{0}^{(1)}=\frac{1}{2}, \theta_{1}^{(1)}=\frac{1}{2}

P\left(D\mid\theta^{(1)}\right) denotes the probability of the Model 1 generating D.

P(D \mid \theta)=\prod_{w \in W} \theta_{w}^{\text {count }(w)}
P\left(D \mid \theta^{(1)}\right)=\left(0.5^{2}\right)\left(0.5^{2}\right)=0.0625

Again, if we consider:

\theta_{0}^{(2)}=\frac{1}{5}, \theta_{1}^{(2)}=\frac{4}{5}
P(D \mid \theta)=\prod_{w \in W} \theta_{w}^{\operatorname{count}(w)}
P\left(D \mid \theta^{(2)}\right)=\left(0.2^{2}\right)\left(0.8^{2}\right)=0.0256

Since P\left(D\mid\theta^{(1)}\right)>P\left(D\mid\theta^{(2)}\right), model M_{1} explains D better than M_{2}.
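The comparison above can be sketched in a few lines of Python. This is a minimal illustration of the likelihood formula P(D\mid\theta)=\prod_w \theta_w^{\operatorname{count}(w)}; the function and variable names are my own, not from the source.

```python
# Likelihood of a document under a multinomial model:
# P(D | theta) is the product of theta[w] over each symbol w in D.

def likelihood(doc, theta):
    p = 1.0
    for w in doc:
        p *= theta[w]
    return p

# The two parameter settings from the text.
theta1 = {"0": 0.5, "1": 0.5}  # theta^(1)
theta2 = {"0": 0.2, "1": 0.8}  # theta^(2)

print(likelihood("0101", theta1))  # 0.0625
print(likelihood("0101", theta2))  # ~0.0256 (up to float rounding)
```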

Maximum Likelihood Estimate

Consider the vocabulary W=\{a,b,c,\ldots,z\}

Our model M needs only 25 free parameters to express the probability of each letter, since the 26 probabilities must sum to 1.

Let \theta^\ast=\left(\theta_a^\ast,\theta_b^\ast,\ldots,\theta_z^\ast\right) be the parameters of M^\ast. Then, for instance, \theta_e^\ast=2\theta_z^\ast means that e is twice as likely to be generated as z.

MLE for Multinomial Distribution

Let P(D \mid \theta) be the probability of D being generated by the simple model described above.

P(D \mid \theta)=\prod_{i=1}^{n} \theta_{w_{i}}=\prod_{w \in W} \theta_{w}^{\operatorname{count}(w)}

Stationary Points of the Lagrange Function

P(D \mid \theta)=\prod_{w \in W}\left(\theta_{w}\right)^{\operatorname{count}(w)}

Maximizing P(D\mid\theta) is equivalent to maximizing \log P(D\mid\theta)

\log P(D \mid \theta)=\sum_{w \in W} \operatorname{count}(w) \log \theta_{w}

We know that:

\sum_{w \in W} \theta_{w}=1

Define the Lagrange function:

L=\log P(D \mid \theta)+\lambda\left(\sum_{w \in W} \theta_{w}-1\right)

Then, find the stationary points of L by solving the equation \nabla_\theta L=0

\frac{\partial}{\partial \theta_{w}}\left(\log P(D \mid \theta)+\lambda\left(\sum_{w \in W} \theta_{w}-1\right)\right)=\frac{\operatorname{count}(w)}{\theta_{w}}+\lambda=0 for all w \in W

\theta_{w}=\frac{-\operatorname{count}(w)}{\lambda}

Substituting this into the constraint \sum_{w \in W} \theta_{w}=1 gives:

\lambda=-\sum_{w \in W} \operatorname{count}(w)
\theta_{w}=\frac{\operatorname{count}(w)}{\sum_{w^{\prime} \in W} \operatorname{count}\left(w^{\prime}\right)}
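The closed-form MLE derived above is just normalized counts, so it is easy to sketch in Python (function name is mine, chosen for illustration):

```python
from collections import Counter

def mle_multinomial(doc):
    """MLE for the multinomial model: theta_w = count(w) / total count."""
    counts = Counter(doc)
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

# For D = "0101", each symbol appears twice out of four.
print(mle_multinomial("0101"))  # {'0': 0.5, '1': 0.5}
```

Note that on the document D="0101" the estimate recovers exactly the parameters \theta^{(1)} that maximized the likelihood in the earlier example.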

Predictions of a Generative Multinomial Model

Suppose that we classify a new document D as belonging to the positive class iff:

\log \frac{P\left(D \mid \theta^{+}\right)}{P\left(D \mid \theta^{-}\right)} \geq 0

That is, the document is classified as positive iff P\left(D\mid\theta^+\right)\geq P\left(D\mid\theta^-\right)

The generative classifier M can be shown to be equivalent to a linear classifier given by \sum_{w\in W}\operatorname{count}(w)\,\theta_w^\prime\geq0, where \theta_w^\prime=\log\frac{\theta_w^+}{\theta_w^-}
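The linear-classifier view can be sketched directly: compute the per-word weights \theta_w^\prime and take the sign of the weighted count sum. The parameter values below are hypothetical, reused from the earlier example for illustration.

```python
import math
from collections import Counter

def classify(doc, theta_pos, theta_neg):
    """Return '+' iff sum_w count(w) * log(theta_w+ / theta_w-) >= 0."""
    counts = Counter(doc)
    score = sum(c * math.log(theta_pos[w] / theta_neg[w])
                for w, c in counts.items())
    return "+" if score >= 0 else "-"

# Hypothetical class-conditional parameters.
theta_pos = {"0": 0.5, "1": 0.5}
theta_neg = {"0": 0.2, "1": 0.8}

print(classify("0101", theta_pos, theta_neg))  # '+'
```

For D="0101" the score is 2\log(0.5/0.2)+2\log(0.5/0.8)>0, matching the direct likelihood comparison 0.0625\geq0.0256.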

Prior, Posterior and Likelihood

Consider a binary classification task with two labels '+' (positive) and '-' (negative).

Let y denote the classification label assigned to a document D by a multinomial generative model M with parameters \theta^{+} for the positive class and \theta^{-} for the negative class.

P(y=+\mid D) is the posterior distribution
P(y=+) is the prior distribution

Example

P(y=+)=0.3
P\left(D \mid \theta^{+}\right)=0.3 and P\left(D \mid \theta^{-}\right)=0.6
P(D)=P(D \mid y=+) P(y=+)+P(D \mid y=-) P(y=-)
P(D)=0.51
P(y=+\mid D)=\frac{P\left(D \mid \theta^{+}\right) P(y=+)}{P(D)}
P(y=+\mid D)=0.18
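The worked example is a direct application of Bayes' rule, and can be checked with a small sketch (the function name is mine):

```python
def posterior_pos(lik_pos, lik_neg, prior_pos):
    """P(y=+|D) by Bayes' rule, expanding P(D) by total probability."""
    prior_neg = 1.0 - prior_pos
    p_d = lik_pos * prior_pos + lik_neg * prior_neg  # P(D) = 0.51 here
    return lik_pos * prior_pos / p_d

# Values from the example: P(y=+)=0.3, P(D|theta+)=0.3, P(D|theta-)=0.6.
print(round(posterior_pos(0.3, 0.6, 0.3), 2))  # 0.18
```

So even though the prior slightly favors the positive class less, the dominant factor here is the likelihood ratio: D is twice as likely under the negative model, pushing the posterior well below 0.5.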

Gaussian Generative models

MLE for the Gaussian Distribution

The probability density function for a Gaussian random variable is given as follows:

f_{X}\left(x \mid \mu, \sigma^{2}\right)=\frac{1}{\left(2 \pi \sigma^{2}\right)^{\frac{d}{2}}} e^{-\frac{\left|x-\mu\right|^{2}}{2 \sigma^{2}}}


Let S_{n}=\left\{X^{(1)}, X^{(2)}, \ldots, X^{(n)}\right\} be i.i.d. random variables with mean \mu and variance \sigma^{2}

Then their joint probability density function is given by:

\prod_{t=1}^{n} P\left(x^{(t)} \mid \mu, \sigma^{2}\right)=\prod_{t=1}^{n} \frac{1}{\left(2 \pi \sigma^{2}\right)^{\frac{d}{2}}} e^{-\frac{\left|x^{(t)}-\mu\right|^{2}}{2 \sigma^{2}}}

Taking logarithm of the above function, we get:

\begin{aligned} \log P\left(S_{n} \mid \mu, \sigma^{2}\right) &=\log \left(\prod_{t=1}^{n} \frac{1}{\left(2 \pi \sigma^{2}\right)^{\frac{d}{2}}} e^{-\frac{\left|x^{(t)}-\mu\right|^{2}}{2 \sigma^{2}}}\right)=\sum_{t=1}^{n} \log \frac{1}{\left(2 \pi \sigma^{2}\right)^{\frac{d}{2}}}+\sum_{t=1}^{n} \log e^{-\frac{\left|x^{(t)}-\mu\right|^{2}}{2 \sigma^{2}}} \\ &=\sum_{t=1}^{n}-\frac{d}{2} \log \left(2 \pi \sigma^{2}\right)-\frac{1}{2 \sigma^{2}} \sum_{t=1}^{n}\left|x^{(t)}-\mu\right|^{2} \\ &=-\frac{n d}{2} \log \left(2 \pi \sigma^{2}\right)-\frac{1}{2 \sigma^{2}} \sum_{t=1}^{n}\left|x^{(t)}-\mu\right|^{2} \end{aligned}

Differentiating with respect to \mu:

\frac{\partial \log P\left(S_{n} \mid \mu, \sigma^{2}\right)}{\partial \mu}=\frac{1}{\sigma^{2}} \sum_{t=1}^{n}\left(x^{(t)}-\mu\right)

MLE for the Mean and Variance

Setting the derivative with respect to \mu to zero gives:

\hat{\mu}=\frac{\sum_{t=1}^{n} x^{(t)}}{n}

Differentiating with respect to \sigma^{2}:

\frac{\partial \log P\left(S_{n} \mid \mu, \sigma^{2}\right)}{\partial \sigma^{2}}=-\frac{n d}{2 \sigma^{2}}+\frac{\sum_{t=1}^{n}\left|x^{(t)}-\mu\right|^{2}}{2\left(\sigma^{2}\right)^{2}}

Setting this to zero gives:

\hat{\sigma}^{2}=\frac{\sum_{t=1}^{n}\left|x^{(t)}-\mu\right|^{2}}{n d}
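The closed-form estimates can be checked numerically for the d=1 case: the MLE mean is the sample average, and the MLE variance is the average squared deviation (without the n-1 correction). This is a sketch with hypothetical sample data, not code from the source.

```python
import random

def gaussian_mle(xs):
    """MLE for a 1-D Gaussian: mu_hat = sample mean,
    sigma2_hat = mean squared deviation from mu_hat (divide by n, not n-1)."""
    n = len(xs)
    mu = sum(xs) / n
    sigma2 = sum((x - mu) ** 2 for x in xs) / n
    return mu, sigma2

# Draw samples from N(mu=5, sigma=2) and recover the parameters.
random.seed(0)
samples = [random.gauss(5.0, 2.0) for _ in range(10000)]
mu_hat, sigma2_hat = gaussian_mle(samples)
print(mu_hat, sigma2_hat)  # close to 5 and 4
```

With 10,000 samples the estimates land near the true values \mu=5 and \sigma^2=4; the slight downward bias of the variance estimator vanishes as n grows.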
