- Generative models learn the probability distribution of each class
- Discriminative models learn the decision boundary between the classes
Simple Multinomial Generative Model

A simple multinomial generative model M is described by a parameter \theta_w for each word w in the vocabulary W. \theta_w denotes the probability of model M choosing the word w. Its value must lie between 0 and 1, and the parameters must sum to one:

0 \le \theta_w \le 1, \qquad \sum_{w \in W} \theta_w = 1
Likelihood Function
For simplicity, let's consider W = {0, 1}. We want to estimate a multinomial model that generates the document D = "0101".

For this task, we compare two multinomial models, M_1 and M_2, where

P(D \mid M_1) = \theta_0^{(1)} \theta_1^{(1)} \theta_0^{(1)} \theta_1^{(1)} = \left(\theta_0^{(1)}\right)^2 \left(\theta_1^{(1)}\right)^2

denotes the probability of Model 1 generating D (and similarly for Model 2).

For instance, if we consider

M_1: \theta_0 = \theta_1 = 0.5 \qquad\qquad M_2: \theta_0 = 0.9, \; \theta_1 = 0.1

then

P(D \mid M_1) = (0.5)^2 (0.5)^2 = 0.0625 \qquad\qquad P(D \mid M_2) = (0.9)^2 (0.1)^2 = 0.0081

Since P(D \mid M_1) > P(D \mid M_2), Model 1 is better than Model 2 at explaining the observed document D.
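As a quick numerical check, here is a minimal Python sketch of this comparison; the parameter values for the two models are the illustrative ones chosen above, not anything prescribed by the model itself.

```python
# Likelihood of a document under a multinomial (bag-of-words) model:
# P(D | theta) is the product of theta_w over the words w appearing in D.

def likelihood(document, theta):
    """Probability of generating `document` given word probabilities `theta`."""
    p = 1.0
    for w in document:
        p *= theta[w]
    return p

D = "0101"
model_1 = {"0": 0.5, "1": 0.5}  # illustrative parameters for Model 1
model_2 = {"0": 0.9, "1": 0.1}  # illustrative parameters for Model 2

print(likelihood(D, model_1))   # 0.0625
print(likelihood(D, model_2))   # 0.0081 (up to floating-point rounding)
```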
Maximum Likelihood Estimate
Consider the vocabulary W = {a, b, \ldots, z}, the 26 letters of the English alphabet.

Our model M has a parameter \theta_w for each letter, but because the parameters must sum to 1, only 25 of them are free.

Let \theta^* be the parameters of the maximum likelihood model M^*. Then:

\theta^* = \arg\max_{\theta} P(D \mid \theta)
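Before deriving the closed form, the argmax can also be found numerically. The sketch below is my own illustration: it reuses the binary vocabulary W = {0, 1} and D = "0101" from above, sweeps the single free parameter \theta_0 over a grid, and reports the maximizer.

```python
# Brute-force search for the maximum likelihood parameter of a
# two-word vocabulary W = {"0", "1"}, where theta_1 = 1 - theta_0.

D = "0101"

def likelihood(document, theta_0):
    theta = {"0": theta_0, "1": 1.0 - theta_0}
    p = 1.0
    for w in document:
        p *= theta[w]
    return p

grid = [i / 1000 for i in range(1, 1000)]                 # candidate values of theta_0
best_theta_0 = max(grid, key=lambda t: likelihood(D, t))  # argmax over the grid

print(best_theta_0)  # 0.5 = count("0") / len(D), matching the MLE derived below
```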
MLE for Multinomial Distribution
Let P(D \mid \theta) be the probability of a document D being generated by the simple multinomial model described above. Then:

P(D \mid \theta) = \prod_{w \in W} \theta_w^{count(w)}

where count(w) denotes the number of times the word w appears in D.
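A sketch of this formula for an arbitrary vocabulary. In practice the log of the product is computed, since the raw product underflows for long documents; the vocabulary and parameters below are made up for illustration.

```python
from collections import Counter
from math import log

def log_likelihood(document, theta):
    """log P(D | theta) = sum over words w of count(w) * log(theta_w)."""
    counts = Counter(document)
    return sum(c * log(theta[w]) for w, c in counts.items())

# Assumed toy vocabulary and parameters (they sum to 1).
theta = {"the": 0.5, "cat": 0.3, "sat": 0.2}
D = ["the", "cat", "sat", "the"]

print(log_likelihood(D, theta))  # = 2*log(0.5) + log(0.3) + log(0.2)
```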
Stationary Points of the Lagrange Function
Maximizing P(D \mid \theta) is equivalent to maximizing its logarithm:

\log P(D \mid \theta) = \sum_{w \in W} count(w) \log \theta_w

We know that the parameters must satisfy the constraint:

\sum_{w \in W} \theta_w = 1

Define the Lagrange function:

L(\theta, \lambda) = \sum_{w \in W} count(w) \log \theta_w + \lambda \left( \sum_{w \in W} \theta_w - 1 \right)

Then, find the stationary points of L by solving the equation

\frac{\partial L}{\partial \theta_w} = \frac{count(w)}{\theta_w} + \lambda = 0

for all w \in W. This gives \theta_w = -count(w)/\lambda, and substituting into the constraint \sum_{w \in W} \theta_w = 1 yields \lambda = -\sum_{w \in W} count(w). The maximum likelihood estimate is therefore:

\hat{\theta}_w = \frac{count(w)}{\sum_{w' \in W} count(w')}

that is, each parameter is the empirical fraction of occurrences of the word w in the document.
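A minimal sketch of this closed-form estimate: the MLE for each word is simply its relative frequency in the document (the toy document is my own example).

```python
from collections import Counter

def multinomial_mle(document):
    """MLE of the multinomial parameters: theta_w = count(w) / total word count."""
    counts = Counter(document)
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

D = list("abracadabra")  # toy "document" whose words are single letters
print(multinomial_mle(D))
# theta_a = 5/11, theta_b = 2/11, theta_r = 2/11, theta_c = 1/11, theta_d = 1/11
```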
Predictions of a Generative Multinomial Model
Suppose we have estimated multinomial parameters \theta^{+} for the positive class and \theta^{-} for the negative class, and that we classify a new document D as belonging to the positive class iff:

\log \frac{P(D \mid \theta^{+})}{P(D \mid \theta^{-})} \ge 0

Expanding both likelihoods, the document is classified as positive iff:

\sum_{w \in W} count(w) \log \frac{\theta_w^{+}}{\theta_w^{-}} \ge 0

The generative classifier M can therefore be shown to be equivalent to a linear classifier whose features are the word counts count(w) and whose weights are \log(\theta_w^{+} / \theta_w^{-}).
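A sketch of this decision rule in code; the per-class word probabilities are made-up numbers meant only to illustrate the weights \log(\theta_w^{+}/\theta_w^{-}).

```python
from collections import Counter
from math import log

def classify(document, theta_pos, theta_neg):
    """Return '+' iff sum_w count(w) * log(theta_pos[w] / theta_neg[w]) >= 0."""
    counts = Counter(document)
    score = sum(c * log(theta_pos[w] / theta_neg[w]) for w, c in counts.items())
    return "+" if score >= 0 else "-"

# Illustrative per-class word probabilities (each sums to 1 over the vocabulary).
theta_pos = {"good": 0.6, "bad": 0.1, "movie": 0.3}
theta_neg = {"good": 0.1, "bad": 0.6, "movie": 0.3}

print(classify(["good", "movie"], theta_pos, theta_neg))        # '+'
print(classify(["bad", "bad", "movie"], theta_pos, theta_neg))  # '-'
```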
Prior, Posterior and Likelihood
Consider a binary classification task with two labels '+' (positive) and '-' (negative).

Let y denote the classification label assigned to a document D by a multinomial generative model M with parameters \theta^{+} for the positive class and \theta^{-} for the negative class. Then:

P(y = + \mid D) is the posterior distribution

P(y = +) is the prior distribution

P(D \mid \theta^{+}) is the likelihood

Bayes' rule relates the three:

P(y = + \mid D) = \frac{P(D \mid \theta^{+}) \, P(y = +)}{P(D)}
Example
For instance, suppose the class priors are

P(y = +) = 0.3 \qquad and \qquad P(y = -) = 0.7

and that for a given document D the likelihoods under the two class models are

P(D \mid \theta^{+}) = 0.08 \qquad P(D \mid \theta^{-}) = 0.02

Then, by Bayes' rule,

P(y = + \mid D) = \frac{(0.08)(0.3)}{(0.08)(0.3) + (0.02)(0.7)} = \frac{0.024}{0.038} \approx 0.63

so D is assigned to the positive class, even though the prior favors the negative class.
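The same posterior computation as a short Python sketch, using the illustrative numbers above:

```python
def posterior_positive(lik_pos, lik_neg, prior_pos, prior_neg):
    """P(y = + | D) by Bayes' rule, expanding P(D) with the law of total probability."""
    evidence = lik_pos * prior_pos + lik_neg * prior_neg
    return lik_pos * prior_pos / evidence

print(posterior_positive(lik_pos=0.08, lik_neg=0.02, prior_pos=0.3, prior_neg=0.7))
# ~0.632, so the document is classified as positive
```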
Gaussian Generative Models
MLE for the Gaussian Distribution
The probability density function of a Gaussian random variable with mean \mu and variance \sigma^2 is given as follows:

f(x \mid \mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left( -\frac{(x - \mu)^2}{2\sigma^2} \right)
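A quick numerical sanity check of this density against SciPy's implementation; the evaluation point, mean, and variance below are arbitrary.

```python
from math import exp, pi, sqrt
from scipy.stats import norm

def gaussian_pdf(x, mu, sigma2):
    """Density of N(mu, sigma2) evaluated at x."""
    return exp(-(x - mu) ** 2 / (2 * sigma2)) / sqrt(2 * pi * sigma2)

x, mu, sigma2 = 1.2, 0.5, 2.0
print(gaussian_pdf(x, mu, sigma2))
print(norm.pdf(x, loc=mu, scale=sqrt(sigma2)))  # should agree
```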
Let x^{(1)}, x^{(2)}, \ldots, x^{(n)} be n independent samples drawn from this Gaussian distribution. Then their joint probability density function is given by:

P(x^{(1)}, \ldots, x^{(n)} \mid \mu, \sigma^2) = \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left( -\frac{(x^{(i)} - \mu)^2}{2\sigma^2} \right)

Taking the logarithm of the above function, we get:

\log P(x^{(1)}, \ldots, x^{(n)} \mid \mu, \sigma^2) = -\frac{n}{2} \log(2\pi\sigma^2) - \frac{1}{2\sigma^2} \sum_{i=1}^{n} (x^{(i)} - \mu)^2
Setting the derivatives of the log-likelihood with respect to \mu and \sigma^2 to zero gives the MLE for the mean and variance:

\hat{\mu} = \frac{1}{n} \sum_{i=1}^{n} x^{(i)}

\hat{\sigma}^2 = \frac{1}{n} \sum_{i=1}^{n} \left( x^{(i)} - \hat{\mu} \right)^2
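A sketch of these estimators on synthetic data (the true mean and variance below are arbitrary). Note that the MLE of the variance divides by n, unlike the unbiased sample variance which divides by n - 1.

```python
import numpy as np

rng = np.random.default_rng(0)
samples = rng.normal(loc=2.0, scale=3.0, size=10_000)  # true mu = 2, sigma^2 = 9

mu_hat = samples.mean()                        # (1/n) * sum of x_i
sigma2_hat = ((samples - mu_hat) ** 2).mean()  # (1/n) * sum of squared deviations

print(mu_hat, sigma2_hat)   # close to 2.0 and 9.0
print(np.var(samples))      # equals sigma2_hat (np.var uses ddof=0, i.e. the MLE form)
```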