
Linear Discriminant Analysis (LDA)


If the data points take values in a discrete set of classes, we can always divide the input space into a collection of regions labelled according to the classification.
Let us assume that we are given K classes with labels 1, 2, . . . , K, and that each class k is scored by a linear function $f_k(x) = \beta_{k0} + \beta_k^T x$. The decision boundary between classes k and l is the set of points for which $f_k(x) = f_l(x)$. Since this is true for any pair of classes, the input space is divided into regions of constant classification, with piecewise hyperplanar decision boundaries.
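Writing the boundary out explicitly, the set of points where the two linear scores agree is\[\{x : (\beta_{k0} - \beta_{l0}) + (\beta_k - \beta_l)^T x = 0\},\]which is an affine set, i.e. a hyperplane, in the input space.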
Decision theory for classification tells us that we need to know the class posteriors Pr(G|X) for optimal classification. Suppose $f_k(x)$ is the class-conditional density of X in class G = k, and let $\pi_k$ be the prior probability of class k, with $\sum_{k=1}^{K}\pi_k = 1$. Bayes' theorem then gives\[Pr(G=k|X = x) = \frac{f_k(x)\pi_k}{\sum_{l=1}^{K}f_l(x)\pi_l}\]
Suppose that we model each class density as a multivariate Gaussian:\[f_k(x) = \frac{1}{(2\pi )^{p/2} |\Sigma_k|^{1/2}}e^{-\frac{1}{2}(x-\mu_k)^T\Sigma_{k}^{-1}(x-\mu_{k})}\]
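As a quick sketch (not from the original text), the density above can be evaluated directly with NumPy; x, mu and Sigma below are placeholder names for a query point, a class mean and a class covariance matrix:

import numpy as np

def gaussian_density(x, mu, Sigma):
    # Multivariate Gaussian density f_k(x) with mean mu and covariance Sigma
    p = len(mu)
    diff = x - mu
    norm_const = (2 * np.pi) ** (p / 2) * np.sqrt(np.linalg.det(Sigma))  # (2 pi)^(p/2) |Sigma|^(1/2)
    quad = diff @ np.linalg.solve(Sigma, diff)                           # (x - mu)^T Sigma^{-1} (x - mu)
    return np.exp(-0.5 * quad) / norm_const

# Example: evaluate a 2-dimensional class density at one point
x = np.array([1.0, 0.5])
mu = np.zeros(2)
Sigma = np.eye(2)
print(gaussian_density(x, mu, Sigma))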


LDA arises in the special case when we assume that the classes share a common covariance matrix, i.e.
$\Sigma_k = \Sigma$ for all k.
To compare two classes k and l, it is sufficient to look at the log-ratio of their posteriors.
We see that\[\log\frac{Pr(G=k|X = x)}{Pr(G=l|X = x)} = \log\frac{f_k(x)}{f_l(x)}+\log\frac{\pi_k}{\pi_l}\]\[=\log\frac{\pi_k}{\pi_l}-\frac{1}{2}(\mu_k+\mu_l)^T\Sigma^{-1}(\mu_k-\mu_l)+ x^T\Sigma^{-1}(\mu_k-\mu_l)\]
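A minimal sketch of this log-odds computation, assuming the class means, the shared covariance matrix and the priors are already known (the function and variable names here are illustrative, not from the text):

import numpy as np

def log_odds(x, mu_k, mu_l, Sigma, pi_k, pi_l):
    # log Pr(G=k|X=x) / Pr(G=l|X=x) under the common-covariance Gaussian model
    Sigma_inv_diff = np.linalg.solve(Sigma, mu_k - mu_l)   # Sigma^{-1} (mu_k - mu_l)
    return (np.log(pi_k / pi_l)
            - 0.5 * (mu_k + mu_l) @ Sigma_inv_diff
            + x @ Sigma_inv_diff)

A positive value favours class k, a negative value favours class l, and the set where the log-odds equal zero is exactly the linear decision boundary discussed next.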


This equation is linear in x. The equal covariance matrices cause the normalization factors to cancel, as well as the quadratic parts of the exponents. The linear log-odds function implies that the decision boundary between classes k and l, the set where Pr(G = k|X = x) = Pr(G = l|X = x), is linear in x; in p dimensions it is a hyperplane.

As an idealized example, consider three classes with p = 2, where the data arise from three Gaussian distributions with a common covariance matrix.

We see that the linear discriminant functions\[\delta_k(x) = x^T\Sigma^{-1}\mu_k - \frac{1}{2}\mu^T_k\Sigma^{-1}\mu_k + \log\pi_k\]


are an equivalent description of the decision rule, with $G(x) = \arg\max_k \delta_k(x)$. In practice we do not know the parameters of the Gaussian distributions, and will need to estimate them from our training data (a small end-to-end sketch follows the estimates below):
$\bullet\ \hat{\pi}_k = \frac{N_k}{N}$, where $N_k$ is the number of class-k observations;
$\bullet\ \hat{\mu}_k = \sum_{g_i = k} \frac{x_i}{N_k}$;
$\bullet\ \hat{\Sigma} = \sum_{k = 1}^{K}\sum_{g_i = k} \frac{(x_i-\hat{\mu}_k)(x_i-\hat{\mu}_k)^T}{N-K}$.
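To tie the pieces together, here is a small end-to-end sketch (illustrative code, not from the original post) that computes the estimates above and applies the rule $G(x) = \arg\max_k \delta_k(x)$; fit_lda, discriminants and predict are hypothetical helper names:

import numpy as np

def fit_lda(X, y):
    # Estimate priors pi_hat_k, means mu_hat_k and the pooled covariance Sigma_hat
    classes = np.unique(y)
    N, p = X.shape
    K = len(classes)
    priors = np.array([np.mean(y == k) for k in classes])          # N_k / N
    means = np.array([X[y == k].mean(axis=0) for k in classes])    # sum of x_i over class k, divided by N_k
    Sigma = np.zeros((p, p))
    for k, mu in zip(classes, means):
        diff = X[y == k] - mu
        Sigma += diff.T @ diff                                      # sum of (x_i - mu_k)(x_i - mu_k)^T
    Sigma /= (N - K)
    return classes, priors, means, Sigma

def discriminants(x, priors, means, Sigma):
    # delta_k(x) = x^T Sigma^{-1} mu_k - 0.5 mu_k^T Sigma^{-1} mu_k + log pi_k, for every k at once
    Sigma_inv_mu = np.linalg.solve(Sigma, means.T)                  # columns are Sigma^{-1} mu_k
    return x @ Sigma_inv_mu - 0.5 * np.sum(means.T * Sigma_inv_mu, axis=0) + np.log(priors)

def predict(x, classes, priors, means, Sigma):
    # G(x) = argmax_k delta_k(x)
    return classes[np.argmax(discriminants(x, priors, means, Sigma))]

# Tiny usage example with two synthetic Gaussian classes sharing a covariance matrix
rng = np.random.default_rng(0)
X = np.vstack([rng.normal([0, 0], 1, (50, 2)), rng.normal([2, 2], 1, (50, 2))])
y = np.array([0] * 50 + [1] * 50)
classes, priors, means, Sigma = fit_lda(X, y)
print(predict(np.array([1.5, 1.8]), classes, priors, means, Sigma))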
