
Linear Discriminant Analysis (LDA)


If the data points take values in a discrete set of classes, we can always divide the input space into a collection of regions labelled according to the classification.
Let us assume that we are given K classes with labels 1, 2, . . . , K, and that each class k is scored by a linear function $f_k(x) = \beta_{k0} + \beta_k^T x$. The decision boundary between classes k and l is the set of points for which $f_k(x) = f_l(x)$. Since this is true for any pair of classes, the input space is divided into regions of constant classification, with piecewise hyperplanar decision boundaries.
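Writing the boundary out explicitly, the set of points where the two linear scores agree is\[\{x : (\beta_{k0} - \beta_{l0}) + (\beta_k - \beta_l)^T x = 0\},\]which is an affine set, i.e. a hyperplane, in the input space.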
Decision theory for classification tells us that we need to know the class posteriors Pr(G|X) for optimal classification. Suppose $f_k(x)$ is the class-conditional density of X in class G = k, and let $\pi_k$ be the prior probability of class k, with $\sum_{k=1}^{K}\pi_k = 1$. Bayes' theorem then gives\[Pr(G=k|X = x) = \frac{f_k(x)\pi_k}{\sum_{l=1}^{K}f_l(x)\pi_l}\]
Suppose that we model each class density as a multivariate Gaussian:\[f_k(x) = \frac{1}{(2\pi )^{p/2} |\Sigma_k|^{1/2}}e^{-\frac{1}{2}(x-\mu_k)^T\Sigma_{k}^{-1}(x-\mu_{k})}\]
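As a quick sketch (not from the original text), the density above can be evaluated directly with NumPy; x, mu and Sigma below are placeholder names for a query point, a class mean and a class covariance matrix:

import numpy as np

def gaussian_density(x, mu, Sigma):
    # Multivariate Gaussian density f_k(x) with mean mu and covariance Sigma
    p = len(mu)
    diff = x - mu
    norm_const = (2 * np.pi) ** (p / 2) * np.sqrt(np.linalg.det(Sigma))  # (2 pi)^(p/2) |Sigma|^(1/2)
    quad = diff @ np.linalg.solve(Sigma, diff)                           # (x - mu)^T Sigma^{-1} (x - mu)
    return np.exp(-0.5 * quad) / norm_const

# Example: evaluate a 2-dimensional class density at one point
x = np.array([1.0, 0.5])
mu = np.zeros(2)
Sigma = np.eye(2)
print(gaussian_density(x, mu, Sigma))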


LDA arises in the special case when we assume that the classes share a common covariance matrix, i.e.
$\Sigma_k = \Sigma$ for all k.
To compare two classes k and l, it is sufficient to look at the log-ratio of their posteriors.
We see that\[\log\frac{Pr(G=k|X = x)}{Pr(G=l|X = x)} = \log\frac{f_k(x)}{f_l(x)}+\log\frac{\pi_k}{\pi_l}\]\[=\log\frac{\pi_k}{\pi_l}-\frac{1}{2}(\mu_k+\mu_l)^T\Sigma^{-1}(\mu_k-\mu_l)+ x^T\Sigma^{-1}(\mu_k-\mu_l)\]
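A minimal sketch of this log-odds computation, assuming the class means, the shared covariance matrix and the priors are already known (the function and variable names here are illustrative, not from the text):

import numpy as np

def log_odds(x, mu_k, mu_l, Sigma, pi_k, pi_l):
    # log Pr(G=k|X=x) / Pr(G=l|X=x) under the common-covariance Gaussian model
    Sigma_inv_diff = np.linalg.solve(Sigma, mu_k - mu_l)   # Sigma^{-1} (mu_k - mu_l)
    return (np.log(pi_k / pi_l)
            - 0.5 * (mu_k + mu_l) @ Sigma_inv_diff
            + x @ Sigma_inv_diff)

A positive value favours class k, a negative value favours class l, and the set where the log-odds equal zero is exactly the linear decision boundary discussed next.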


This equation is linear in x. The equal covariance matrices cause the normalization factors to cancel, as well as the quadratic parts of the exponents. The linear log-odds function implies that the decision boundary between classes k and l, the set where Pr(G = k|X = x) = Pr(G = l|X = x), is linear in x; in p dimensions it is a hyperplane.

As an idealized example, consider three classes with p = 2, where the data arise from three Gaussian distributions with a common covariance matrix.

We see that the linear discriminant functions\[\delta_k(x) = x^T\Sigma^{-1}\mu_k - \frac{1}{2}\mu^T_k\Sigma^{-1}\mu_k + \log\pi_k\]


are an equivalent description of the decision rule, with $G(x) = \arg\max_k \delta_k(x)$. In practice we do not know the parameters of the Gaussian distributions, and will need to estimate them from our training data (a small end-to-end sketch follows the estimates below):
$\bullet\ \hat{\pi}_k = \frac{N_k}{N}$, where $N_k$ is the number of class-k observations;
$\bullet\ \hat{\mu}_k = \sum_{g_i = k} \frac{x_i}{N_k}$;
$\bullet\ \hat{\Sigma} = \sum_{k = 1}^{K}\sum_{g_i = k} \frac{(x_i-\hat{\mu}_k)(x_i-\hat{\mu}_k)^T}{N-K}$.
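To tie the pieces together, here is a small end-to-end sketch (illustrative code, not from the original post) that computes the estimates above and applies the rule $G(x) = \arg\max_k \delta_k(x)$; fit_lda, discriminants and predict are hypothetical helper names:

import numpy as np

def fit_lda(X, y):
    # Estimate priors pi_hat_k, means mu_hat_k and the pooled covariance Sigma_hat
    classes = np.unique(y)
    N, p = X.shape
    K = len(classes)
    priors = np.array([np.mean(y == k) for k in classes])          # N_k / N
    means = np.array([X[y == k].mean(axis=0) for k in classes])    # sum of x_i over class k, divided by N_k
    Sigma = np.zeros((p, p))
    for k, mu in zip(classes, means):
        diff = X[y == k] - mu
        Sigma += diff.T @ diff                                      # sum of (x_i - mu_k)(x_i - mu_k)^T
    Sigma /= (N - K)
    return classes, priors, means, Sigma

def discriminants(x, priors, means, Sigma):
    # delta_k(x) = x^T Sigma^{-1} mu_k - 0.5 mu_k^T Sigma^{-1} mu_k + log pi_k, for every k at once
    Sigma_inv_mu = np.linalg.solve(Sigma, means.T)                  # columns are Sigma^{-1} mu_k
    return x @ Sigma_inv_mu - 0.5 * np.sum(means.T * Sigma_inv_mu, axis=0) + np.log(priors)

def predict(x, classes, priors, means, Sigma):
    # G(x) = argmax_k delta_k(x)
    return classes[np.argmax(discriminants(x, priors, means, Sigma))]

# Tiny usage example with two synthetic Gaussian classes sharing a covariance matrix
rng = np.random.default_rng(0)
X = np.vstack([rng.normal([0, 0], 1, (50, 2)), rng.normal([2, 2], 1, (50, 2))])
y = np.array([0] * 50 + [1] * 50)
classes, priors, means, Sigma = fit_lda(X, y)
print(predict(np.array([1.5, 1.8]), classes, priors, means, Sigma))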
