Since the class label takes values in a discrete set, we can always divide the input space
into a collection of regions, one for each predicted class.
Let us assume that we are given K classes with labels 1, 2, . . . , K, and that each class has a linear score function $f_k(x) = \beta_{k0} + \beta_k^T x$.
Suppose we want to compare classes k and l. The decision boundary between them is the set of points for which $f_k(x) = f_l(x)$.
The same holds for any pair of classes, so the input space is divided into regions of constant classification, with
piecewise hyperplanar decision boundaries.
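To make the hyperplane explicit, the boundary between classes k and l can be written as the affine set
\[\{x : (\beta_{k0} - \beta_{l0}) + (\beta_k - \beta_l)^T x = 0\},\]
which is a hyperplane in p dimensions.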
Decision theory for classification tells
us that we need to know the class posteriors $Pr(G|X)$ for optimal
classification. Suppose $f_k(x)$ is the class-conditional density of X in
class $G = k$, and let $\pi_k$ be the prior probability of class k, with $\sum_{k=1}^{K}\pi_k = 1$.
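A simple application of Bayes' theorem then expresses the posterior in terms of the class densities and priors:
\[Pr(G = k | X = x) = \frac{f_k(x)\pi_k}{\sum_{l=1}^{K} f_l(x)\pi_l}\]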
Suppose that we model
each class density as a multivariate Gaussian: \[f_k(x) = \frac{1}{(2\pi )^{p/2} |\Sigma_k|^{1/2}}e^{-\frac{1}{2}(x-\mu_k)^T\Sigma_{k}^{-1}(x-\mu_{k})}\]
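As a minimal numerical sketch (my own illustration, not part of the original derivation), this density can be evaluated directly; here `mu_k` and `Sigma_k` are assumed to be a NumPy mean vector and covariance matrix:

```python
import numpy as np

def gaussian_density(x, mu_k, Sigma_k):
    """Evaluate the multivariate Gaussian density f_k(x) at a point x."""
    p = len(mu_k)
    diff = x - mu_k
    # Quadratic form (x - mu_k)^T Sigma_k^{-1} (x - mu_k), computed via a linear solve
    quad = diff @ np.linalg.solve(Sigma_k, diff)
    norm_const = (2 * np.pi) ** (p / 2) * np.sqrt(np.linalg.det(Sigma_k))
    return np.exp(-0.5 * quad) / norm_const
```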
LDA arises in the special
case when we assume that the classes have a common covariance matrix, i.e.
$\Sigma_k = \Sigma \ \forall k$.
In
comparing two classes k and l, it is sufficient to look at the
log-ratio of their posteriors.
We
see that \[\log\frac{Pr(G=k|X = x)}{Pr(G=l|X = x)} = \log\frac{f_k(x)}{f_l(x)}+\log\frac{\pi_k}{\pi_l}\]\[=\log\frac{\pi_k}{\pi_l}-\frac{1}{2}(\mu_k+\mu_l)^T\Sigma^{-1}(\mu_k-\mu_l)+ x^T\Sigma^{-1}(\mu_k-\mu_l)\]
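To see where the second line comes from, substitute the Gaussian densities with common covariance $\Sigma$ into the ratio $f_k(x)/f_l(x)$:
\[\log\frac{f_k(x)}{f_l(x)} = -\frac{1}{2}(x-\mu_k)^T\Sigma^{-1}(x-\mu_k)+\frac{1}{2}(x-\mu_l)^T\Sigma^{-1}(x-\mu_l).\]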
The resulting equation is linear in x: the equal covariance matrices cause the normalization factors to
cancel, as well as the parts of the exponents quadratic in x. This linear log-odds
function implies that the decision boundary between classes k and l, the set
where $Pr(G = k|X = x) = Pr(G = l|X = x)$, is linear in x; in p
dimensions it is a hyperplane.
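Concretely, setting the log-odds to zero gives the boundary in the familiar form $a^T x + b = 0$, with
\[a = \Sigma^{-1}(\mu_k-\mu_l), \qquad b = \log\frac{\pi_k}{\pi_l}-\frac{1}{2}(\mu_k+\mu_l)^T\Sigma^{-1}(\mu_k-\mu_l).\]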
The figure shows an idealized example with three classes and p = 2. Here the data
do arise from three Gaussian distributions with a common covariance matrix.
We see that
the linear discriminant functions \[\delta_k(x) = x^T\Sigma^{-1}\mu_k - \frac{1}{2}\mu^T_k\Sigma^{-1}\mu_k + \log\pi_k\]
are an
equivalent description of the decision rule, with $G(x) = \arg\max_k \delta_k(x)$. In practice we
do not know the parameters of the Gaussian distributions, and will need
to estimate them using our training data (a short code sketch follows the list below):
$\bullet\ \hat{\pi}_k = \frac{N_k}{N}$, where $N_k$ is the number of class-k observations;
$\bullet\ \hat{\mu}_k = \sum_{g_i = k} \frac{x_i}{N_k}$;
$\bullet\ \hat{\Sigma} = \sum_{k = 1}^{K}\sum_{g_i = k} \frac{(x_i-\hat{\mu}_k)(x_i-\hat{\mu}_k)^T}{N-K}$.
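The following is a minimal NumPy sketch of these estimates and the resulting classification rule; it is my own illustration (not code from the original text), assuming X is an N x p array of inputs and y an array of integer class labels:

```python
import numpy as np

def fit_lda(X, y):
    """Estimate LDA parameters: class priors, class means, and pooled covariance."""
    classes = np.unique(y)
    N, p = X.shape
    K = len(classes)
    priors = np.array([np.mean(y == k) for k in classes])        # pi_hat_k = N_k / N
    means = np.array([X[y == k].mean(axis=0) for k in classes])  # mu_hat_k
    Sigma = np.zeros((p, p))                                     # pooled covariance Sigma_hat
    for k, mu in zip(classes, means):
        diff = X[y == k] - mu
        Sigma += diff.T @ diff
    Sigma /= (N - K)
    return classes, priors, means, Sigma

def predict_lda(x, classes, priors, means, Sigma):
    """Classify a single point x by the largest linear discriminant delta_k(x)."""
    Sigma_inv_mu = np.linalg.solve(Sigma, means.T).T             # rows are Sigma^{-1} mu_k
    deltas = (x @ Sigma_inv_mu.T
              - 0.5 * np.sum(means * Sigma_inv_mu, axis=1)
              + np.log(priors))
    return classes[np.argmax(deltas)]
```

For each class this accumulates the prior, the mean, and the pooled within-class scatter; classification then simply picks the class with the largest $\delta_k(x)$.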