
Sunday, December 13, 2015

Class Separability Measures and ROC Curves

In his introduction to Chapter 5 (Feature Selection), Theodoridis mentions that in order to create a good classifier, one must overcome the "curse of dimensionality". In many cases, one is faced with a large number of features, sometimes far greater than the number of classes. Using all of them only increases the computational complexity, with very little gain. To trim the feature set, we must first determine how well separated the classes are in the features that we have. Thus, we seek to understand and answer the following question:
Given a set of data consisting of different classes with corresponding features, how do we measure the degree of their separability?
Quantitatively, we aim to look for features that lead to large between-class distance and small within-class variance.

I. Class Separability Measures

The goal of the first part of the activity is to identify different class separability measures based on how the data is scattered in the l-dimensional feature space, where l is the number of features.
For this activity, I generated the scatter points shown in Figure 1.

Figure 1. Three classes scattered in a 2-dimensional feature space, each with its corresponding mean and covariance matrix.
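A minimal sketch of how such data could be generated in Python with NumPy follows; the means and covariance matrices below are placeholders, since the actual parameters were given only in the figure.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical class parameters (the original means and covariance
# matrices appeared only in Figure 1 and are not recoverable here).
means = [np.array([0.0, 0.0]),
         np.array([4.0, 0.0]),
         np.array([2.0, 4.0])]
covs = [np.eye(2), np.eye(2), np.eye(2)]
n_per_class = 100

# Draw n_per_class samples from each 2-D Gaussian class.
classes = [rng.multivariate_normal(m, c, n_per_class)
           for m, c in zip(means, covs)]
```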

The class separability measures are given by the following (a code sketch computing all three appears after the list):
  • Within-class scatter matrix
$$S_w = \sum_{i=1}^M P_i S_i $$
where $S_i$ and $P_i$ are the covariance matrix and the a priori probability of class $\omega_i$, respectively. We can compute $P_i$ by dividing the number of samples in class $\omega_i$ by the total number of data points, so that $P_i = \frac{n_i}{N}$.
From here, it is easy to see that $trace[S_w]$ gives the average of the feature variances over all classes.
  • Between-class scatter matrix
$$S_b = \sum_{i=1}^M P_i (\mu_i - \mu_O)(\mu_i - \mu_O)^T $$

where $\mu_O$ is the overall mean of the samples, or the global mean vector. This is given by:
$$\mu_O = \sum_{i=1}^M P_i \mu_i $$

The $trace[S_b]$ gives the average squared distance of the individual class means from the global mean.
  • Mixture scatter matrix
$$S_m = E[(\textbf{x} - \mu_O)(\textbf{x} - \mu_O)^T] $$

which gives us the covariance matrix of the feature vector with respect to the global mean. Alternatively, one can compute $S_m$ by just adding $S_w$ and $S_b$, so that:
$$S_m = S_w + S_b $$

In parallel with the previous measures, we see that $trace[S_m]$ gives the sum of the feature variances around the global mean.
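Under the definitions above, a short sketch computing all three scatter matrices (assuming the classes list from the earlier snippet) could look like this:

```python
import numpy as np

def scatter_matrices(classes):
    """Compute S_w, S_b, S_m from a list of (n_i x l) arrays, one per class."""
    N = sum(len(c) for c in classes)
    l = classes[0].shape[1]
    P = [len(c) / N for c in classes]          # a priori probabilities n_i / N
    mu = [c.mean(axis=0) for c in classes]     # class mean vectors
    mu_O = sum(p * m for p, m in zip(P, mu))   # global mean vector

    S_w = np.zeros((l, l))
    S_b = np.zeros((l, l))
    for c, p, m in zip(classes, P, mu):
        S_w += p * np.cov(c, rowvar=False, bias=True)   # within-class scatter
        d = (m - mu_O).reshape(-1, 1)
        S_b += p * (d @ d.T)                            # between-class scatter
    S_m = S_w + S_b                                     # mixture scatter
    return S_w, S_b, S_m
```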
From the data in Figure 1, I obtained the corresponding values of $S_b$ and $S_w$.

From these definitions, a new set of criteria is defined to quantify how scattered the data points are with respect to each other and within each class. These are $J_1$, $J_2$, and $J_3$, defined as follows:
$$J_1 = \frac{trace[S_m]}{trace[S_w]} $$
$$J_2 = \frac{|S_m|}{|S_w|} $$
$$J_3 = trace[S_w^{-1} S_m] $$

which have the following values:

From the definitions of $S_m$ and $S_w$, the values of $J_1$, $J_2$, and $J_3$ should tell us something about how scattered the data is.
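A sketch of the three criteria, assuming the scatter_matrices helper and the classes list from the snippets above:

```python
import numpy as np

S_w, S_b, S_m = scatter_matrices(classes)

J1 = np.trace(S_m) / np.trace(S_w)                 # ratio of traces
J2 = np.linalg.det(S_m) / np.linalg.det(S_w)       # ratio of determinants
J3 = np.trace(np.linalg.inv(S_w) @ S_m)            # trace of S_w^{-1} S_m
print(J1, J2, J3)
```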

For comparison, I computed another set of $S$ and $J$ values for smaller variances with diagonal covariance matrices. This is for the following scattered data:
Figure 2. Second example of datasets scattered in a two-dimensional space, with smaller variance in each class.
The corresponding $J$-values for this case are the following:

As can be seen from the $J$-values, larger values correspond to smaller within-class variance, so that the data in the l-dimensional space are well clustered around their respective means. It also means that the classes are well separated from each other.

II. ROC Curves

Another way of measuring how separated classes are is the receiver operating characteristic (ROC) curve. This technique provides information about the overlap between classes. An example of an ROC curve is shown below:

Figure 3. (a) An example of overlapping pdfs (one of which is inverted for clarity), and (b) the corresponding ROC curve (figure obtained from Ref. [1]).

The pdfs in Figure 3 describe the distribution of a feature in two classes. The vertical line represents the chosen threshold, so that samples to its left are assigned to class $\omega_1$ and those to its right to class $\omega_2$. The error associated with this decision is given by the shaded regions in Figure 3(a).

In this part of the activity, we are tasked to generate two sets of N values of a single feature, sampled from an arbitrary distribution. In my case, I used a uniform distribution bounded between 0 and 1 to create a normal distribution via the inverse transform sampling method. Shown in the first column of the figure below are the two distributions with different overlaps, depending on the values of the respective means and variances (one distribution is inverted for easier inspection).
Figure 4. ROC Curves for various separations of two classes.
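As a sketch of the inverse transform step described above, one can map uniform samples through the inverse normal CDF (scipy.stats.norm.ppf); the means and standard deviations below are illustrative, not the actual values used in Figure 4.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
N = 10_000

# Uniform samples on (0, 1) for each class
u1 = rng.uniform(size=N)
u2 = rng.uniform(size=N)

# Inverse transform sampling: x = F^{-1}(u) maps uniform samples
# into samples of the target normal distribution.
x1 = norm.ppf(u1, loc=0.3, scale=0.05)   # feature values, class omega_1
x2 = norm.ppf(u2, loc=0.6, scale=0.05)   # feature values, class omega_2
```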

The ROC curve varies with the amount of overlap between the distributions. The larger the overlap, the smaller the area bounded by the ROC curve. In the first example [Fig. 4(a)], the two distributions completely overlap, leading to an almost zero bounded area in the corresponding right panel. At the other extreme, where there is no overlap between the distributions, the whole upper triangle is shaded.
Thus, the ROC curve gives a measure of the class discrimination capability of the specific feature being considered, with the bounded area ranging from 0 (complete overlap) to 1/2 (no overlap).
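A minimal sketch of how the ROC curve can be traced by sweeping a threshold across the feature axis, assuming the x1 and x2 samples from the previous snippet:

```python
import numpy as np

# Sweep a threshold a over the feature axis; for each a, record the
# fraction of each class lying to its right.
thresholds = np.linspace(0.0, 1.0, 200)
alpha = np.array([(x1 > a).mean() for a in thresholds])           # class 1 error
one_minus_beta = np.array([(x2 > a).mean() for a in thresholds])  # class 2 hit rate

# Area between the ROC curve and the diagonal: ~0 for complete
# overlap, ~1/2 for perfectly separated classes.
order = np.argsort(alpha)
area = np.trapz(one_minus_beta[order], alpha[order]) - 0.5
print(area)
```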


References:
[1] S. Theodoridis and K. Koutroumbas, Pattern Recognition, 4th ed. United Kingdom: Academic Press, 2009.


