
Friday, December 18, 2015

Expectation-Maximization Algorithm

This activity focused on the use of the Expectation-Maximization (EM) Algorithm to compute the means and variances of an unknown set of data obtained from a multimodal distribution. The figure below shows an example of a multimodal distribution.

Figure 1. Example of a multimodal distribution with N = 3000 samples and J = 3 Gaussians, with the following parameters:

The EM Algorithm is essentially an iterative method for finding maximum-likelihood estimates of parameters from incomplete data.

Suppose we do not know anything about the parameters that make up the multimodal set of data shown in Figure 1. We first assume the number of Gaussian distributions present in the two-dimensional space; in my work, I used J = 3 (3 clusters/Gaussian distributions). We also need to impose an initial parameter guess $\theta(0) = [\mu, \sigma^2, P]$, where $\mu$, $\sigma^2$, and $P$ are the corresponding mean, variance, and a priori probability for each of the J clusters.

Then, we have to compute for the conditional probability given by:
$$p(x_k| j ; \theta) = \frac{1}{(2\pi\sigma_j^2)^{\ell/2}} \exp\left(-\frac{\| x_k - \mu_j \|^2}{2\sigma_j^2}\right)$$

where $\ell$ is the dimension of the space (in our case, $\ell = 2$). Then, we substitute this into:

$$ P(j | x_k; \Theta(t)) = \frac{p(x_k| j ; \theta(t)) P_j(t)}{p(x_k; \Theta(t))} $$

where $p(x_k; \Theta(t))$ is taken for normalization, so that:
$$p(x_k; \Theta(t)) = \sum_{j=1}^J p(x_k| j ; \theta(t)) P_j(t) $$
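My own implementation was in Matlab; as a minimal language-neutral sketch (variable names are mine), this E-step for a single sample looks like:

```python
import math

def posteriors(x, mus, vars_, priors, ell=2):
    """E-step for one sample x: weight each isotropic Gaussian density
    p(x | j; theta) by its prior P_j, then divide by p(x; Theta) so the
    posteriors over the J components sum to 1."""
    weighted = []
    for mu, var, P in zip(mus, vars_, priors):
        sq = sum((xi - mi) ** 2 for xi, mi in zip(x, mu))
        dens = math.exp(-sq / (2 * var)) / (2 * math.pi * var) ** (ell / 2)
        weighted.append(P * dens)
    total = sum(weighted)  # p(x; Theta), the normalization term
    return [w / total for w in weighted]
```

For a sample sitting on one component's mean, far from the other, the posterior concentrates almost entirely on that component.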

We then use the preceding equations to compute the means, variances, and a priori probabilities of the Gaussian mixture by performing the maximization step as follows:

We take:
$$\mu_j (t+1) = \frac{\sum_{k=1}^N P(j | x_k; \Theta(t)) x_k}{\sum_{k=1}^N P(j | x_k; \Theta(t))} $$
$$\sigma_j^2 (t+1) = \frac{\sum_{k=1}^N P(j | x_k; \Theta(t)) ||x_k - \mu_j(t+1)||^2}{\ell \sum_{k=1}^N P(j | x_k; \Theta(t))} $$
$$P_j (t+1) = \frac{1}{N} \sum_{k=1}^N P(j | x_k; \Theta(t)) $$
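As a sketch of this maximization step (again in Python for illustration; my implementation was in Matlab), with `post[k][j]` holding the posteriors $P(j | x_k; \Theta(t))$ from the E-step:

```python
def m_step(X, post, ell=2):
    """M-step: re-estimate means, variances and priors from the posteriors.
    X is a list of ell-dimensional points; post[k][j] = P(j | x_k; Theta(t))."""
    N, J = len(X), len(post[0])
    # effective number of samples assigned to each component
    sums = [sum(post[k][j] for k in range(N)) for j in range(J)]
    # posterior-weighted means
    mus = [[sum(post[k][j] * X[k][d] for k in range(N)) / sums[j]
            for d in range(ell)] for j in range(J)]
    # posterior-weighted isotropic variances (divided by ell, as in the text)
    vars_ = [sum(post[k][j] * sum((X[k][d] - mus[j][d]) ** 2 for d in range(ell))
                 for k in range(N)) / (ell * sums[j]) for j in range(J)]
    priors = [sums[j] / N for j in range(J)]
    return mus, vars_, priors
```

With hard (0/1) posteriors this reduces to the usual per-cluster sample means and variances, which is a quick sanity check.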

The equations shown above constitute the EM Algorithm: we iterate the parameter vector $\theta$ until the change between successive estimates falls below a threshold $\epsilon$, that is, until
$$ ||\theta(t+1)-\theta(t)|| < \epsilon. $$

The threshold $\epsilon$ therefore sets the accuracy of the result, depending on the value imposed by the user. A smaller $\epsilon$ simply corresponds to more iterations.
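Putting the E-step, M-step, and stopping rule together, here is a compact one-dimensional sketch with hypothetical data and J = 2 (my actual run was two-dimensional and done in Matlab):

```python
import math

def em_1d(xs, mus, vars_, Ps, eps=1e-6, max_iter=200):
    """Iterate E- and M-steps on scalar data until the parameter change
    drops below the threshold eps."""
    for _ in range(max_iter):
        # E-step: posteriors P(j | x_k; Theta(t)) for every sample
        post = []
        for x in xs:
            w = [P / math.sqrt(2 * math.pi * v) * math.exp(-(x - m) ** 2 / (2 * v))
                 for m, v, P in zip(mus, vars_, Ps)]
            s = sum(w)
            post.append([wi / s for wi in w])
        # M-step: re-estimate theta = (mu, sigma^2, P)
        sums = [sum(p[j] for p in post) for j in range(len(mus))]
        new_mus = [sum(p[j] * x for p, x in zip(post, xs)) / sums[j]
                   for j in range(len(mus))]
        new_vars = [sum(p[j] * (x - new_mus[j]) ** 2 for p, x in zip(post, xs)) / sums[j]
                    for j in range(len(mus))]
        new_Ps = [s / len(xs) for s in sums]
        # stopping rule: ||theta(t+1) - theta(t)|| < eps
        change = math.sqrt(sum((a - b) ** 2 for a, b in
                               zip(new_mus + new_vars + new_Ps, mus + vars_ + Ps)))
        mus, vars_, Ps = new_mus, new_vars, new_Ps
        if change < eps:
            break
    return mus, vars_, Ps

# two well-separated clusters around 0 and 5, with deliberately rough guesses
data = [-0.2, 0.0, 0.1, 0.2, 4.8, 5.0, 5.1, 5.3]
mus, vars_, Ps = em_1d(data, [1.0, 4.0], [1.0, 1.0], [0.5, 0.5])
```

On this toy data the estimated means settle near 0 and 5 within a handful of iterations.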

I implemented the EM Algorithm in Matlab and obtained the following parameter estimates for Figure 1.

These parameters are good estimates with minimal error. If we want a smaller error in the estimates, we simply increase the number of iterations by decreasing the value of $\epsilon$.

The error vs the number of iterations made by the algorithm is shown below.

Figure 2. Error vs the number of iterations for (a) linear scale and (b) logarithmic scale.

At around the 15th iteration, the error stabilizes. However, looking at the logarithmically scaled plot, a peak forms at a certain number of iterations (around 50 to 60) before the error eventually falls.

I tried to change the variance of the dataset so that the tendency for each cluster to overlap is higher, but I still got good estimates. 

Figure 3. Another trial with higher variance. I got the following estimates, which are not bad considering that the data have many overlaps.

Naturally, a higher number of iterations is needed for the program to reach a good estimate, as indicated by the following plot of error vs. iterations.
Figure 4. Error vs. iterations for trial 2 (larger variance)



References
[1] S. Theodoridis. Pattern Recognition. Academic Press, United Kingdom, 1999.


Mapping of school-level frequency of disaster occurrence and NAT Scores using Neural Networks

Our task in this activity is to play around with neural networks and use them for any fitting or pattern recognition application. Since I work part-time as a research assistant in a university-based project on interdisciplinary research about the complexity of Philippine public schools, I have with me an almost complete dataset of public schools in the Philippines, together with their yearly counts of disaster experiences, evacuation center usage, dropouts, graduations, and even NAT (National Achievement Test) scores. The main plan was to develop a model that describes the resilience of public schools to natural disasters. Intuitively, one would think that if a school always experiences flooding or any other type of disaster, then class disruptions would result in a lower overall performance of the school. This hypothesis, however, is not readily observed if one only looks at the national level: national-level data is noisy and shows zero variance.

My task in the project was to clean the data and look for correlations between variables in the dataset. For a correlation between datasets to exist, it is important to first check the corresponding variances. While national-level analyses do not show a relationship between disaster experience and test scores, sub-national levels give a more meaningful result; at the national level, these results are often masked by other noisy data.

In this work, I will look at the relationship of the frequency of experienced typhoons and overall disasters, and the number of times the school was used as an evacuation center, to the performance of the schools in the NAT. In particular, I will focus on data from the province of Leyte.

Will the neural network be able to successfully predict the probable performance of a school given the disaster variables?

I used the existing neural network fitting tool in Matlab, which offers three training algorithms. For each algorithm, I varied the number of hidden neurons and observed the error distribution.
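Since I relied on Matlab's built-in tool, no training code appears in this post. Purely as an illustrative sketch of what "fitting with a neural network" means, here is a minimal one-hidden-layer regressor trained by plain gradient descent (not Levenberg-Marquardt or conjugate gradient; the toy target below stands in for the school data):

```python
import math, random

def train_tiny_net(data, n_hidden=4, lr=0.05, epochs=3000, seed=0):
    """Fit scalar pairs (x, y) with a 1-hidden-layer network (tanh hidden
    units, linear output) using per-sample gradient descent on squared error."""
    rng = random.Random(seed)
    w1 = [rng.uniform(-1, 1) for _ in range(n_hidden)]  # input -> hidden
    b1 = [0.0] * n_hidden
    w2 = [rng.uniform(-1, 1) for _ in range(n_hidden)]  # hidden -> output
    b2 = 0.0
    for _ in range(epochs):
        for x, y in data:
            h = [math.tanh(w1[i] * x + b1[i]) for i in range(n_hidden)]
            err = sum(w2[i] * h[i] for i in range(n_hidden)) + b2 - y
            for i in range(n_hidden):
                grad_h = err * w2[i] * (1 - h[i] ** 2)  # backprop through tanh
                w2[i] -= lr * err * h[i]
                w1[i] -= lr * grad_h * x
                b1[i] -= lr * grad_h
            b2 -= lr * err
    def predict(x):
        h = [math.tanh(w1[i] * x + b1[i]) for i in range(n_hidden)]
        return sum(w2[i] * h[i] for i in range(n_hidden)) + b2
    return predict

# toy stand-in for the school data: fit y = x^2 on [-1, 1]
f = train_tiny_net([(x / 10.0, (x / 10.0) ** 2) for x in range(-10, 11)])
```

Matlab's trainers converge far faster than this plain gradient descent, which is exactly the speed/memory trade-off between algorithms mentioned below.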

Training algorithms:

A. Levenberg-Marquardt

Here, I used the default number of hidden neurons, which is 10.


Figure A1. Left: Error histogram. Right: Regression analysis for training, validation and tests. 
The error distribution is Gaussian in shape with a mean at about -2.

I tried changing the number of hidden neurons to 100, and I got the following error histogram.
Figure A2. Error histogram for hidden neurons = 100. 

The mean is still around 2.3, but an outlier is observed at around 109.9, thereby increasing the variance in the distribution.


 B. Scaled Conjugate Gradient
Figure C1.  Left: Error histogram. Right: Regression analysis for training, validation and tests. 

Figure C2. Error histogram for hidden neuron = 100, with the network retrained multiple times. 

Training multiple times generates different results due to different initial conditions and sampling. As can be seen from Figure C2, repeatedly retraining the network does not always lead to better results.

It should be noted that these learning algorithms differ in the speed of training and the amount of memory used. The number of hidden neurons is sometimes increased to improve the performance of the network. However, when the data consist of a simpler, smaller number of points, it is perhaps better to use a smaller number of hidden neurons. Based on what I've read, the number of hidden neurons should roughly be between the size of the input and the size of the output [1].
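As a trivial sketch of that rule of thumb (my own phrasing, not code from [1]):

```python
def hidden_neuron_range(n_inputs, n_outputs):
    """Rule-of-thumb bounds for the hidden-layer size: somewhere between the
    size of the output layer and the size of the input layer."""
    lo, hi = sorted((n_inputs, n_outputs))
    return lo, hi

# e.g. 3 disaster variables in, 1 NAT score out -> try 1 to 3 hidden neurons
suggested = hidden_neuron_range(3, 1)
```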

As can be observed from the regression  plots and error histograms, the neural network can indeed predict to a moderate level of accuracy the possible test score of a school given the frequency of disasters it experienced. 


[1] Jeff Heaton. Introduction to Neural Networks for Java.




Tuesday, December 15, 2015

Color Feature Extraction as a Method to Track Skin Color

When dealing with image retrieval, indexing, and classification, perhaps the easiest feature to use is color, since it is visually easy to inspect. However, its wide spectrum, together with the unavoidable changes in illumination during experimentation, gives rise to certain difficulties that need to be addressed.

In this activity, the goal is to track our face in a real-time video using our skin as a feature. The additional challenge is that the tracking has to succeed even if the illumination varies from time to time. As I have previously discussed in the segmentation activity in our Applied Physics 186 course, the success of the segmentation is highly dependent on the region of interest (ROI) used in constructing the 2D histogram below:

Figure 1. Normalized Chromaticity Coordinates (NCC) color space (left) and the corresponding 2D histogram of my skin (right)
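The core of the method, converting pixels to NCC and backprojecting a histogram built from the ROI, can be sketched as follows (the pixel values and names here are hypothetical; my actual implementation differed):

```python
def to_ncc(pixel):
    """(R, G, B) -> normalized chromaticity (r, g); brightness divides out."""
    R, G, B = pixel
    s = R + G + B
    if s == 0:
        return (1 / 3, 1 / 3)  # treat pure black as achromatic
    return (R / s, G / s)

def ncc_histogram(roi_pixels, bins=32):
    """Normalized 2-D histogram of (r, g) chromaticities over the ROI."""
    hist = [[0.0] * bins for _ in range(bins)]
    for p in roi_pixels:
        r, g = to_ncc(p)
        hist[min(int(r * bins), bins - 1)][min(int(g * bins), bins - 1)] += 1
    n = len(roi_pixels)
    return [[v / n for v in row] for row in hist]

def backproject(pixel, hist, bins=32):
    """Likelihood that a pixel's chromaticity belongs to the ROI histogram."""
    r, g = to_ncc(pixel)
    return hist[min(int(r * bins), bins - 1)][min(int(g * bins), bins - 1)]
```

Because the (r, g) coordinates divide out overall brightness, pixels of the same skin under moderately different illumination still fall in nearby histogram bins, which is why this feature is attractive for tracking.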



If we use a region of interest that corresponds to only a single illumination, the program most likely won't be able to track the skin successfully. To demonstrate this, I tried two types of ROI. Let's first use a single illumination as the ROI:

(a) Single Illumination in ROI

Figure 2. One region of my face at a single illumination

Using Figure 2 as ROI, I arrived at the following result:


Video 1. Demonstration of face tracking using a single representative ROI under a single illumination

As one may easily observe, the bounding box switches from small to big, suggesting that only a portion of my face was detected. At the end of the demonstration, it was able to track the whole face, since the illumination by then was similar to that of the ROI used.


(b) Varying illumination in ROI

Figure 3. Several regions of my skin at various illuminations.


Using representative ROIs from different frames of the video, I was able to track the whole of my face, unlike the previous result, where only a single cheek was enclosed by the bounding box.


As much as I want to post a real-time face tracker here, I'm still in the process of figuring out how to post user-defined media on Blogger. So in the meantime, please bear with the demos first. Ciao! :)

Sunday, December 13, 2015

Class Separability Measures and ROC Curves

In the introduction to Chapter 5 (Feature Selection), Theodoridis mentions that in order to create a good classifier, one must deal with the "curse of dimensionality". In many cases, one is faced with several features, sometimes far more than the number of classes; using all of them only increases the computational complexity, with very little gain. To avoid this, we must first determine how separated the classes are in the features that we have. Thus, we seek to understand and answer the following question:
Given a set of data consisting of different classes with corresponding features, how do we measure the degree of their separability?
Quantitatively, we aim to look for features that lead to large between-class distance and small within-class variance.

I. Class Separability Measures

The goal of the first part of the activity is to identify different class separability measures based on how the data are scattered in the $\ell$-dimensional feature space.
For this activity, I generated the following scatter points.
Figure 1. An example of data scattered in a two-dimensional space: three classes with their corresponding means and covariance matrices.

The class separability measures are given by the following:
  • Within-class scatter matrix
$$S_w = \sum_{i=1}^M P_i S_i $$
where $S_i$ and $P_i$ are the covariance matrix and a priori probability of class $\omega_i$, respectively. We can compute $P_i$ by dividing the number of samples in each class by the overall number of data points, so that $P_i = \frac{n_i}{N}$.
From here, it is easy to see that $trace[S_w]$ gives the average variance over all classes.
  • Between-class scatter matrix
$$S_b = \sum_{i=1}^M P_i (\mu_i - \mu_O)(\mu_i - \mu_O)^T $$

where $\mu_O$ is the overall mean of the samples, or the global mean vector. This is given by:
$$\mu_O = \sum_{i=1}^M P_i \mu_i $$

The $trace[S_b]$ gives the average distance of the mean of each individual class from the global mean.
  • Mixture scatter matrix
$$S_m = E [ (\textbf{x} - \mu_O)(\textbf{x} - \mu_O)^T ] $$

which gives us the covariance matrix of the feature vector with respect to the global mean. Alternatively, one can compute $S_m$ by just adding $S_w$ and $S_b$, so that:
$$S_m = S_w + S_b $$.

In parallel with the previous measures, we see that $trace[S_m]$ gives the sum of the variances of the features around the global mean.
I got the following values of $S_b$ and $S_w$.

From these definitions, a new set of criteria is defined to quantify how scattered the data points are with respect to each other and within each class. These are $J_1$, $J_2$, and $J_3$, defined as follows:
$$J_1 = \frac{trace[S_m]}{trace[S_w]} $$
$$J_2 = \frac{|S_m|}{|S_w|} $$
$$J_3 = trace[S_w^{-1} S_m] $$

which have the following values:

From the definitions of $S_m$ and $S_w$, the values of $J_1$, $J_2$, and $J_3$ should tell us something about how scattered the data are.
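These scatter matrices and the $J_1$ criterion can be sketched in Python for two-dimensional data (the toy points below are mine; my computation was done in Matlab):

```python
def mean_vec(xs):
    """Mean vector of a list of 2-D points."""
    n = len(xs)
    return [sum(x[d] for x in xs) / n for d in range(len(xs[0]))]

def cov(xs, mu):
    """2x2 maximum-likelihood covariance matrix (divide by n)."""
    n = len(xs)
    c = [[0.0, 0.0], [0.0, 0.0]]
    for x in xs:
        d = [x[0] - mu[0], x[1] - mu[1]]
        for i in range(2):
            for j in range(2):
                c[i][j] += d[i] * d[j] / n
    return c

def separability(classes):
    """Within- and between-class scatter matrices and the J1 criterion for a
    list of classes, each a list of 2-D points."""
    N = sum(len(c) for c in classes)
    mus = [mean_vec(c) for c in classes]
    Ps = [len(c) / N for c in classes]      # a priori probabilities n_i / N
    mu0 = [sum(P * m[d] for P, m in zip(Ps, mus)) for d in range(2)]
    Sw = [[0.0, 0.0], [0.0, 0.0]]
    Sb = [[0.0, 0.0], [0.0, 0.0]]
    for pts, m, P in zip(classes, mus, Ps):
        Si = cov(pts, m)
        for i in range(2):
            for j in range(2):
                Sw[i][j] += P * Si[i][j]
                Sb[i][j] += P * (m[i] - mu0[i]) * (m[j] - mu0[j])
    trace = lambda M: M[0][0] + M[1][1]
    J1 = (trace(Sw) + trace(Sb)) / trace(Sw)  # trace[Sm] / trace[Sw]
    return Sw, Sb, J1
```

Two tight, well-separated clusters give a very large $J_1$, matching the interpretation of the criteria discussed here.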

For comparison, I computed another set of $S$ and $J$ values for smaller variances with diagonal covariance matrix. This is for the following scattered data:
Figure 2. Second example of datasets scattered in a two dimensional space with smaller variance in each class.
The corresponding $J$-values for this is given by the following:

As can be seen from the J-values, larger values correspond to smaller within-class variance, so that the data in the $\ell$-dimensional space are well clustered around their respective means. It also means that the classes are well separated from each other.

II. ROC Curves

Another way of measuring how separated classes are is the receiver operating characteristic (ROC) curve. This technique gives us information about the overlap between classes. An example of an ROC curve is shown below:

Figure 3. (a) An example of an overlapped pdf (the other one is inverted), and the corresponding (b) ROC Curve (figure obtained from Ref. [1])

The pdfs in Figure 3 describe the distribution of a feature in two classes. The vertical line represents the chosen threshold, so that samples to its left are automatically assigned to class $\omega_1$ and those to its right to class $\omega_2$. The error associated with this assignment is given by the shaded regions in Figure 3(a).

In this part of the activity, we are tasked to generate two sets of N randomly distributed values corresponding to a single feature, sampled from any arbitrary distribution. In my case, I used a uniform distribution bounded between 0 and 1 to create a normal distribution via the inverse sampling method. Shown in the first column of the figure below are the two distributions with different overlaps, depending on the values of the respective means and variances (one distribution is inverted for easier inspection).
Figure 4. ROC Curves for various separations of two classes.
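The inverse-transform step used to generate these normal samples can be sketched with Python's standard library, where `statistics.NormalDist.inv_cdf` plays the role of the inverse CDF (my actual code was in Matlab):

```python
import random
from statistics import NormalDist

def normal_samples(n, mu, sigma, seed=0):
    """Inverse-transform sampling: draw u ~ Uniform(0, 1), then map it
    through the inverse CDF of N(mu, sigma)."""
    rng = random.Random(seed)
    nd = NormalDist(mu, sigma)
    return [nd.inv_cdf(rng.random()) for _ in range(n)]
```

Because the CDF is monotonic, pushing uniform draws through its inverse reproduces the target distribution exactly, which is all the method requires.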

The ROC curve varies with the amount of overlap between the distributions. The larger the overlap, the smaller the area bounded in the ROC. In the first example [Fig. 4(a)], the two distributions completely overlap, leading to an almost zero bounded area in the respective right panel. At the other extreme, where no overlap between the distributions is observed, the whole upper triangle is shaded.
Thus the ROC curve gives a measure of the class discrimination capability of the specific feature being considered, with values ranging from 0 (complete overlap) to 1/2 (no overlap).
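This 0-to-1/2 measure can be computed directly from the two samples, since the area under the ROC curve equals the probability that a random draw from one class exceeds a random draw from the other. A sketch (function names are mine):

```python
def auc(class_a, class_b):
    """P(a > b) for random a from class_a, b from class_b (ties count half).
    This equals the area under the ROC curve over all thresholds."""
    wins = 0.0
    for a in class_a:
        for b in class_b:
            if a > b:
                wins += 1
            elif a == b:
                wins += 0.5
    return wins / (len(class_a) * len(class_b))

def overlap_measure(class_a, class_b):
    """ROC-based separability: 0 = complete overlap, 1/2 = no overlap."""
    return abs(auc(class_a, class_b) - 0.5)
```

Identical samples give 0, fully disjoint samples give 1/2, matching the two extremes described above.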


References:
[1] S. Theodoridis and K. Koutroumbas. Pattern Recognition, 4th Ed. Academic Press, United Kingdom, 2009.