
Friday, December 18, 2015

Expectation-Maximization Algorithm

This activity focused on the use of the Expectation-Maximization (EM) Algorithm to compute the means and variances of an unknown set of data drawn from a multimodal distribution. The figure below shows an example of a multimodal distribution.

Figure 1. Example of a multimodal distribution consisting of N = 3000 samples and J = 3 clusters, with the following parameters:

The EM Algorithm is essentially an iterative method for finding maximum-likelihood estimates of parameters from incomplete or partially unknown data.

Suppose we do not know anything about the parameters that make up the multimodal data set shown in Figure 1. We assume a number of Gaussian distributions present in the two-dimensional space; in my work, I used J = 3 (3 clusters/Gaussian distributions). We also need to impose an initial parameter guess $\theta(0) = [\mu, \sigma^2, P]$, where $\mu$, $\sigma^2$ and $P$ are the corresponding means, variances, and a priori probabilities for each of the J clusters.

Then, we compute the conditional probability given by:
$$p(x_k | j; \theta) = \frac{1}{(2\pi\sigma_j^2)^{\ell/2}} \exp\left(-\frac{\| x_k - \mu_j \|^2}{2\sigma_j^2}\right)$$

where $\ell$ is the dimension of the space (in our case, $\ell = 2$). Then, we substitute this into:

$$ P(j | x_k; \Theta(t)) = \frac{p(x_k | j; \theta(t)) P_j(t)}{p(x_k; \Theta(t))} $$

where $p(x_k; \Theta(t))$ serves as the normalization:
$$p(x_k; \Theta(t)) = \sum_{j=1}^J p(x_k | j; \theta(t)) P_j(t) $$

We then use the preceding equations to compute the means, variances and a priori probabilities P of the Gaussian mixture by performing the maximization step as follows:

We take:
$$\mu_j (t+1) = \frac{\sum_{k=1}^N P(j | x_k; \Theta(t)) x_k}{\sum_{k=1}^N P(j | x_k; \Theta(t))} $$
$$\sigma_j^2 (t+1) = \frac{\sum_{k=1}^N P(j | x_k; \Theta(t)) ||x_k - \mu_j(t+1)||^2}{\ell \sum_{k=1}^N P(j | x_k; \Theta(t))} $$
$$P_j (t+1) = \frac{1}{N} \sum_{k=1}^N P(j | x_k; \Theta(t)) $$

The equations shown above constitute the EM Algorithm, where we iterate the parameter vector $\theta$ until its change falls below a threshold $\epsilon$:
$$ ||\theta(t+1)-\theta(t)|| < \epsilon. $$

The threshold $\epsilon$ therefore sets the accuracy of the result and is chosen by the user; a smaller $\epsilon$ simply corresponds to more iterations.
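Below is a minimal Matlab sketch of these update equations (a sketch only, not my actual script), assuming the data points are stored row-wise in a matrix X of size N x $\ell$, and that J, maxIter and the stopping threshold eps_thr have been set beforehand:

% EM for an isotropic Gaussian mixture: X is N x ell, J clusters assumed
[N, ell] = size(X);
mu = X(randperm(N, J), :);                     % initial means: J random samples
sig2 = ones(J, 1) * var(X(:));                 % initial scalar variances
P = ones(J, 1) / J;                            % initial a priori probabilities
for t = 1:maxIter
    % E-step: responsibilities P(j | x_k; Theta(t))
    R = zeros(N, J);
    for j = 1:J
        d2 = sum(bsxfun(@minus, X, mu(j, :)).^2, 2);
        R(:, j) = P(j) * exp(-d2 / (2 * sig2(j))) / (2 * pi * sig2(j))^(ell / 2);
    end
    R = bsxfun(@rdivide, R, sum(R, 2));        % normalize over the J clusters
    % M-step: update means, variances and a priori probabilities
    mu_old = mu;
    for j = 1:J
        w = R(:, j);
        mu(j, :) = (w' * X) / sum(w);
        sig2(j) = sum(w .* sum(bsxfun(@minus, X, mu(j, :)).^2, 2)) / (ell * sum(w));
        P(j) = sum(w) / N;
    end
    if norm(mu - mu_old, 'fro') < eps_thr      % convergence check (on the means only, for brevity)
        break
    end
end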

I implemented the EM Algorithm in Matlab, and I got the following parameter estimates for Figure 1.

These parameters are good estimates with minimal error. If we want smaller errors in the estimates, then we just have to allow more iterations by decreasing the value of $\epsilon$.

The error vs the number of iterations made by the algorithm is shown below.

Figure 2. Error vs the number of iterations for (a) linear scale and (b) logarithmic scale.

At around the 15th iteration, the error stabilizes. However, looking at the logarithmically scaled plot, a peak forms at a certain iteration (around the 50th to 60th) before the error eventually falls off.

I tried changing the variance of the dataset so that the clusters have a higher tendency to overlap, but I still got good estimates.

Figure 3. Another trial with higher variance.
The resulting estimates are not bad, considering that the data have many overlaps.

Naturally, a higher number of iterations is needed for the program to arrive at a good estimate, as indicated by the following plot of error vs. iterations.
Figure 4. Error vs. iterations for trial 2 (larger variance)



References
[1] S. Theodoridis. Pattern Recognition. Academic Press, United Kingdom, 1999.


Mapping of school-level frequency of disaster occurrence and NAT Scores using Neural Networks

Our task in this activity is to play around with neural networks and use them for any fitting or pattern recognition application. Since I work part-time as a research assistant in a university-based project on interdisciplinary research about the complexity of Philippine public schools, I have with me an almost complete dataset of public schools in the Philippines, together with their yearly counts of disaster experiences, evacuation center usage, dropouts, graduations and even NAT (National Achievement Test) scores. The main plan was to develop a model that describes the resilience of public schools to natural disasters. Intuitively speaking, one would think that if a school always experiences flooding or any other type of disaster, then the class disruptions would result in a lower overall performance of the school. This hypothesis, however, is not readily observed if one only looks at the national level: national-level data are noisy and present zero variance.

My task in the project was to clean the data and look for correlations between variables in the dataset. For a correlation between variables to exist, it is important to first check the corresponding variances. While national-level analyses do not show a relationship between disaster experience and test scores, sub-national levels give a more meaningful result; these relationships are often masked by other noisy data when we consider the national level.

In this work, I will look at how the frequency of typhoons and of overall disasters experienced, as well as the number of times a school is used as an evacuation center, relate to the performance of the schools in the NAT. In particular, I will focus on data from the province of Leyte.

Will the neural network be able to successfully predict the probable performance of a school given the disaster variables?

I used the existing neural network fitting tool in Matlab. Three training algorithms are available. For each algorithm, I varied the number of hidden neurons and observed the error distribution.

Training algorithms:

A. Levenberg-Marquardt

Here, I used the default number of hidden neurons, which is 10.
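For reference, a minimal sketch of how such a fit can be set up with the toolbox's fitnet function is shown below. The variables inputs (disaster features, one column per school) and targets (NAT scores) are placeholders, not the actual names in my script:

% Fit NAT scores from disaster variables: 10 hidden neurons, Levenberg-Marquardt
net = fitnet(10, 'trainlm');
net.divideParam.trainRatio = 0.70;     % 70/15/15 split into training,
net.divideParam.valRatio   = 0.15;     % validation and test sets
net.divideParam.testRatio  = 0.15;
[net, tr] = train(net, inputs, targets);
outputs = net(inputs);
errors  = targets - outputs;           % errors summarized in the histograms below
% ploterrhist(errors); plotregression(targets, outputs);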


Figure A1. Left: Error histogram. Right: Regression analysis for training, validation and tests. 
The error distribution is Gaussian in shape with a mean at about -2.

I then changed the number of hidden neurons to 100, and I got the following error histogram.
Figure A2. Error histogram for hidden neurons = 100. 

The mean is still around 2.3, but an outlier is observed at around 109.9, thereby increasing the variance of the distribution.


 B. Scaled Conjugate Gradient
Figure C1.  Left: Error histogram. Right: Regression analysis for training, validation and tests. 

Figure C2. Error histogram for hidden neuron = 100, with the network retrained multiple times. 

Training multiple times generates different results due to different initial conditions and sampling. As can be seen from Figure C2, repeatedly retraining the network does not always lead to better results.
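One simple way to handle this is to retrain the same architecture several times and keep the best run. A hedged sketch (again with placeholder inputs and targets, and an arbitrary choice of 10 restarts):

% Retrain several times and keep the network with the lowest test-set error
best = Inf;
for k = 1:10
    net = fitnet(100, 'trainscg');                 % scaled conjugate gradient
    [net, tr] = train(net, inputs, targets);       % weights re-initialized each run
    e = perform(net, targets(tr.testInd), net(inputs(:, tr.testInd)));
    if e < best
        best = e;
        bestNet = net;
    end
end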

It should be noted that these learning algorithms differ in terms of training speed and the amount of memory used. The number of hidden neurons is sometimes increased to improve the performance of the network. However, in cases where the data consist of a simpler, smaller number of points, it is perhaps better to use a smaller number of hidden neurons. Based on what I've read, the number of hidden neurons should roughly be between the size of the input layer and the size of the output layer [1].

As can be observed from the regression plots and error histograms, the neural network can indeed predict, to a moderate level of accuracy, the possible test score of a school given the frequency of disasters it experienced.


References:
[1] J. Heaton. Introduction to Neural Networks for Java.




Tuesday, December 15, 2015

Color Feature Extraction as a Method to Track Skin Color

When dealing with image retrieval, indexing, and classification, perhaps the easiest feature to use is color, since it is visually easy to inspect. However, the wide variety of possible color values, as well as the unavoidable changes in illumination during experimentation, gives rise to certain difficulties that need to be addressed.

In this activity, the goal is to track our face in a real-time video by using our skin as a feature. The additional challenge is that it has to succeed even if the illumination varies from time to time. As I have previously discussed in the segmentation activity in our Applied Physics 186 course, the success of the segmentation is highly dependent on the region of interest (ROI) that is used in constructing the 2D histogram below:

Figure 1. Normalized Chromaticity Coordinates (NCC) color space (left) and the corresponding 2D histogram of my skin (right)
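To make the construction concrete, here is a minimal Matlab sketch of how the r-g histogram of a skin ROI can be built and then backprojected onto a video frame. The variables roi and frame are placeholder RGB arrays, and the bin count and threshold are arbitrary example choices, not the values I actually used:

% Build a 2D r-g (NCC) histogram from the skin ROI
roi = double(roi);  frame = double(frame);
Ir = roi(:,:,1) + roi(:,:,2) + roi(:,:,3) + eps;        % per-pixel intensity
r  = roi(:,:,1) ./ Ir;    g = roi(:,:,2) ./ Ir;         % chromaticity of the ROI
nbins = 32;
ri = floor(r * (nbins - 1)) + 1;   gi = floor(g * (nbins - 1)) + 1;
H  = accumarray([ri(:) gi(:)], 1, [nbins nbins]);       % 2D r-g histogram
H  = H / max(H(:));                                     % normalize to [0, 1]
% Backproject the histogram onto a frame to get a skin-likelihood map
If = frame(:,:,1) + frame(:,:,2) + frame(:,:,3) + eps;
rf = floor(frame(:,:,1) ./ If * (nbins - 1)) + 1;
gf = floor(frame(:,:,2) ./ If * (nbins - 1)) + 1;
skinProb = H(sub2ind([nbins nbins], rf, gf));
mask = skinProb > 0.1;                                  % arbitrary threshold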



If we use a region of interest that corresponds to only a single illumination, the program most likely won't be able to track the skin successfully. To demonstrate this, I tried using two types of ROI. Let's first try using a single illumination as the ROI:

(a) Single Illumination in ROI

Figure 2. One region of my face at a single illumination

Using Figure 2 as ROI, I arrived at the following result:


Video 1. Demonstration of face tracking using a single representative ROI under a single illumination

As one may easily observe, the bounding box switches from small to big, suggesting that only a portion of my face was detected. At the end of the demonstration, it was able to track the whole face since the illumination had become similar to that of the ROI used.


(b) Varying illumination in ROI

Figure 3. Several regions of my skin at various illumination.


Using different representative ROIs taken from different frames of the video, I was able to track the whole of my face, unlike the previous result where only a single cheek was enclosed by the bounding box.


As much as I want to post a real-time face tracker here, I'm still in the process of figuring out how to post user-defined media on Blogger. So in the meantime, please bear with the demos first. Ciao! :)

Sunday, December 13, 2015

Class Separability Measures and ROC Curves

In Theodoridis' introduction to Chapter 5 (Feature Selection), he mentions that in order to create a good classifier, one must get rid of the "curse of dimensionality". In many cases, one is faced with several features, sometimes far more than the number of classes; this only leads to an increase in computational complexity with very little gain. Before selecting features, we must first determine how separated the classes are with respect to the features that we have. Thus, we seek to understand and answer the following question:
Given a set of data consisting of different classes with corresponding features, how do we measure the degree of their separability?
Quantitatively, we aim to look for features that lead to large between-class distance and small within-class variance.

I. Class Separability Measures

The goal of the first part of the activity is to identify different class separability measures based on how the data are scattered in the $\ell$-dimensional space (where $\ell$ is the number of features).
For this activity, I generated the following scatter points.
Figure 1. Three classes of scatter points in a two-dimensional space, generated with the following means and covariance matrices:

The class separability measures are given by the following:
  • Within-class scatter matrix
$$S_w = \sum_{i=1}^M P_i S_i $$
where $S_i$ and $P_i$ are the covariance matrix and a priori probability for class $\omega_i$, respectively. We can compute $P_i$ by dividing the number of samples in class $i$ by the overall number of data points, so that $P_i = \frac{n_i}{N}$.
From here, it is easy to see that $\mathrm{trace}[S_w]$ gives the average of the variances over all classes.
  • Between-class scatter matrix
$$S_b = \sum_{i=1}^M P_i (\mu_i - \mu_O)(\mu_i - \mu_O)^T $$

where $\mu_O$ is the overall mean of the samples, or the global mean vector. This is given by:
$$\mu_O = \sum_{i=1}^M P_i \mu_i $$

The $\mathrm{trace}[S_b]$ gives the average distance of the mean of each individual class from the global mean.
  • Mixture scatter matrix
$$S_m = E [ (\textbf{x} - \mu_O)(\textbf{x} - \mu_O)^T ] $$

which gives us the covariance matrix of the feature vectors with respect to the global mean. Alternatively, one can compute $S_m$ by just adding $S_w$ and $S_b$, so that:
$$S_m = S_w + S_b. $$

In parallel with the previous definitions, we see that $\mathrm{trace}[S_m]$ gives the sum of the variances of the features around the global mean.
I got the following values of Sb and Sw.

From these definitions, a new set of criteria is defined so as to be able to identify how scattered the data points are with respect to each other and within each class. These are $J_1$, $J_2$, and $J_3$, defined as follows:
$$J_1 = \frac{\mathrm{trace}[S_m]}{\mathrm{trace}[S_w]} $$
$$J_2 = \frac{|S_m|}{|S_w|} $$
$$J_3 = \mathrm{trace}[S_w^{-1} S_m] $$

which have the following values:

From the definitions of $S_m$ and $S_w$, the values of $J_1$, $J_2$ and $J_3$ should tell us something about how scattered the data are.
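A short Matlab sketch of how these quantities can be computed is given below, assuming the samples of class $i$ are stored row-wise in X{i}, a cell array of $n_i \times \ell$ matrices (a sketch for illustration, not my exact code):

% Within-class, between-class and mixture scatter matrices, plus J1, J2, J3
M  = numel(X);
N  = sum(cellfun(@(c) size(c, 1), X));
mu_i = cellfun(@(c) mean(c, 1), X, 'UniformOutput', false);
P  = cellfun(@(c) size(c, 1) / N, X);              % a priori probabilities P_i
ell = size(X{1}, 2);
mu0 = zeros(1, ell);
for i = 1:M
    mu0 = mu0 + P(i) * mu_i{i};                    % global mean vector
end
Sw = zeros(ell); Sb = zeros(ell);
for i = 1:M
    Sw = Sw + P(i) * cov(X{i}, 1);                               % within-class scatter
    Sb = Sb + P(i) * (mu_i{i} - mu0)' * (mu_i{i} - mu0);         % between-class scatter
end
Sm = Sw + Sb;                                      % mixture scatter matrix
J1 = trace(Sm) / trace(Sw);
J2 = det(Sm) / det(Sw);
J3 = trace(Sw \ Sm);                               % trace of inv(Sw) * Sm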

For comparison, I computed another set of $S$ and $J$ values for smaller variances with diagonal covariance matrices. This is for the following scattered data:
Figure 2. Second example of data sets scattered in a two-dimensional space, with a smaller variance in each class.
The corresponding $J$-values for this case are given by the following:

As can be seen from the $J$-values, larger values correspond to a smaller within-class variance, so that the data in the $\ell$-dimensional space are well clustered around their respective means. It also means that the classes are well separated, away from each other.

II. ROC Curves

Another way of measuring how separated classes are is the use of ROC curves, or receiver operating characteristic curves. This technique provides us with information about the overlap between classes. An example of an ROC curve is shown below:

Figure 3. (a) An example of an overlapped pdf (the other one is inverted), and the corresponding (b) ROC Curve (figure obtained from Ref. [1])

The pdfs in Figure 3 describe the distribution of a feature in two classes. The vertical line represents the chosen threshold, so that samples to the left are automatically assigned to class $\omega_1$ and those to the right to class $\omega_2$. The error associated with this assignment is given by the shaded regions in Figure 3(a).

In this part of the activity, we are tasked to generate two sets of N randomly distributed values corresponding to a single feature, sampled from any arbitrary distribution. In my case, I used a uniform distribution bounded between 0 and 1 to create normal distributions via the inverse sampling method. Shown in the first column of the figure below are the two distributions with different amounts of overlap, depending on their respective means and variances (one distribution is inverted for easier inspection).
Figure 4. ROC Curves for various separations of two classes.

The ROC curve varies with the amount of overlap between the distributions. The larger the overlap, the smaller the area bounded by the ROC curve. In the first example [Fig. 4(a)], the two distributions completely overlap, leading to an almost zero bounded area in the corresponding right panel. At the other extreme, where no overlap between the distributions is observed, the whole upper triangle is shaded.
Thus the ROC curve gives a measure of the class discrimination capability of the specific feature being considered, with the bounded area ranging from 0 (complete overlap) to 1/2 (no overlap).
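For completeness, here is a small Matlab sketch of how such a curve can be traced for a single feature, with the normal samples drawn from uniform deviates via inverse transform sampling. The means and standard deviations are arbitrary example values:

% Two 1D class-conditional distributions and their empirical ROC curve
N = 5000;
mu1 = 0;   s1 = 1;                                     % class w1 (example values)
mu2 = 1.5; s2 = 1;                                     % class w2 (example values)
x1 = mu1 + s1 * sqrt(2) * erfinv(2 * rand(N, 1) - 1);  % inverse sampling of a normal
x2 = mu2 + s2 * sqrt(2) * erfinv(2 * rand(N, 1) - 1);
thr = linspace(min([x1; x2]), max([x1; x2]), 200);     % sweep of thresholds
a = zeros(size(thr)); b = zeros(size(thr));
for i = 1:numel(thr)
    a(i) = mean(x1 > thr(i));      % fraction of w1 samples falling on the w2 side
    b(i) = mean(x2 > thr(i));      % fraction of w2 samples correctly on the w2 side
end
plot(a, b); xlabel('\alpha'); ylabel('\beta');         % less overlap gives a larger bounded area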


References:
[1] S. Theodoridis and K. Koutroumbas. Pattern Recognition, 4th Ed. Academic Press, United Kingdom, 2009.



Tuesday, September 15, 2015

Pattern Recognition via Texture

When you google the word texture, it returns the following definition:

noun: 1. the feel, appearance, or consistency of a surface or a substance
"skin texture and tone"
synonyms: feel, touch; appearance, finish, surface, grain; quality, consistency

In reality, one could easily identify the texture of a surface by merely evaluating it through our senses, such as touch and sight. In machine learning, however, we are limited to the appearance of the image -- that is, the gradient of graylevel values across an area. In this activity, we use texture as a feature for recognizing patterns in an image. There are already several techniques for analyzing the texture of images. In particular, we will be using the local binary pattern (LBP) technique, which is essentially non-reversible, meaning one cannot use the final output (that is, the histogram) to get back the original image.

Now, how do we obtain the LBP feature vector? We follow this algorithm:
  1. Compare the graylevel value of a pixel to its eight nearest neighbors. If a neighbor is greater than the pixel itself, assign it a value of 1, otherwise 0. For each pixel, there should be 8 binary digits.
  2. Convert these 8 binary digits to decimal by multiplying each one by its corresponding power of 2 ([1, 2, 4, 8, 16, 32, 64, 128]) and summing them all.
  3. Do this for all pixels in the image. The end product should be another image with pixel values ranging from 0 to 255.
This algorithm is easier to understand by following the diagram below:
Figure 1. LBP algorithm to be implemented
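A minimal Matlab sketch of this basic 3x3 LBP is shown below, assuming I is a grayscale image stored as a double array. The neighbor weights follow one possible clockwise ordering; any fixed ordering works as long as it is used consistently:

% Basic 3x3 local binary pattern and its 256-bin histogram
[rows, cols] = size(I);
lbp = zeros(rows, cols);
weights = [1 2 4; 128 0 8; 64 32 16];          % one possible ordering of the 8 neighbors
for r = 2:rows-1
    for c = 2:cols-1
        block = I(r-1:r+1, c-1:c+1);
        code  = (block >= I(r, c)) .* weights; % neighbors at least as bright as the center
        lbp(r, c) = sum(code(:));
    end
end
interior = lbp(2:end-1, 2:end-1);              % drop the untouched border pixels
h = histc(interior(:), 0:255);                 % the LBP histogram is the feature vector
h = h / sum(h);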

Upon writing the code, I first tried it on a random image I found on Google. This is an image of broken glass, which I find fascinating.
Figure 2. Image of the (left) shattered glass and its (right) corresponding LBP.

As one can see, one could still easily detect the location of the fractures and the relatively smooth part of the mirror. In order to extract the features, one should obtain the histogram of the LBP values. For textures of the same class (fur, canvas, seeds and so on), the histograms should be relatively similar.
Shown below are just some of the texture classes I processed, with their corresponding LBP histograms.
Figure 3. Class: Barley Rice


From the images shown above, one can easily see that the images have similar textures but different colors. As expected, the histograms are relatively the same.

(a) Different canvas used
(b) Corresponding LBP images of canvas
(c) Histogram of the LBP for the canvas class
Figure 4. Sample images of the (a) original canvas used and the corresponding (b) LBP images. The histograms are shown in (c)

It is interesting to see that in Fig. 4, the histograms of the LBP for all canvas images are still relatively similar, regardless of the fact that the spatial patterns in the canvas have different directions/orientations.

I have also applied the algorithm to several other classes, such as chips and seeds, among others.

References:

[1] Local Binary Patterns. Scholarpedia. Retrieved from http://www.scholarpedia.org/article/Local_Binary_Patterns, September 15, 2015.
[2] Images obtained from Outex (University of Oulu), http://www.outex.oulu.fi/index.php?page=browse, retrieved September 15, 2015.






Tuesday, September 1, 2015

Moment Invariants as a tool in pattern recognition and understanding images

An image moment can be thought of as a weighted average of the intensities of the image pixels [1]. Just like moments defined in mathematics, it gives us an idea of the shape of a set of points, so that when the image moments are known, the corresponding information about the orientation, centroid, and area is also obtained.

For a given digital image $f(x,y)$, the 2D raw moment is defined as:
$$ m_{pq} = \sum_{x=0}^{M-1} \sum_{y=0}^{N-1} x^p y^q f(x,y) $$
where p and q are non-negative integers. Alternatively, we can also get the central moments by using the following relation:
$$\mu_{pq} = \sum_{x=0}^{M-1}\sum_{y=0}^{N-1} (x-\overline{x})^p (y-\overline{y})^q f(x,y) $$
where
$$\overline{x} = \frac{m_{10}}{m_{00}} $$ $$\overline{y} = \frac{m_{01}}{m_{00}}$$
From here, we can obtain the normalized central moments by solving for $\eta_{pq}$ defined as
$$\eta_{pq} = \frac{\mu_{pq}}{\mu_{00}^\gamma}$$
for $$\gamma = \frac{p+q}{2} +1 $$ where $$p+q = 2,3...$$

Hu's uniqueness theorem states that if $f(x,y)$ is piecewise continuous and has nonzero values only on a finite part of the $xy$-plane, then its moments of all orders exist. It also follows that the moments $m_{pq}$ are uniquely determined by $f(x,y)$, and conversely, $f(x,y)$ can be obtained from the information in the moments $m_{pq}$ [2].

In his paper published back in 1962, Hu enumerated a set of seven moments that are invariant to translation, scale change, mirroring (to within a minus sign) and rotation. In this activity, the goal is to calculate these moments for the following set of images:
1. Grayscale Image
2. Synthetic edge image
3. Edge Image of a real object
We generate each of these images and their corresponding rescaled and rotated versions, so in total we should have 9 images. The moments of each of these images are computed and then compared to one another.

We start by showing the following set of images that I generated:

Figure 1. Synthetic images (a) original (b) rotated and (c) scaled down, which are generated for analysis of Hu's invariant moments.


Figure 2. Edges of a real image: (a) original, (b) rotated and (c) scaled down. The image of Bing Bong from the movie Inside Out was obtained from Google Images [3]

Figure 3. Grayscale images of a cup: (a) original, (b) rotated and (c) scaled. Image taken from Google Images

After generating the images, I applied the program which computes the corresponding normalized central moments and all seven of Hu's invariant moments, $\phi_1$ through $\phi_7$.
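A minimal Matlab sketch of how these moments can be computed is given below, assuming I holds the (grayscale or edge) image being processed. Only the first two of Hu's invariants are written out; the remaining five follow Hu's formulas in the same way [2]:

% Raw, central and normalized central moments of an image I
I = double(I);
[rows, cols] = size(I);
[x, y] = meshgrid(0:cols-1, 0:rows-1);                    % pixel coordinates
m = @(p, q) sum(sum((x.^p) .* (y.^q) .* I));              % raw moment m_pq
xbar = m(1, 0) / m(0, 0);
ybar = m(0, 1) / m(0, 0);
mu = @(p, q) sum(sum(((x - xbar).^p) .* ((y - ybar).^q) .* I));   % central moment
eta = @(p, q) mu(p, q) / mu(0, 0)^((p + q) / 2 + 1);      % normalized central moment
% First two of Hu's seven invariant moments:
phi1 = eta(2, 0) + eta(0, 2);
phi2 = (eta(2, 0) - eta(0, 2))^2 + 4 * eta(1, 1)^2;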

I got the following results:

Table 1. Hu's invariant moments for the real image edge

Moments    Original    Rotated    Rescaled (0.5x)
ϕ1         2.8469      2.8465     2.8469
ϕ2         4.0568      4.0559     4.0568
ϕ3         4.9024      4.8953     4.9024
ϕ4         4.6394      4.6449     4.6394
ϕ5         9.3788      9.3821     9.3788
ϕ6         5.9112      5.929      5.9112
ϕ7         8.3275      8.0973     8.3275


Table 2. Hu's invariant moments for the grayscale image

Moments    Original    Rotated    Rescaled (0.5x)
ϕ1         -6.8124     -6.8124    -6.8125
ϕ2         -21.6559    -21.6559   -21.6559
ϕ3         -26.54      -26.54     -26.54
ϕ4         -26.021     -26.021    -26.0213
ϕ5         -54.504     -54.504    -54.5046
ϕ6         -37.0249    -37.0249   -37.0252
ϕ7         -27.5753    -25.2761   -27.5758




Table 3. Hu's invariant moments for the synthetic image

Moments    Original    Rotated    Rescaled (0.5x)
ϕ1         2.1266      2.1196     2.1266
ϕ2         -11.698     -11.7406   -11.698
ϕ3         -7.8382     -7.8438    -7.8382
ϕ4         -7.7756     -7.6766    -7.7756
ϕ5         -17.3893    -16.7467   -17.3892
ϕ6         -16.1036    -14.7318   -16.1036
ϕ7         -8.8509     -6.5823    -8.8509

The tables above show the corresponding Hu's moments, which are seen to be invariant for the scaled and rotated images. Notice that for the real image edge the moments are positive, while the rest are negative. Although almost all of the computed Hu's moments in each of the transformations are observed to be invariant, with minimal differences in their values, the 6th and 7th moments differ a little for the rotated transformation.

Another related independent basis set is presented by Flusser, Suk and Zitová [3]. In their work, they identified only six invariants and showed which of Hu's seven moments is actually dependent.

The following table shows the Flusser-Suk moment invariants for the synthetic edge image.

Table 4. Flusser-Suk moment invariants for the synthetic edge image.


[1] Image moment. Wikipedia. Retrieved September 1, 2015.

[2] M.-K. Hu. Visual pattern recognition by moment invariants. IRE Trans. on Information Theory, 8(2):179-187, 1962.

[3] J. Flusser, T. Suk, B. Zitová. Moments and Moment Invariants in Pattern Recognition. John Wiley and Sons, 2009.