Figure 1. Example of a multimodal distribution with N = 3000 points and J = 3 clusters, with the following parameters:
The EM algorithm is an iterative method for finding maximum-likelihood estimates of parameters when the data are incomplete, i.e., when part of the information (here, the cluster membership of each point) is unobserved.
Suppose we do not know anything about the parameters that generated the multimodal dataset shown in Figure 1. We assume a number of Gaussian distributions present in the two-dimensional space; in my work I used J = 3 (3 clusters/Gaussian distributions). We also need an initial guess for the parameters, $\theta(0) = [\mu, \sigma^2, P]$, where $\mu$, $\sigma^2$ and $P$ are the means, variances, and a priori probabilities of the $J$ clusters.
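To make this setup concrete, here is a rough NumPy sketch (not the Matlab code I actually used) that draws a synthetic two-dimensional dataset from a mixture of J = 3 isotropic Gaussians and builds an initial guess $\theta(0)$; the "true" means, variances, and priors below are placeholders, not the parameters behind Figure 1.

```python
import numpy as np

rng = np.random.default_rng(0)

N, J, ell = 3000, 3, 2          # samples, clusters, dimension

# Placeholder "true" mixture used only to generate a synthetic dataset
# (these are NOT the parameters of Figure 1).
true_mu = np.array([[0.0, 0.0], [4.0, 4.0], [-4.0, 4.0]])
true_sigma2 = np.array([1.0, 1.5, 0.8])
true_P = np.array([0.4, 0.3, 0.3])

# Draw N points: pick a cluster for each point, then sample from it.
labels = rng.choice(J, size=N, p=true_P)
X = true_mu[labels] + np.sqrt(true_sigma2[labels])[:, None] * rng.standard_normal((N, ell))

# Initial guess theta(0) = [mu, sigma^2, P]: random data points as means,
# unit variances, uniform a priori probabilities.
mu = X[rng.choice(N, size=J, replace=False)]   # J x ell
sigma2 = np.ones(J)                            # length-J vector of variances
P = np.full(J, 1.0 / J)                        # uniform priors
```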
Then we compute the conditional probability of a point $x_k$ given cluster $j$:
$$p(x_k| j ; \theta) = \frac{1}{(2\pi\sigma_j^2)^{\ell/2}} \exp\left(-\frac{|| x_k - \mu_j ||^2}{2\sigma_j^2}\right)$$
where $\ell$ is the dimension of the space (in our case, $\ell = 2$).
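In code, this density can be evaluated for every point and every cluster at once. A NumPy sketch, reusing `X`, `mu`, and `sigma2` from the snippet above:

```python
import numpy as np

def component_density(X, mu, sigma2):
    """p(x_k | j; theta) for isotropic Gaussians; returns an N x J array."""
    ell = X.shape[1]
    # ||x_k - mu_j||^2 for every point k and every cluster j, shape N x J.
    sq_dist = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-sq_dist / (2.0 * sigma2)) / (2.0 * np.pi * sigma2) ** (ell / 2.0)
```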
We then substitute this into:
$$ P(j | x_k; \Theta(t)) = \frac{p(x_k| j ; \theta(t)) P_j(t)}{p(x_k; \Theta(t))} $$
where $p(x_k; \Theta(t))$ is the normalizing factor:
$$p(x_k; \Theta(t)) = \sum_{j=1}^J p(x_k| j ; \theta(t)) P_j(t) $$
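The E-step therefore amounts to weighting the component densities by the a priori probabilities and normalizing over the clusters. A sketch, assuming `component_density` from above:

```python
def e_step(X, mu, sigma2, P):
    """P(j | x_k; Theta(t)): posterior of cluster j for each point x_k, shape N x J."""
    # Numerator p(x_k | j; theta(t)) * P_j(t); the row sums give p(x_k; Theta(t)).
    num = component_density(X, mu, sigma2) * P
    return num / num.sum(axis=1, keepdims=True)
```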
We then use the preceding equations to compute the means, variances and a priori probabilities $P$ of the Gaussian mixture by performing the maximization step as follows:
We take:
$$\mu_j (t+1) = \frac{\sum_{k=1}^N P(j | x_k; \Theta(t)) x_k}{\sum_{k=1}^N P(j | x_k; \Theta(t))} $$
$$\sigma_j^2 (t+1) = \frac{\sum_{k=1}^N P(j | x_k; \Theta(t)) ||x_k - \mu_j(t+1)||^2}{\ell \sum_{k=1}^N P(j | x_k; \Theta(t))} $$
$$P_j (t+1) = \frac{1}{N} \sum_{k=1}^N P(j | x_k; \Theta(t)) $$
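These three updates map directly onto array operations. A sketch of the maximization step, where `gamma` holds the responsibilities $P(j | x_k; \Theta(t))$ returned by the E-step above:

```python
def m_step(X, gamma):
    """Update the means, isotropic variances, and a priori probabilities."""
    N, ell = X.shape
    Nj = gamma.sum(axis=0)                               # effective number of points per cluster
    mu = (gamma.T @ X) / Nj[:, None]                     # J x ell weighted means
    sq_dist = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)
    sigma2 = (gamma * sq_dist).sum(axis=0) / (ell * Nj)  # per-cluster variances
    P = Nj / N                                           # a priori probabilities
    return mu, sigma2, P
```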
The equations shown above constitute the EM algorithm, where we iterate on the parameters $\Theta$ until the change between iterations falls below a threshold $\epsilon$:
$$ ||\Theta(t+1)-\Theta(t)|| < \epsilon $$
The threshold $\epsilon$ therefore controls the accuracy of the result and is chosen by the user. A smaller $\epsilon$ simply corresponds to more iterations.
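Putting the pieces together, the full iteration might look like the following sketch (an illustrative loop built on the functions above, not my Matlab program; the value of $\epsilon$ and the iteration cap are arbitrary choices):

```python
import numpy as np

eps = 1e-6        # smaller eps -> more iterations, higher accuracy
max_iter = 500

for t in range(max_iter):
    gamma = e_step(X, mu, sigma2, P)                 # E-step
    mu_new, sigma2_new, P_new = m_step(X, gamma)     # M-step
    # ||Theta(t+1) - Theta(t)|| over all parameters.
    change = np.linalg.norm(np.concatenate([(mu_new - mu).ravel(),
                                            sigma2_new - sigma2,
                                            P_new - P]))
    mu, sigma2, P = mu_new, sigma2_new, P_new
    if change < eps:
        break

print(f"stopped after {t + 1} iterations")
print("means:\n", mu)
print("variances:", sigma2)
print("a priori probabilities:", P)
```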
I implemented the EM algorithm in Matlab and obtained the following parameter estimates for the data in Figure 1.
These are good estimates with minimal error. If we want a smaller error in the estimates, we simply decrease the value of $\epsilon$, which increases the number of iterations.
The error versus the number of iterations of the algorithm is shown below.
Figure 2. Error vs the number of iterations for (a) linear scale and (b) logarithmic scale.
At around the 15th iteration, the error stabilizes. However, on the logarithmic-scale plot, a peak forms at around the 50th to 60th iteration before the error eventually falls off.
I then increased the variance of the dataset so that the clusters have a greater tendency to overlap, and I still got good estimates.
Figure 3. Another trial with higher variance.
I got the following estimates, which are not bad considering that the clusters overlap heavily.
Naturally, a higher number of iterations is needed for the program to reach a good estimate, as indicated by the following plot of error versus iterations.
Figure 4. Error vs. Iterations for trial 2 (larger variance)