Monday, October 3, 2011

Approximating reality - Gaussian mixture modeling

One of the hidden beauties of statistics is the central limit theorem. When many independent random values are summed or averaged, the distribution of the result approaches the normal distribution, whatever distributions the individual values came from (provided their variances are finite). This seems nearly magical when you are first introduced to the topic, but with some thought and practice the concept becomes clear. (To help elucidate the concept, check out this Java applet with various distribution sampling.) In contrast to this inherent trend toward the normal distribution, if the samples come from several entirely different distributions and are pooled rather than summed, we end up with an overall non-normal, often multi-peaked distribution, which is difficult to deal with.
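To see the averaging effect numerically, here is a minimal sketch using only the Python standard library. The choice of a uniform distribution, 30 draws per average, and the seed are all arbitrary assumptions made for illustration. It checks two things a normal distribution would predict: the spread of the averages shrinks to roughly sqrt(1/12)/sqrt(30), and about 68% of the averages land within one standard deviation of the center.

```python
import random
import statistics


def sample_means(n_terms, n_samples, rng):
    """Return n_samples averages, each taken over n_terms Uniform(0, 1) draws."""
    return [
        sum(rng.random() for _ in range(n_terms)) / n_terms
        for _ in range(n_samples)
    ]


rng = random.Random(0)  # arbitrary fixed seed so the run is repeatable
means = sample_means(n_terms=30, n_samples=5000, rng=rng)

center = statistics.fmean(means)
spread = statistics.stdev(means)

# Uniform(0, 1) has mean 1/2 and variance 1/12, so averages of 30 draws
# should center near 0.5 with standard deviation near sqrt(1/360).
# For a normal distribution, about 68% of values fall within one
# standard deviation of the center.
within_one_sd = sum(abs(m - center) <= spread for m in means) / len(means)

print(f"center={center:.3f} spread={spread:.4f} within 1 sd={within_one_sd:.3f}")
```

Swapping `rng.random()` for another draw with finite variance (for example `rng.expovariate(1.0)`) should give the same bell shape, just with a different center and spread.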

For instance (imagine the world is a line segment for ease of visualization), picture a robot traveling along a line with two sensors detecting its position. If the sensors are corrupted with normally distributed noise, we cannot be certain precisely where the robot is; we only have a probabilistic sense of its position. If we didn't have a clue which sensor each reading came from, we'd have a smear of data across the x-axis. The data would be clustered where each sensor thought the robot was, but viewing the data as a whole, we would just see lots of points. OK, now where am I going with this? What if we want to determine which sensor produced which readings? We have to separate the data points and find the individual distributions that generated the overall data stream.
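The pooled sensor readings can be mimicked with a few lines of Python. This is only a sketch: the two sensor positions (2.0 and 6.0), their noise levels, and the sample count are made-up values, not numbers from any real robot. Each reading is drawn from a randomly chosen sensor and the label is thrown away, which gives the two-humped smear described above.

```python
import random

# Hypothetical sensors: (position each sensor reports, noise std dev).
SENSORS = [(2.0, 0.5), (6.0, 0.8)]


def pooled_readings(n, rng):
    """Draw n readings, each from a randomly chosen sensor, discarding the label."""
    readings = []
    for _ in range(n):
        mean, std = rng.choice(SENSORS)
        readings.append(rng.gauss(mean, std))
    return readings


def text_histogram(values, lo, hi, bins=16, width=50):
    """Count values per bin on [lo, hi) and render one text row per bin."""
    counts = [0] * bins
    for v in values:
        if lo <= v < hi:
            counts[int((v - lo) / (hi - lo) * bins)] += 1
    peak = max(counts)
    rows = []
    for i, c in enumerate(counts):
        left_edge = lo + i * (hi - lo) / bins
        rows.append(f"{left_edge:5.2f} | " + "#" * round(c / peak * width))
    return counts, rows


rng = random.Random(1)  # arbitrary fixed seed for repeatability
data = pooled_readings(2000, rng)
counts, rows = text_histogram(data, lo=0.0, hi=8.0)
print("\n".join(rows))
```

The printed histogram should show one hump near each sensor's reported position with a dip in between, which is exactly the shape a single normal distribution cannot describe.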


One potential method for analyzing this data is the Gaussian mixture model. In a nutshell, the mixture model technique determines which set each data point most likely belongs to and fits an individual normal distribution (any distribution could be chosen, though) to each set. Technically, the Gaussian mixture modeling algorithm relies on expectation maximization, which alternates between two steps: estimating how likely each data point is to belong to each set given the current distributions, and re-fitting each distribution to the points weighted by those likelihoods. You can read more about this on Wikipedia or in this paper. Just to give you all a visual of the capabilities of the Gaussian mixture modeling technique, I created a basic simulation with three mixture components that the technique separates into clusters.
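The simulation itself is not reproduced in this text. As a rough stand-in, here is a minimal sketch of expectation maximization for a one-dimensional Gaussian mixture, written with the Python standard library only. Everything specific in it is an assumption made for illustration: the three synthetic cluster centers and spreads, the quantile-based starting means, the starting variance heuristic, and the fixed iteration count. A real analysis would more likely use an established library implementation and a proper convergence test.

```python
import math
import random


def normal_pdf(x, mean, var):
    """Density of a normal distribution with the given mean and variance."""
    return math.exp(-((x - mean) ** 2) / (2 * var)) / math.sqrt(2 * math.pi * var)


def fit_gmm_1d(data, k, n_iter=100):
    """Fit a k-component 1-D Gaussian mixture with plain EM."""
    n = len(data)
    ordered = sorted(data)
    # Starting guess (a simple heuristic, not part of EM itself):
    # means at evenly spaced quantiles, equal weights, modest variances.
    means = [ordered[int((j + 0.5) * n / k)] for j in range(k)]
    overall_mean = sum(data) / n
    overall_var = sum((x - overall_mean) ** 2 for x in data) / n
    variances = [overall_var / k ** 2] * k
    weights = [1.0 / k] * k

    for _ in range(n_iter):
        # E-step: how responsible is each component for each point?
        resp = []
        for x in data:
            p = [w * normal_pdf(x, m, v)
                 for w, m, v in zip(weights, means, variances)]
            total = sum(p) or 1e-300  # guard against numerical underflow
            resp.append([pj / total for pj in p])
        # M-step: re-fit each component to its responsibility-weighted points.
        for j in range(k):
            nj = sum(r[j] for r in resp)
            weights[j] = nj / n
            means[j] = sum(r[j] * x for r, x in zip(resp, data)) / nj
            var_j = sum(r[j] * (x - means[j]) ** 2 for r, x in zip(resp, data)) / nj
            variances[j] = max(var_j, 1e-6)  # keep variances strictly positive
    return weights, means, variances


# Synthetic data: three made-up clusters as (mean, std dev), 300 points each.
TRUE_CLUSTERS = [(-4.0, 1.0), (0.0, 0.7), (5.0, 1.2)]
rng = random.Random(42)  # arbitrary fixed seed for repeatability
data = [rng.gauss(m, s) for m, s in TRUE_CLUSTERS for _ in range(300)]

weights, means, variances = fit_gmm_1d(data, k=3)
for w, m, v in sorted(zip(weights, means, variances), key=lambda t: t[1]):
    print(f"weight={w:.2f} mean={m:.2f} std={math.sqrt(v):.2f}")
```

The fitted means should land close to the cluster centers used to generate the data. Note that the responsibilities are soft: each point is shared among the components in proportion to how well each explains it, and a hard cluster label, if needed, is simply the component with the largest responsibility.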
