A random variable $\xi$ with values in $\mathbb{R}$ exhibiting two clusters is modelled by assuming that its probability distribution is a mixture of two Gaussians. For initial purposes we assume the samples from the individual Gaussians to be labelled accordingly, i.e. each sample is in state $\kappa=1$ or $\kappa=2$, where $\kappa$ is called the label. Then the generation process of $\xi$ is as follows:

\begin{enumerate}
\item Choose either the first or the second Gaussian by using the (given) distribution of the label, $P(\kappa=i)$, $i=1,2$. For brevity, we write $P(\kappa=i)=p_i$, where $p_1+p_2=1$.
\item Generate a sample according to this Gaussian using the usual Gaussian probability density function.
\end{enumerate}

Hence the density of the random variable $\xi$ is:

\begin{eqnarray*}
f_{\xi|\theta}(x) &=& \sum_{k=1}^{2} p_k\,\frac{1}{\sqrt{2\pi}\,\sigma}\exp\left(-\frac{(x-\mu_k)^2}{2\sigma^2}\right)\\
&=& \sum_{k=1}^{2} P(\kappa=k)\, f_{\xi|\theta,\kappa=k}(x)
\end{eqnarray*}

where $\theta=(\mu_1,\mu_2,\sigma)$ is the parameter vector and we denote $\mu=(\mu_1,\mu_2)$. We would like to get a grip on the probability distribution of the parameters $\mu_1$ and $\mu_2$ (we regard $\sigma$ as fixed). For that purpose, we assume a

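and accept the following black-box Bayesian formulas for densities:

\begin{itemize}
\item If $X$ and $Y$ are continuously distributed according to their densities $f_X$ and $f_Y$, then
\begin{equation*}
f_{X|Y=y}(x)=\frac{f_{Y|X=x}(y)\, f_X(x)}{f_Y(y)}
\end{equation*}
\item If $X$ is discretely distributed with discrete probabilities $P(X=x_i)=p_i$, $Y$ is continuously distributed according to its density $f_Y$, and the conditional density $f_{Y|X=x_i}$ is known, then
\begin{equation*}
P(X=x_i \mid Y=y)=\frac{f_{Y|X=x_i}(y)\, P(X=x_i)}{f_Y(y)}
\end{equation*}
\end{itemize}

Then we can write

\begin{eqnarray*}
P(\kappa=1 \mid \theta,\xi=x) &=& \frac{f_{\xi|\theta,\kappa=1}(x)\, p_1}{f_{\xi|\theta}(x)}\\
&=& \frac{P(\kappa=1)\, f_{\xi|\theta,\kappa=1}(x)}{\sum_{k=1}^{2} P(\kappa=k)\, f_{\xi|\theta,\kappa=k}(x)}\\
&=& \frac{1}{1+\exp\left(-\left(\frac{(\mu_1-\mu_2)\,x}{\sigma^2}+\frac{\mu_2^2-\mu_1^2}{2\sigma^2}+\log\frac{p_1}{p_2}\right)\right)}\\
P(\kappa=2 \mid \theta,\xi=x) &=& \frac{1}{1+\exp\left(-\left(\frac{(\mu_2-\mu_1)\,x}{\sigma^2}-\frac{\mu_2^2-\mu_1^2}{2\sigma^2}-\log\frac{p_1}{p_2}\right)\right)}
\end{eqnarray*}

For brevity we denote

\begin{eqnarray*}
p_{k|i} &:=& P(\kappa=k \mid \theta,\xi=x_i)
\end{eqnarray*}

Now we assume that the parameters $\mu$ are unknown and we wish to infer them from the sample $\{x_i\}$. We can derive

\begin{eqnarray*}
f_{\mu|\xi=\{x_i\},\sigma}(\mu) &=& \frac{f_{\xi|\mu,\sigma}(\{x_i\})\, f_{\mu|\sigma}(\mu)}{f_{\xi|\sigma}(\{x_i\})}
\end{eqnarray*}

and, treating the samples as independent,

\begin{equation*}
f_{\xi|\mu,\sigma}(\{x_i\}) = \prod_i f_{\xi|\theta}(x_i).
\end{equation*}

We can reason that the most probable guess for $\mu$ is the maximizer of the product in the numerator of the fraction above, since the denominator does not depend on $\mu$. If we assume a non-committal prior distribution $f_{\mu|\sigma}$, we need to maximize the conditional density of $\{x_i\}$ given $\mu$, i.e. the likelihood $f_{\xi|\mu,\sigma}(\{x_i\})$. It will be easier to maximize the natural logarithm of this quantity, and we denote for brevity

\begin{eqnarray*}
L(\mu) &:=& \ln\left(f_{\xi|\mu,\sigma}(\{x_i\})\right).
\end{eqnarray*}

To find the maximum of $L$, we use the Newton method on the gradient of $L$. For that we have to find the gradient and the Hessian of $L$ first.

\begin{eqnarray*}
\frac{\partial}{\partial\mu_k}L(\mu) &=& \sum_i p_{k|i}\,\frac{x_i-\mu_k}{\sigma^2}\\
\frac{\partial^2}{\partial\mu_k^2}L(\mu) &=& -\sum_i p_{k|i}\,\frac{1}{\sigma^2}+\sum_i \frac{\partial p_{k|i}}{\partial\mu_k}\,\frac{x_i-\mu_k}{\sigma^2}\\
&\approx& -\sum_i p_{k|i}\,\frac{1}{\sigma^2}
\end{eqnarray*}

where we will use the approximation in the last line, which neglects the dependence of $p_{k|i}$ on $\mu$ (and with it the mixed derivatives). The Hessian is thus (approximately) the $2\times 2$ diagonal matrix

\begin{equation*}
H\approx-\frac{1}{\sigma^2}\,\mathrm{diag}\left(\sum_i p_{1|i},\ \sum_i p_{2|i}\right)
\end{equation*}

The Newton method thus is

\begin{eqnarray*}
\mu_k^{\mathrm{new}} &=& \mu_k-H_{kk}^{-1}\left(\sum_i p_{k|i}\,\frac{x_i-\mu_k}{\sigma^2}\right)\\
&=& \mu_k+\frac{\sum_i p_{k|i}\, x_i}{\sum_i p_{k|i}}-\frac{\sum_i p_{k|i}\,\mu_k}{\sum_i p_{k|i}}\\
&=& \frac{\sum_i p_{k|i}\, x_i}{\sum_i p_{k|i}}
\end{eqnarray*}

Intuitively, this means putting the new cluster centers at the probabilistically weighted center of mass of all data points. The weighting is according to the ``responsibility'' of a cluster for a data point, i.e.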
data points that are regarded as unrelated to a cluster will not have much influence on its new center.

The algorithm consists of two parts: First, we need to calculate, for the current guess of the parameters, how likely it is that each data point was generated by each cluster. This means getting the values of all $p_{k|i}$. We can interpret the $p_{k|i}$ as responsibilities: $p_{1|i}$ and $p_{2|i}$ add to 1, so if one of them is near 1, we say that this cluster takes responsibility for $x_i$. This step is also called the assignment step, as we fuzzily assign clusters (we will in general not have $p_{1|i}=1$ and $p_{2|i}=0$ for most samples, so the responsibility is ``fuzzy''). Then we need to update our current guess for $\mu$ by the formula above. This step is called