Improved Seeding For Clustering With K-Means++ | The Data Science Lab
If the target distribution consists of several disjoint clusters and only a single run of Lloyd's algorithm is used, there is a danger that the local minimum it converges to is not the globally optimal solution. The k-means++ seeding procedure mitigates this by spreading the initial centers out across the data, and it comes with a theoretical guarantee: the solution it finds is O(log k)-competitive with the optimal k-means solution in expectation.
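For reference, the precise guarantee proved by Arthur and Vassilvitskii (2007), of which the O(log k) statement above is a summary, bounds the expected k-means potential $\phi$ (the sum of squared distances of every point to its closest center) after seeding:

$\mathrm{E}[\phi] \le 8(\ln k + 2)\,\phi_{\mathrm{OPT}}$

where $\phi_{\mathrm{OPT}}$ is the potential of the optimal clustering.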
- choose an initial center $c_1$ uniformly at random from $X$. Compute the vector containing the squared distances between all points in the dataset and $c_1$: $D^2_j = \lVert x_j - c_1 \rVert^2$
- choose a second center $c_2$ from $X$, randomly drawn from the probability distribution $p_j = D^2_j / \sum_j D^2_j$
- recompute the distance vector as $D^2_j = \min\left(\lVert x_j - c_1 \rVert^2, \lVert x_j - c_2 \rVert^2\right)$
- choose a successive center $c_i$ and recompute the distance vector as $D^2_j = \min_{1 \le l \le i} \lVert x_j - c_l \rVert^2$
- when exactly k centers have been chosen, finalize the initialization phase and proceed with the standard k-means algorithm
import random

import numpy as np
import matplotlib.pyplot as plt


class KPlusPlus(KMeans):
    # Extends a KMeans base class (assumed to be defined elsewhere) that
    # provides the dataset X, its size N, the number of clusters K and
    # the standard Lloyd iteration.

    def _dist_from_centers(self):
        # For every point in X, store the squared distance to the
        # closest center chosen so far.
        cent = self.mu
        X = self.X
        D2 = np.array([min([np.linalg.norm(x - c)**2 for c in cent]) for x in X])
        self.D2 = D2

    def _choose_next_center(self):
        # Draw the next center from the distribution proportional to D2,
        # via inverse transform sampling on the cumulative probabilities.
        self.probs = self.D2 / self.D2.sum()
        self.cumprobs = self.probs.cumsum()
        r = random.random()
        ind = np.where(self.cumprobs >= r)[0][0]
        return self.X[ind]

    def init_centers(self):
        # First center: uniform over the dataset; then add centers one
        # at a time until K of them have been chosen.
        self.mu = random.sample(self.X, 1)
        while len(self.mu) < self.K:
            self._dist_from_centers()
            self.mu.append(self._choose_next_center())

    def plot_init_centers(self):
        # Scatter plot of the dataset with the seeded centers in red.
        # The list(...) wrappers around zip make the indexing work under
        # Python 3; the original code targeted Python 2.
        X = self.X
        fig = plt.figure(figsize=(5, 5))
        plt.xlim(-1, 1)
        plt.ylim(-1, 1)
        plt.plot(list(zip(*X))[0], list(zip(*X))[1], '.', alpha=0.5)
        plt.plot(list(zip(*self.mu))[0], list(zip(*self.mu))[1], 'ro')
        plt.savefig('kpp_init_N%s_K%s.png' % (str(self.N), str(self.K)),
                    bbox_inches='tight', dpi=200)
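A minimal usage sketch follows; the constructor signature KPlusPlus(K, X=..., N=...) is our assumption about the KMeans base class, which is not reproduced in this post:

import random
import numpy as np

# Hypothetical driver code: 200 random points on the square [-1, 1]^2,
# seeded with K=5 centers. The KPlusPlus(K, X=..., N=...) signature is
# an assumption about the base class defined elsewhere.
X = [np.array([random.uniform(-1, 1), random.uniform(-1, 1)])
     for _ in range(200)]
kpp = KPlusPlus(5, X=X, N=200)
kpp.init_centers()        # k-means++ seeding
kpp.plot_init_centers()   # writes kpp_init_N200_K5.png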
The first center is chosen uniformly at random inside init_centers(); then the successive centers are picked, stopping when we have K=5 of them. The procedure to choose the next most suitable center is coded up in the _choose_next_center() function. As described above, the next center is drawn from the distribution given by the normalized distance vector $D^2_j / \sum_j D^2_j$. To implement such a probability distribution, we compute the cumulative probabilities for choosing each of the N points in X. These cumulative probabilities partition the interval [0, 1] into segments whose lengths equal the probability of the corresponding point being chosen as a center, as explained in this stackoverflow thread. Therefore, by picking a random value in [0, 1] and finding the segment of the partition where that value falls, we effectively choose a point drawn from the desired probability distribution. A plot of the seeding results for 200 points and 5 clusters is shown in the original article.
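To make this inverse transform sampling trick concrete, here is a small self-contained sketch; the function name weighted_choice and the example weights are ours, not from the original post:

import numpy as np

rng = np.random.default_rng()

def weighted_choice(weights):
    # 'weights' plays the role of the squared-distance vector D2.
    probs = weights / weights.sum()   # normalize to a probability distribution
    cumprobs = probs.cumsum()         # cumulative sums partition [0, 1]
    r = rng.random()                  # uniform draw in [0, 1)
    # index of the first partition segment whose upper boundary reaches r
    return int(np.searchsorted(cumprobs, r))

# Sanity check: index 2 carries 70% of the total weight, so it should be
# drawn roughly 70% of the time.
draws = [weighted_choice(np.array([2.0, 1.0, 7.0])) for _ in range(10000)]
print(np.bincount(draws) / len(draws))   # approximately [0.2, 0.1, 0.7]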