Friday, November 21, 2014

Improved Seeding For Clustering With K-Means++ | The Data Science Lab
If the target distribution is disjointedly clustered and only a single run of Lloyd’s algorithm is used, there is a danger that the local minimum it reaches is not the optimal solution.

The k-means++ seeding strategy addresses this by spreading the initial centers out across the data, and it comes with a theoretical guarantee: in expectation, the solution found is O(log k)-competitive with the optimal k-means solution. The seeding proceeds as follows:
  • choose an initial center c_1 uniformly at random from X. Compute the vector containing the squared distances between all points in the dataset and c_1: D_i^2 = ||\mathrm{x}_i - c_1 ||^2
  • choose a second center c_2 from X, drawn at random according to the probability distribution D_i^2 / \sum_j D_j^2
  • recompute the distance vector as D_i^2 = \mathrm{min} \left(||\mathrm{x}_i - c_1 ||^2, ||\mathrm{x}_i - c_2 ||^2\right)
  • draw each successive center c_l from the updated distribution in the same way, recomputing the distance vector after every choice as D_i^2 = \mathrm{min} \left(||\mathrm{x}_i - c_1 ||^2, \ldots, ||\mathrm{x}_i - c_l ||^2\right)
  • when exactly k centers have been chosen, finalize the initialization phase and proceed with the standard k-means algorithm
import random

import numpy as np
import matplotlib.pyplot as plt

# KPlusPlus extends the KMeans class defined in the previous post of this
# series; it assumes KMeans provides self.X (the data points, here taken to
# be a list of coordinate tuples), self.K, self.N and the center list self.mu.
class KPlusPlus(KMeans):
    def _dist_from_centers(self):
        # Squared distance from every point to its nearest current center.
        cent = self.mu
        X = self.X
        D2 = np.array([min([np.linalg.norm(np.array(x) - np.array(c))**2
                            for c in cent]) for x in X])
        self.D2 = D2

    def _choose_next_center(self):
        # Draw the next center with probability proportional to D_i^2.
        self.probs = self.D2 / self.D2.sum()
        self.cumprobs = self.probs.cumsum()
        r = random.random()
        # First index whose cumulative probability reaches r: this samples a
        # point from the normalized D^2 distribution.
        ind = np.where(self.cumprobs >= r)[0][0]
        return self.X[ind]

    def init_centers(self):
        # k-means++ seeding: first center uniform, the rest D^2-weighted.
        self.mu = random.sample(self.X, 1)
        while len(self.mu) < self.K:
            self._dist_from_centers()
            self.mu.append(self._choose_next_center())

    def plot_init_centers(self):
        X = self.X
        fig = plt.figure(figsize=(5, 5))
        plt.xlim(-1, 1)
        plt.ylim(-1, 1)
        # zip(*...) transposes the (x, y) tuples into coordinate sequences;
        # list() makes the indexing work under Python 3 as well.
        plt.plot(list(zip(*X))[0], list(zip(*X))[1], '.', alpha=0.5)
        plt.plot(list(zip(*self.mu))[0], list(zip(*self.mu))[1], 'ro')
        plt.savefig('kpp_init_N%s_K%s.png' % (str(self.N), str(self.K)),
                    bbox_inches='tight', dpi=200)
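A minimal usage sketch follows; the KMeans constructor signature used here is an assumption, so it should be adapted to the base class from the previous post:

import random

# Hypothetical dataset: 200 points scattered uniformly in [-1, 1] x [-1, 1].
X = [(random.uniform(-1, 1), random.uniform(-1, 1)) for _ in range(200)]

kpp = KPlusPlus(5, X=X, N=200)  # assumed signature: K, data, number of points
kpp.init_centers()              # run the k-means++ seeding
kpp.plot_init_centers()         # saves kpp_init_N200_K5.png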

The first center is chosen uniformly at random from the data; then the successive centers are picked, stopping when we have K=5 of them. The procedure to choose the next most suitable center is coded in the _choose_next_center() function. As described above, the next center is drawn from the distribution given by the normalized distance vector D_i^2 / \sum_j D_j^2. To implement this distribution, we compute the cumulative probabilities of choosing each of the N points in X. These cumulative probabilities partition the interval [0,1] into segments whose lengths equal the probability of the corresponding point being chosen as a center, as explained in this stackoverflow thread. Therefore, by picking a random value r \in [0,1] and finding the point corresponding to the segment of the partition in which r falls, we are effectively drawing a point according to the desired probability distribution. The original article includes a plot showing the results of the algorithm for 200 points and 5 clusters.
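The cumulative-probability trick is easy to verify in isolation. Below is a minimal, self-contained sketch; the D^2 values are made up for illustration, and np.searchsorted plays the role of the np.where lookup above:

import numpy as np

rng = np.random.default_rng(0)

# Hypothetical squared distances: point 2 lies far from all current centers,
# so it should be selected most often.
D2 = np.array([1.0, 4.0, 9.0, 2.0])
probs = D2 / D2.sum()        # [0.0625, 0.25, 0.5625, 0.125]
cumprobs = probs.cumsum()    # [0.0625, 0.3125, 0.875, 1.0]

# Each point owns a segment of [0, 1] with length equal to its probability,
# so a uniform r lands on a given point with exactly that probability.
counts = np.zeros(len(D2), dtype=int)
for _ in range(10000):
    r = rng.random()
    ind = np.searchsorted(cumprobs, r)  # first index with cumprobs[ind] >= r
    counts[ind] += 1

print(counts / 10000)  # approaches probs: point 2 is picked ~56% of the time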
Read full article from Improved Seeding For Clustering With K-Means++ | The Data Science Lab
