Friday, November 21, 2014

K-means++ - Wikipedia, the free encyclopedia


K-means++ - Wikipedia, the free encyclopedia
k-means++ is an algorithm for choosing the initial values (or "seeds") for the k-means clustering algorithm.
It's a way of avoiding the sometimes poor clusterings found by the standard k-means algorithm.
The k-means problem is to find cluster centers that minimize the intra-class variance, i.e. the sum of squared distances from each data point being clustered to its cluster center (the center that is closest to it).
The k-means++ algorithm addresses the second of these obstacles by specifying a procedure to initialize the cluster centers before proceeding with the standard k-means optimization iterations. With the k-means++ initialization, the algorithm is guaranteed to find a solution that is O(log k) competitive to the optimal k-means solution.
The intuition behind this approach is that spreading out the k initial cluster centers is a good thing: the first cluster center is chosen uniformly at random from the data points that are being clustered, after which each subsequent cluster center is chosen from the remaining data points with probability proportional to its squared distance from the point's closest existing cluster center.
The exact algorithm is as follows:
  1. Choose one center uniformly at random from among the data points.
  2. For each data point x, compute D(x), the distance between x and the nearest center that has already been chosen.
  3. Choose one new data point at random as a new center, using a weighted probability distribution where a point x is chosen with probability proportional to D(x)2.
  4. Repeat Steps 2 and 3 until k centers have been chosen.
  5. Now that the initial centers have been chosen, proceed using standard k-means clustering.
This seeding method yields considerable improvement in the final error of k-means. Although the initial selection in the algorithm takes extra time, the k-means part itself converges very quickly after this seeding and thus the algorithm actually lowers the computation time. The authors tested their method with real and synthetic datasets and obtained typically 2-fold improvements in speed, and for certain datasets, close to 1000-fold improvements in error. In these simulations the new method almost always performed at least as well as vanilla k-means in both speed and error.
Additionally, the authors calculate an approximation ratio for their algorithm. The k-means++ algorithm guarantees an approximation ratio O(log k) in expectation (over the randomness of the algorithm), where k is the number of clusters used. This is in contrast to vanilla k-means, which can generate clusterings arbitrarily worse than the optimum. 
Read full article from K-means++ - Wikipedia, the free encyclopedia

No comments:

Post a Comment

Note: Only a member of this blog may post a comment.