Improved Seeding For Clustering With K-Means++ | The Data Science Lab
If the target distribution consists of several disjoint clusters and only a single run of Lloyd's algorithm is used, there is a danger that the local minimum it converges to is not the globally optimal solution. The k-means++ seeding procedure mitigates this by spreading the initial centers out across the data, and it comes with a theoretical guarantee: the solution it finds is O(log k)-competitive with the optimal k-means solution in expectation.
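For reference, the precise guarantee proved by Arthur and Vassilvitskii (2007), of which the O(log k) statement above is a summary, bounds the expected k-means potential $\phi$ (the sum of squared distances of every point to its closest center) after seeding:

$\mathrm{E}[\phi] \le 8(\ln k + 2)\,\phi_{\mathrm{OPT}}$

where $\phi_{\mathrm{OPT}}$ is the potential of the optimal clustering.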
- choose an initial center $c_1$ uniformly at random from $X$. Compute the vector containing the squared distances between all points in the dataset and $c_1$: $D^2_j = \lVert x_j - c_1 \rVert^2$
- choose a second center $c_2$ from $X$, randomly drawn from the probability distribution $p_j = D^2_j / \sum_j D^2_j$
- recompute the distance vector as $D^2_j = \min\left(\lVert x_j - c_1 \rVert^2, \lVert x_j - c_2 \rVert^2\right)$
- choose a successive center $c_i$ and recompute the distance vector as $D^2_j = \min_{1 \le l \le i} \lVert x_j - c_l \rVert^2$
- when exactly k centers have been chosen, finalize the initialization phase and proceed with the standard k-means algorithm
import random

import numpy as np
import matplotlib.pyplot as plt


class KPlusPlus(KMeans):
    # Extends a KMeans base class (assumed to be defined elsewhere) that
    # provides the dataset X, its size N, the number of clusters K and
    # the standard Lloyd iteration.

    def _dist_from_centers(self):
        # For every point in X, store the squared distance to the
        # closest center chosen so far.
        cent = self.mu
        X = self.X
        D2 = np.array([min([np.linalg.norm(x - c)**2 for c in cent]) for x in X])
        self.D2 = D2

    def _choose_next_center(self):
        # Draw the next center from the distribution proportional to D2,
        # via inverse transform sampling on the cumulative probabilities.
        self.probs = self.D2 / self.D2.sum()
        self.cumprobs = self.probs.cumsum()
        r = random.random()
        ind = np.where(self.cumprobs >= r)[0][0]
        return self.X[ind]

    def init_centers(self):
        # First center: uniform over the dataset; then add centers one
        # at a time until K of them have been chosen.
        self.mu = random.sample(self.X, 1)
        while len(self.mu) < self.K:
            self._dist_from_centers()
            self.mu.append(self._choose_next_center())

    def plot_init_centers(self):
        # Scatter plot of the dataset with the seeded centers in red.
        # The list(...) wrappers around zip make the indexing work under
        # Python 3; the original code targeted Python 2.
        X = self.X
        fig = plt.figure(figsize=(5, 5))
        plt.xlim(-1, 1)
        plt.ylim(-1, 1)
        plt.plot(list(zip(*X))[0], list(zip(*X))[1], '.', alpha=0.5)
        plt.plot(list(zip(*self.mu))[0], list(zip(*self.mu))[1], 'ro')
        plt.savefig('kpp_init_N%s_K%s.png' % (str(self.N), str(self.K)),
                    bbox_inches='tight', dpi=200)
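A minimal usage sketch follows; the constructor signature KPlusPlus(K, X=..., N=...) is our assumption about the KMeans base class, which is not reproduced in this post:

import random
import numpy as np

# Hypothetical driver code: 200 random points on the square [-1, 1]^2,
# seeded with K=5 centers. The KPlusPlus(K, X=..., N=...) signature is
# an assumption about the base class defined elsewhere.
X = [np.array([random.uniform(-1, 1), random.uniform(-1, 1)])
     for _ in range(200)]
kpp = KPlusPlus(5, X=X, N=200)
kpp.init_centers()        # k-means++ seeding
kpp.plot_init_centers()   # writes kpp_init_N200_K5.png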
The first center is chosen uniformly at random inside init_centers(); then the successive centers are picked, stopping when we have K=5 of them. The procedure to choose the next most suitable center is coded up in the _choose_next_center() function. As described above, the next center is drawn from the distribution given by the normalized distance vector $D^2_j / \sum_j D^2_j$. To implement such a probability distribution, we compute the cumulative probabilities for choosing each of the N points in X. These cumulative probabilities partition the interval [0, 1] into segments whose lengths equal the probability of the corresponding point being chosen as a center, as explained in this stackoverflow thread. Therefore, by picking a random value in [0, 1] and finding the segment of the partition where that value falls, we effectively choose a point drawn from the desired probability distribution. A plot of the seeding results for 200 points and 5 clusters is shown in the original article.
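To make this inverse transform sampling trick concrete, here is a small self-contained sketch; the function name weighted_choice and the example weights are ours, not from the original post:

import numpy as np

rng = np.random.default_rng()

def weighted_choice(weights):
    # 'weights' plays the role of the squared-distance vector D2.
    probs = weights / weights.sum()   # normalize to a probability distribution
    cumprobs = probs.cumsum()         # cumulative sums partition [0, 1]
    r = rng.random()                  # uniform draw in [0, 1)
    # index of the first partition segment whose upper boundary reaches r
    return int(np.searchsorted(cumprobs, r))

# Sanity check: index 2 carries 70% of the total weight, so it should be
# drawn roughly 70% of the time.
draws = [weighted_choice(np.array([2.0, 1.0, 7.0])) for _ in range(10000)]
print(np.bincount(draws) / len(draws))   # approximately [0.2, 0.1, 0.7]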