Tuesday, November 18, 2014

Segmenting Audience with KMeans and Voronoi Diagram using Spark and MLlib | Chimpler


We will be using the k-means clustering algorithm implemented in the Spark machine learning library (MLlib) to segment the dataset by geolocation.

The k-means clustering algorithm is unsupervised, meaning that it does not need labeled training examples to work (unlike neural networks, SVMs, or naive Bayes classifiers). It partitions observations into clusters so that each observation belongs to the cluster with the nearest mean. The algorithm takes as input the observations, the number of clusters (denoted k) to partition them into, and the number of iterations. It returns the centers of the clusters.
The algorithm works as follows:
  1. Take k random observations out of the dataset and use them as the initial centers of the k clusters
  2. For each observation, find the closest cluster center and assign the observation to that cluster
  3. For each cluster, compute the new center by averaging the features of the observations assigned to that cluster
  4. Go back to step 2 and repeat for the given number of iterations
The centers of the clusters converge toward a minimum of the cost function, which is the sum of the squared distances from each observation to its assigned cluster center.
This minimum may be only a local optimum, and it depends on the observations that were randomly picked at the beginning of the algorithm.
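The four steps and the cost function above can be sketched in plain Scala, without Spark. This is only an illustration under stated assumptions, not MLlib's implementation: the names KMeansSketch, train, closest and cost are made up for this example, points are plain Scala Vector[Double] values (not MLlib vectors), and an explicit seed is added so that a run can be repeated.

```scala
import scala.util.Random

// Illustrative sketch only; all names here are hypothetical, not MLlib's API.
object KMeansSketch {
  type Point = Vector[Double]

  // Squared Euclidean distance between two points of equal dimension.
  def squaredDistance(a: Point, b: Point): Double =
    a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum

  // Step 2: index of the center closest to the point.
  def closest(centers: Seq[Point], p: Point): Int =
    centers.indices.minBy(i => squaredDistance(centers(i), p))

  // Cost: sum of squared distances from each point to its closest center.
  def cost(centers: Seq[Point], points: Seq[Point]): Double =
    points.map(p => squaredDistance(centers(closest(centers, p)), p)).sum

  def train(points: Seq[Point], k: Int, iterations: Int, seed: Long): Seq[Point] = {
    // Step 1: pick k random observations as the initial centers.
    var centers: Seq[Point] = new Random(seed).shuffle(points).take(k)
    // Step 4: repeat steps 2 and 3 for the given number of iterations.
    for (_ <- 1 to iterations) {
      // Step 2: assign each observation to its closest center.
      val assigned = points.groupBy(p => closest(centers, p))
      // Step 3: move each center to the mean of its observations;
      // a center with no observations is left where it is.
      centers = centers.indices.map { i =>
        assigned.get(i) match {
          case Some(members) =>
            val dim = members.head.length
            Vector.tabulate(dim)(d => members.map(_(d)).sum / members.size)
          case None => centers(i)
        }
      }
    }
    centers
  }
}
```

Calling KMeansSketch.train with different seeds on the same points is a simple way to see the dependence on the initial centers; comparing KMeansSketch.cost across those runs shows which result is the better one.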
In this post, we are going to listen to a tweet stream to get tweets with their geolocation and then apply the k-means algorithm on their coordinates to find geographical clusters.

Fetching the tweets

Twitter provides an API to continuously listen to a stream of tweets. To use the API, you need Twitter API keys and access tokens.
To get those, log in at https://apps.twitter.com/
Click on “Create New App”. Fill in the Name, Description, Website and Callback URL, then click on “Create your twitter application”.

Running k-means with Spark

To run the k-means algorithm in Spark, we first need to read the CSV file:
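import org.apache.spark.SparkContext
import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors

val sc = new SparkContext("local[4]", "kmeans")
// Load and parse the data; only the latitude and longitude (the first two fields) of each line are kept.
// `arg` is the path to the CSV file.
val data = sc.textFile(arg)
val parsedData = data.map { line =>
  Vectors.dense(line.split(',').slice(0, 2).map(_.toDouble))
}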
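The parsing above throws an exception on any line whose first two fields are not numbers. A more tolerant parser might look like the sketch below. It is an assumption-laden example rather than part of the original code: TweetCsv and parseLatLon are hypothetical names, and the range check assumes that latitude comes first and longitude second.

```scala
import scala.util.Try

// Illustrative sketch only; TweetCsv and parseLatLon are hypothetical names.
object TweetCsv {
  // Parses the first two comma-separated fields of a line as (latitude, longitude).
  // Returns None when a field is missing, is not a number, or is out of range.
  // Assumption: latitude is the first field and longitude the second.
  def parseLatLon(line: String): Option[(Double, Double)] =
    line.split(',').map(_.trim) match {
      case Array(lat, lon, _*) =>
        for {
          la <- Try(lat.toDouble).toOption if la >= -90.0 && la <= 90.0
          lo <- Try(lon.toDouble).toOption if lo >= -180.0 && lo <= 180.0
        } yield (la, lo)
      case _ => None
    }
}
```

Because the function returns an Option, malformed lines can be dropped instead of failing the whole job, at the price of silently ignoring bad input.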
Then we can run the Spark k-means algorithm:
val iterationCount = 100
val clusterCount = 10
val model = KMeans.train(parsedData, clusterCount, iterationCount)
The model object exposes the following methods:
  • clusterCenters: returns the center of each cluster
  • computeCost(data): returns the cost (the sum of the squared distances from each tweet to its cluster center). The lower it is, the better.
  • predict(Vector): returns the id of the cluster whose center is closest to the given point
From the model we can get the cluster centers and group the tweets by cluster:
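val clusterCenters = model.clusterCenters.map(_.toArray)

// Sum of the squared distances from each point to its closest cluster center
val cost = model.computeCost(parsedData)
println("Cost: " + cost)

// Group the coordinates of each tweet by the id of its predicted cluster
val tweetsByGroup = data
  .map(_.split(',').slice(0, 2).map(_.toDouble))
  .groupBy(coords => model.predict(Vectors.dense(coords)))
  .collect()
sc.stop()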
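To make the shape of the grouped result easier to picture, here is a small local sketch of the same grouping on ordinary Scala collections. It does not use Spark: GroupingSketch, nearest and clusterSizes are hypothetical names, and nearest merely stands in for model.predict by picking the center with the smallest squared Euclidean distance.

```scala
// Illustrative sketch only; all names here are hypothetical.
object GroupingSketch {
  // Index of the center nearest to the point (squared Euclidean distance),
  // standing in for model.predict in this local example.
  def nearest(centers: Seq[Array[Double]], p: Array[Double]): Int =
    centers.indices.minBy { i =>
      centers(i).zip(p).map { case (c, x) => (c - x) * (c - x) }.sum
    }

  // Number of points assigned to each cluster id.
  def clusterSizes(centers: Seq[Array[Double]],
                   points: Seq[Array[Double]]): Map[Int, Int] =
    points
      .groupBy(p => nearest(centers, p))
      .map { case (id, members) => id -> members.size }
}
```

The keys of the resulting map are cluster ids and the values are the number of points in each cluster, which gives a quick view of how large each segment is.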
Note that if you run k-means multiple times on the same data set, you can get different results, because the cluster centers are initialized at random at the beginning of the algorithm.
Read full article from Segmenting Audience with KMeans and Voronoi Diagram using Spark and MLlib | Chimpler
