Monday, November 17, 2014

Apache Spark User List - Machine Learning on streaming data



Here's a gist with two examples I have working, one for StreamingLinearRegression and another for StreamingKMeans.


The goal in each case was to implement a streaming version of the algorithm, reusing as much of MLlib as possible. For linear regression this was straightforward: the MLlib version already uses a (stochastic) gradient update rule, so I just apply it to the model inside a foreachRDD(), once per new batch of data. For KMeans, I used the model class from MLlib but extended it to keep a running count of points for each cluster. I also had to re-implement a chunk of the core algorithm as an update rule. Tighter integration in this case would, I think, require refactoring some of MLlib (e.g. to use something like this update function), but the current approach works fine.
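To make the two update rules concrete, here is a minimal sketch in plain Python. It is not the code from the gist and does not call Spark or MLlib. The function names `sgd_update` and `kmeans_update`, the constant step size, and the list-based data layout are all assumptions made for illustration. Each function does the per-batch work that would run inside a foreachRDD().

```python
# Plain-Python sketch of the per-batch update rules described above.
# All names are hypothetical; this is not MLlib's API.


def sgd_update(weights, batch, step_size=0.1):
    """One gradient step for least-squares regression on one mini-batch.

    batch is a list of (features, label) pairs. A constant step size is
    an assumption; a real implementation may decay it over time.
    """
    if not batch:
        return list(weights)
    grad = [0.0] * len(weights)
    for x, y in batch:
        # Prediction error for this record under the current weights.
        err = sum(w * xi for w, xi in zip(weights, x)) - y
        for j, xi in enumerate(x):
            grad[j] += err * xi
    n = len(batch)
    return [w - step_size * g / n for w, g in zip(weights, grad)]


def kmeans_update(centers, counts, batch):
    """Fold one mini-batch into the centers using running per-cluster counts.

    Each center becomes the count-weighted mean of its old position and
    the new points assigned to it.
    """
    k, dim = len(centers), len(centers[0])
    sums = [[0.0] * dim for _ in range(k)]
    batch_counts = [0] * k
    for x in batch:
        # Assign the point to the nearest center (squared Euclidean distance).
        nearest = min(
            range(k),
            key=lambda i: sum((c - xi) ** 2 for c, xi in zip(centers[i], x)),
        )
        batch_counts[nearest] += 1
        for j, xi in enumerate(x):
            sums[nearest][j] += xi
    new_centers, new_counts = [], []
    for i in range(k):
        total = counts[i] + batch_counts[i]
        if batch_counts[i] == 0:
            # No new points for this cluster: keep the center unchanged.
            new_centers.append(list(centers[i]))
        else:
            new_centers.append(
                [(centers[i][j] * counts[i] + sums[i][j]) / total for j in range(dim)]
            )
        new_counts.append(total)
    return new_centers, new_counts
```

The running counts are what let the k-means step treat earlier batches and the current batch on equal footing: a cluster that has already absorbed many points moves less when a few new ones arrive.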

One unresolved issue: for these kinds of algorithms, the dimensionality of the data must be known in advance so the model can be initialized. It would be cool to detect it automatically from the first record.
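One possible way to do that, sketched in plain Python under stated assumptions: leave the weights unset until the first non-empty batch arrives, then size them from the first record. The class `LazyLinearModel` is hypothetical and exists neither in the gist nor in MLlib. The zero initialization and the constant step size are also assumptions.

```python
class LazyLinearModel:
    """Hypothetical model that sizes its weight vector from the first record."""

    def __init__(self):
        self.weights = None  # dimensionality unknown until data arrives

    def update(self, batch, step_size=0.1):
        """Apply one least-squares gradient step; batch is [(features, label), ...]."""
        if not batch:
            return  # nothing to learn from, and still nothing to size from
        if self.weights is None:
            # Infer the dimensionality from the first record seen.
            self.weights = [0.0] * len(batch[0][0])
        dim = len(self.weights)
        grad = [0.0] * dim
        for x, y in batch:
            if len(x) != dim:
                raise ValueError(
                    "expected %d features, got %d" % (dim, len(x))
                )
            err = sum(w * xi for w, xi in zip(self.weights, x)) - y
            for j, xi in enumerate(x):
                grad[j] += err * xi
        n = len(batch)
        self.weights = [w - step_size * g / n for w, g in zip(self.weights, grad)]
```

In a real streaming job the first batch may be empty, which is why the sketch defers initialization rather than failing. Once the size is fixed, later records with a different length are rejected.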

