Thursday 5 March 2015

K-means in Spark Streaming

If the source of the data is constant, the streaming algorithm converges to roughly the same solution as running k-means offline on the entire accumulated data set. But if the sources of data change over time, how can we make the model reflect those changes?
We have extended the algorithm to support forgetfulness (decay), allowing the model to adapt to changes over time. The key trick is a new parameter that balances the relative importance of new data versus past history. With an appropriate setting of this parameter, cluster centers smoothly adapt to dynamic changes in the data.
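
Concretely, with decay factor a, each batch update computes a weighted average of the old center (its accumulated weight discounted by a) and the mean of the new batch's points. Below is a minimal sketch of that rule in plain Scala; updateCenter and its parameter names are illustrative, not part of the Spark API:

 // c: current center with accumulated weight n (points seen so far)
 // x: mean of the new batch's points assigned to this cluster, m points
 // a: decay factor in [0, 1]; a = 1 keeps all history, a = 0 uses only the newest batch
 def updateCenter(c: Array[Double], n: Double,
                  x: Array[Double], m: Double,
                  a: Double): (Array[Double], Double) = {
   val w = n * a + m  // combined weight after discounting the past
   val center = c.zip(x).map { case (ci, xi) => (ci * n * a + xi * m) / w }
   (center, w)
 }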


 import org.apache.spark.mllib.clustering.StreamingKMeans

 // trainingData: DStream[Vector]; testData: DStream[LabeledPoint] (construction sketched below)
 val model = new StreamingKMeans()
   .setK(args(3).toInt)                   // number of clusters
   .setDecayFactor(1.0)                   // 1.0 = full memory of past data
   .setRandomCenters(args(4).toInt, 0.0)  // dimensionality and initial weight

 model.trainOn(trainingData)
 model.predictOnValues(testData.map(lp => (lp.label, lp.features))).print()
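
Here trainingData and testData are DStreams of vectors and labeled points. In the StreamingKMeansExample linked under References they are built from directories of text files, roughly as follows (the argument positions match the snippet above, and the batch-interval argument args(2) is taken from that example):

 import org.apache.spark.SparkConf
 import org.apache.spark.mllib.linalg.Vectors
 import org.apache.spark.mllib.regression.LabeledPoint
 import org.apache.spark.streaming.{Seconds, StreamingContext}

 val conf = new SparkConf().setAppName("StreamingKMeansExample")
 val ssc = new StreamingContext(conf, Seconds(args(2).toLong))

 // Training vectors arrive as "[x1,x2,...]", test points as "(y,[x1,x2,...])"
 val trainingData = ssc.textFileStream(args(0)).map(Vectors.parse)
 val testData = ssc.textFileStream(args(1)).map(LabeledPoint.parse)

 // After wiring up the model (above), start the streaming job
 ssc.start()
 ssc.awaitTermination()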

So far, Spark 1.1 supports streaming linear regression, and Spark 1.2 adds streaming k-means.
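
The streaming linear regression API follows the same trainOn/predictOnValues pattern. A minimal sketch (numFeatures is an assumed value, and trainingData here must be a DStream[LabeledPoint] rather than raw vectors):

 import org.apache.spark.mllib.linalg.Vectors
 import org.apache.spark.mllib.regression.StreamingLinearRegressionWithSGD

 val numFeatures = 3  // assumption: dimensionality of the incoming feature vectors
 val lrModel = new StreamingLinearRegressionWithSGD()
   .setInitialWeights(Vectors.zeros(numFeatures))  // start from the zero vector

 lrModel.trainOn(trainingData)  // trainingData: DStream[LabeledPoint]
 lrModel.predictOnValues(testData.map(lp => (lp.label, lp.features))).print()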


References:
https://databricks.com/blog/2015/01/28/introducing-streaming-k-means-in-spark-1-2.html
https://github.com/apache/spark/blob/branch-1.2/examples/src/main/scala/org/apache/spark/examples/mllib/StreamingKMeansExample.scala
