Thursday 5 March 2015

K-means in Spark Streaming

If the source of the data is constant, the streaming algorithm converges to roughly the same solution as running k-means offline on the entire accumulated data set. But if the sources of data change over time, how can we make the model reflect those changes?
We have extended the algorithm to support forgetfulness (decay), allowing the model to adapt to changes over time. The key trick is a new parameter that balances the relative importance of new data versus past history. With an appropriate setting of this parameter, cluster centers smoothly adapt to dynamic changes in the data.
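
Concretely, with decay factor a, each batch update computes a weighted average of the old center (its accumulated weight discounted by a) and the mean of the new batch's points. Below is a minimal sketch of that rule in plain Scala; updateCenter and its parameter names are illustrative, not part of the Spark API:

 // c: current center with accumulated weight n (points seen so far)
 // x: mean of the new batch's points assigned to this cluster, m points
 // a: decay factor in [0, 1]; a = 1 keeps all history, a = 0 uses only the newest batch
 def updateCenter(c: Array[Double], n: Double,
                  x: Array[Double], m: Double,
                  a: Double): (Array[Double], Double) = {
   val w = n * a + m  // combined weight after discounting the past
   val center = c.zip(x).map { case (ci, xi) => (ci * n * a + xi * m) / w }
   (center, w)
 }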


 import org.apache.spark.mllib.clustering.StreamingKMeans

 // trainingData: DStream[Vector]; testData: DStream[LabeledPoint] (construction sketched below)
 val model = new StreamingKMeans()
   .setK(args(3).toInt)                   // number of clusters
   .setDecayFactor(1.0)                   // 1.0 = full memory of past data
   .setRandomCenters(args(4).toInt, 0.0)  // dimensionality and initial weight

 model.trainOn(trainingData)
 model.predictOnValues(testData.map(lp => (lp.label, lp.features))).print()
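
Here trainingData and testData are DStreams of vectors and labeled points. In the StreamingKMeansExample linked under References they are built from directories of text files, roughly as follows (the argument positions match the snippet above, and the batch-interval argument args(2) is taken from that example):

 import org.apache.spark.SparkConf
 import org.apache.spark.mllib.linalg.Vectors
 import org.apache.spark.mllib.regression.LabeledPoint
 import org.apache.spark.streaming.{Seconds, StreamingContext}

 val conf = new SparkConf().setAppName("StreamingKMeansExample")
 val ssc = new StreamingContext(conf, Seconds(args(2).toLong))

 // Training vectors arrive as "[x1,x2,...]", test points as "(y,[x1,x2,...])"
 val trainingData = ssc.textFileStream(args(0)).map(Vectors.parse)
 val testData = ssc.textFileStream(args(1)).map(LabeledPoint.parse)

 // After wiring up the model (above), start the streaming job
 ssc.start()
 ssc.awaitTermination()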

So far, Spark 1.1 supports streaming linear regression, and Spark 1.2 adds streaming k-means.
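
The streaming linear regression API follows the same trainOn/predictOnValues pattern. A minimal sketch (numFeatures is an assumed value, and trainingData here must be a DStream[LabeledPoint] rather than raw vectors):

 import org.apache.spark.mllib.linalg.Vectors
 import org.apache.spark.mllib.regression.StreamingLinearRegressionWithSGD

 val numFeatures = 3  // assumption: dimensionality of the incoming feature vectors
 val lrModel = new StreamingLinearRegressionWithSGD()
   .setInitialWeights(Vectors.zeros(numFeatures))  // start from the zero vector

 lrModel.trainOn(trainingData)  // trainingData: DStream[LabeledPoint]
 lrModel.predictOnValues(testData.map(lp => (lp.label, lp.features))).print()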


References:
https://databricks.com/blog/2015/01/28/introducing-streaming-k-means-in-spark-1-2.html
https://github.com/apache/spark/blob/branch-1.2/examples/src/main/scala/org/apache/spark/examples/mllib/StreamingKMeansExample.scala
