We have extended the algorithm to support forgetfulness (decay), allowing the model to adapt to changes over time. The key trick is a new parameter that balances the relative importance of new data versus past history. With an appropriate setting of this parameter, cluster centers smoothly adapt to dynamic changes in the data.
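To make the idea concrete, here is a standalone sketch of the decayed center update as I read it from the MLlib streaming k-means documentation (one dimension for brevity; names like updateCenter are illustrative, not Spark API):

```scala
// Sketch of the forgetting update behind streaming k-means (not Spark code).
// c: current center, n: effective point count so far,
// x: mean of the new batch's points, m: new batch's point count,
// a: decay factor in [0, 1].
def updateCenter(c: Double, n: Double, x: Double, m: Double,
                 a: Double): (Double, Double) = {
  // a = 1.0: all history counts equally; a = 0.0: only the newest batch counts
  val cNew = (c * n * a + x * m) / (n * a + m)
  val nNew = n * a + m  // discounted effective count carried forward
  (cNew, nNew)
}
```

With a = 1.0 (as in the example below) the update reduces to the ordinary running mean over all batches; smaller values let the center drift toward recent data.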
import org.apache.spark.mllib.clustering.StreamingKMeans

val model = new StreamingKMeans()
  .setK(args(3).toInt)                   // number of clusters
  .setDecayFactor(1.0)                   // 1.0 = keep full history; lower values forget faster
  .setRandomCenters(args(4).toInt, 0.0)  // dimensionality, initial weight

model.trainOn(trainingData)
model.predictOnValues(testData.map(lp => (lp.label, lp.features))).print()
So far, Spark 1.1 supports streaming linear regression, and Spark 1.2 adds streaming K-means.
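For comparison, a minimal streaming linear regression sketch using MLlib's StreamingLinearRegressionWithSGD, assuming the same trainingData and testData DStreams of labeled points as the K-means example above (the feature count of 3 is a placeholder):

```scala
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.StreamingLinearRegressionWithSGD

val numFeatures = 3  // placeholder: must match the dimensionality of your data

val model = new StreamingLinearRegressionWithSGD()
  .setInitialWeights(Vectors.zeros(numFeatures))  // start from zero weights

// Update the model as each training batch arrives,
// and print (label, prediction) pairs for the test stream.
model.trainOn(trainingData)
model.predictOnValues(testData.map(lp => (lp.label, lp.features))).print()
```

The train/predict API mirrors StreamingKMeans, so the two examples differ mainly in the model class and its initialization.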
Reference:
https://databricks.com/blog/2015/01/28/introducing-streaming-k-means-in-spark-1-2.html
https://github.com/apache/spark/blob/branch-1.2/examples/src/main/scala/org/apache/spark/examples/mllib/StreamingKMeansExample.scala