Sunday 22 November 2015

Glom in Spark RDD

glom() returns an RDD created by coalescing all elements within each partition into an array. An RDD[T] becomes an RDD[Array[T]] with one array per partition.
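To make the definition concrete, here is a minimal sketch that prints the arrays glom() produces. The object name GlomDemo, the helper partitionArrays, the sample values, and the local[2] master are all assumptions chosen for illustration; only glom(), parallelize() and collect() come from Spark itself.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object GlomDemo {
  // Hypothetical helper: builds a small RDD and returns one array per partition.
  def partitionArrays(sc: SparkContext): Array[Array[Double]] = {
    // Six sample values spread over three partitions (values are made up).
    val dataRDD = sc.parallelize(Seq(1.0, 2.0, 3.0, 4.0, 5.0, 6.0), numSlices = 3)
    // glom() turns RDD[Double] into RDD[Array[Double]]; collect() brings the arrays to the driver.
    dataRDD.glom().collect()
  }

  def main(args: Array[String]): Unit = {
    // Local master is an assumption so the sketch runs without a cluster.
    val conf = new SparkConf().setAppName("glom-demo").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      partitionArrays(sc).zipWithIndex.foreach { case (part, i) =>
        println(s"partition $i: ${part.mkString("[", ", ", "]")}")
      }
    } finally {
      sc.stop()
    }
  }
}
```

Since the RDD has three partitions, three arrays come back. Note that glom() holds each partition in memory as a single array, so it suits partitions that fit comfortably on an executor.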

For example, consider getting the maximum value of an RDD:

val maxValue = dataRDD.reduce(_ max _)

With this approach, the comparison function is applied across all the values one pair at a time.
Rather than comparing all the values one by one:
1. Find the maximum in each partition.
2. Compare the per-partition maximums to get the final max value.

val maxValue = dataRDD.glom().map((row: Array[Double]) => row.max).reduce(_ max _)
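The two steps can be sketched as a small self-contained program. The object name GlomMax, the helper maxViaGlom, the sample values, and the local[2] master are assumptions for illustration. The filter step is also an addition: calling max on an empty array throws an exception, and a partition can be empty when there are fewer elements than partitions.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object GlomMax {
  // Hypothetical helper: per-partition maximum via glom(), then a final reduce.
  def maxViaGlom(dataRDD: RDD[Double]): Double =
    dataRDD.glom()
      .filter(_.nonEmpty)   // drop empty partitions, since max on an empty array throws
      .map(_.max)           // step 1: maximum within each partition
      .reduce(_ max _)      // step 2: maximum across the per-partition maximums

  def main(args: Array[String]): Unit = {
    // Local master is an assumption so the sketch runs without a cluster.
    val conf = new SparkConf().setAppName("glom-max").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      // Five made-up values over four partitions.
      val dataRDD = sc.parallelize(Seq(3.0, 9.5, 1.0, 7.25, 4.0), numSlices = 4)
      println(maxViaGlom(dataRDD)) // prints 9.5
    } finally {
      sc.stop()
    }
  }
}
```

If the RDD itself is empty, reduce still throws, so a caller that cannot rule that out would need to check for it first.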


Reference:
http://blog.madhukaraphatak.com/glom-in-spark/
