Sunday, 28 December 2014

Accumulator variables in Spark

Accumulators are shared variables that provide a simple syntax for aggregating values from worker nodes back to the driver program. One of the most common uses of accumulators is to count events that occur during job execution, for debugging purposes.
  • Accumulators are variables that can only be “added” to through an associative operation, which lets Spark implement counters and sums efficiently in parallel.
  • Spark natively supports accumulators of numeric value types and standard mutable collections, and programmers can add support for new types (a sketch of this appears further below).
  • Only the driver program can read an accumulator’s value, using its value method; tasks running on worker nodes cannot.

val accum = sc.accumulator(0)                               // numeric accumulator, initial value 0
sc.parallelize(Array(1, 2, 3, 4)).foreach(x => accum += x)  // each task adds its partition's elements
accum.value                                                 // read back in the driver: Int = 10

Note that tasks running on worker nodes cannot read the accumulator’s value; from the point of view of these tasks, accumulators are write-only variables.
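
The second bullet above mentions adding support for new accumulator types. In the Spark 1.x API used in this post, that is done by implementing AccumulatorParam, which supplies a zero value and the associative merge operation. Here is a minimal sketch of the idea; the StringSetParam object and the error-code values are hypothetical, made up for illustration:

import org.apache.spark.AccumulatorParam

// Hypothetical custom type: accumulate a set of distinct error codes.
object StringSetParam extends AccumulatorParam[Set[String]] {
  // zero: the identity value each task starts from
  def zero(initial: Set[String]): Set[String] = Set.empty[String]
  // addInPlace: the associative operation that merges partial results
  def addInPlace(s1: Set[String], s2: Set[String]): Set[String] = s1 ++ s2
}

val errorCodes = sc.accumulator(Set.empty[String])(StringSetParam)
sc.parallelize(Seq("E1", "E2", "E1")).foreach(code => errorCodes += Set(code))
errorCodes.value  // Set(E1, E2), readable in the driver only

Because addInPlace must be associative, Spark is free to merge partial results from tasks in any order.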

This type of counting becomes especially handy when there are multiple values to keep track of, or when the same value needs to be incremented at multiple places in the parallel program.
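
As a sketch of that pattern (the file path "data.txt" and the length threshold are made up for illustration), two counters can be maintained in a single pass over the data, each incremented in a different branch:

val blankLines = sc.accumulator(0)
val shortLines = sc.accumulator(0)

// One pass over the data; each branch updates a different counter.
sc.textFile("data.txt").foreach { line =>
  if (line.trim.isEmpty) blankLines += 1
  else if (line.length < 10) shortLines += 1
}

println(s"blank lines: ${blankLines.value}, short lines: ${shortLines.value}")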

Reference:
http://spark.apache.org/docs/latest/programming-guide.html#accumulators
