Saturday 27 December 2014

map vs flatMap vs reduce in Spark

1. map

The map transformation takes a function and applies it to each element of the RDD; the result of the function becomes the corresponding element of the resulting RDD. The return type of the function passed to map does not have to be the same as the input type.

val input = sc.parallelize(List(1, 2, 3, 4))
val result = input.map(x => x * x)
println(result.collect().mkString(","))  // prints 1,4,9,16
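
To illustrate the point about the return type, here is a minimal sketch that reuses the input RDD from above and maps integers to strings (the labels name is just for illustration):

// map from Int to String: the resulting RDD's element type differs from the input's
val labels = input.map(x => "value-" + x)
println(labels.collect().mkString(","))  // prints value-1,value-2,value-3,value-4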


2. flatMap
  • flatMap can produce zero or more output elements for each input element. 
  • Like with map, the function we provide to flatMap is called individually for each element in our input RDD. 
  • Instead of returning a single element, we return an iterator with our return values. 
  • Rather than producing an RDD of iterators, we get back an RDD which consists of the elements from all of the iterators.
  • Often used to split lines of text into words, as in the example below.

val lines = sc.parallelize(List("hello world", "hi"))
val words = lines.flatMap(line => line.split(" "))
words.first()  // returns "hello"
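
To see why flatMap rather than map is the right tool here, a small sketch contrasting the two on the same lines RDD (the arrays and flattened names are just for illustration):

// map keeps exactly one output per input, so we get an RDD of arrays
val arrays = lines.map(line => line.split(" "))         // RDD[Array[String]]
// flatMap flattens those arrays into individual words
val flattened = lines.flatMap(line => line.split(" "))  // RDD[String]
println(arrays.count())     // 2 -- one array per line
println(flattened.count())  // 3 -- "hello", "world", "hi"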


3. reduce

reduce takes a function that operates on two elements of the RDD's element type and returns a new element of that same type. Unlike map, reduce requires that the result be of the same type as the elements of the RDD we are operating over.

val rdd = sc.parallelize(List(1, 2, 3, 4))
val sum = rdd.reduce((x, y) => x + y)  // 1 + 2 + 3 + 4 = 10
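
Summing is not the only option; as a further sketch under the same-type constraint, any function of two elements that returns the element type works, for example taking the maximum (the max name is just for illustration):

// the result type (Int) must match the RDD's element type, as reduce requires
val max = rdd.reduce((x, y) => math.max(x, y))  // 4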



