Saturday 27 September 2014

Programming with RDDs in Spark

RDD (Resilient Distributed Dataset)

An RDD is laid out across the cluster of machines as a collection of partitions, each including a subset of the data. Partitions define the unit of parallelism in Spark. The framework processes the objects within a partition in sequence, and processes multiple partitions in parallel.
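For example, a minimal sketch in the Scala shell (assuming a SparkContext named sc) that distributes a local collection across four partitions and checks the layout:

val rdd = sc.parallelize(1 to 100, 4)                       // ask for 4 partitions
println(rdd.partitions.size)                                // 4
println(rdd.glom().map(_.length).collect().mkString(","))   // 25,25,25,25 -- elements per partition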

In Spark, all work is expressed as
  • either creating new RDDs, 
  • transforming existing RDDs, 
  • or calling actions on RDDs to compute a result.

1. RDD creation

You can create a new RDD either by (1) loading an external dataset with sc.textFile(), or (2) parallelizing an in-memory collection with the sc.parallelize() method.

The act of creating an RDD does not cause any distributed computation to take place on the cluster. Rather, RDDs define logical datasets that are intermediate steps in a computation. Distributed computation occurs only when an action is invoked on an RDD.
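As a minimal sketch of both approaches (assuming a SparkContext named sc, as in the spark-shell; "README.md" is just a placeholder path):

val fromFile       = sc.textFile("README.md")          // (1) load an external dataset
val fromCollection = sc.parallelize(List(1, 2, 3, 4))  // (2) parallelize an in-memory collection

// Nothing has been computed yet; an action such as count() triggers the work.
println(fromCollection.count())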

2. RDD operation

Transformations: operations on RDDs that return a new RDD. They are computed lazily, only when you use them in an action.

Actions: operations that return a result to the driver program or write it to storage, and that kick off a computation. Each time we call a new action, the entire RDD is computed from scratch unless it has been persisted.
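A small sketch of the difference (again assuming a SparkContext named sc):

val nums  = sc.parallelize(1 to 1000)
val evens = nums.filter(_ % 2 == 0)   // transformation: returns a new RDD, nothing runs yet
println(evens.count())                // action: 500, triggers the computation
println(evens.first())                // action: 2, re-runs the filter since evens is not persisted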

Lineage graph: Spark keeps track of the dependencies between RDDs created by transformations. It uses this information to compute each RDD on demand and to recover lost data if part of a persisted RDD is lost.

Lazy evaluation: Spark uses lazy evaluation to reduce the number of passes over the data and to group operations together. When we call a transformation, the operation is not performed immediately; Spark records metadata indicating that the operation has been requested. In effect, each RDD consists of instructions on how to compute its data. Loading data into an RDD is also lazily evaluated.
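For example (a sketch; "events.log" is a hypothetical path), nothing is read from the file until the action at the end runs:

val lines  = sc.textFile("events.log")                     // no data is read here
val errors = lines.filter(line => line.contains("ERROR"))  // still nothing is read
println(errors.count())  // only now is the file read, filtered, and counted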

Persistence: To avoid computing an RDD multiple times, we can ask Spark to persist it; the nodes that compute the RDD then store their partitions. If a node fails, Spark recomputes the lost partitions of the data.
In Scala and Java, the default persist() stores the data in the JVM heap as unserialized objects. Data written to disk is always serialized. The persist() call on its own does not force evaluation.


import org.apache.spark.storage.StorageLevel

val input = sc.parallelize(List(1, 2, 3, 4))   // any existing RDD
val result = input.map(x => x * x)
result.persist(StorageLevel.MEMORY_ONLY)       // mark for caching; nothing is evaluated yet
println(result.count())                        // first action computes result and caches it
println(result.collect().mkString(","))        // reuses the cached partitions



RDDs track lineage information to rebuild lost data:

(file.map(lambda rec: (rec.type, 1))
     .reduceByKey(lambda x, y: x + y)
     .filter(lambda pair: pair[1] > 10))  # pair is (type, count)


RDD Internals

Externally, an RDD is a distributed immutable collection of objects. Internally, it consists of the following five parts:
  • Set of partitions (rdd.getPartitions)
  • List of dependencies on parent RDDs (rdd.dependencies)
  • Function to compute a partition, given its parents
  • Partitioner (optional) (rdd.partitioner)
  • Preferred location of each partition (optional) (rdd.preferredLocations)


The first three are needed to recompute an RDD in case its data is lost. Together they are called the RDD's lineage. The last two parts are optimizations.

The set of partitions is how the data is divided across nodes. In the case of HDFS, partitions correspond to InputSplits, which are mostly the same as blocks.
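These parts can be inspected from the Scala shell. A minimal sketch (assuming a SparkContext named sc):

val pairs = sc.parallelize(1 to 100).map(x => (x % 10, x)).reduceByKey(_ + _)

println(pairs.partitions.size)                          // set of partitions
println(pairs.dependencies)                             // here a ShuffleDependency on the parent RDD
println(pairs.partitioner)                              // Some(HashPartitioner) after reduceByKey
println(pairs.preferredLocations(pairs.partitions(0)))  // usually empty for a parallelized collection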
