Monday 29 December 2014

Create and Read Cache in Spark

1. Create a cache


1. After rdd.cache() is called, the RDD becomes a persistRDD and its storageLevel is set to MEMORY_ONLY.

2. When rdd.iterator() is called to compute a partition, CacheManager first obtains a blockId for that partition.

3. Check in BlockManager whether this partition has been checkpointed. If so, it means the task has been computed before, so the records of that partition can be read back directly from the checkpoint.

4. Otherwise, compute the partition, put all of its records into an Elements structure, and have BlockManager cache it. BlockManager stores the Elements in a LinkedHashMap[BlockId, Entry], which is managed by MemoryStore. A simplified sketch of this flow follows the list.
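
The flow above can be illustrated with a small, self-contained Scala sketch. This is not Spark's actual code: BlockId, MemoryStore, and CacheManagerSketch are simplified stand-ins that only mirror the lookup-then-compute logic described in steps 2-4.

    // A minimal sketch (not Spark's real code) of the cache-then-compute flow.
    import scala.collection.mutable

    case class BlockId(id: String)

    class MemoryStore {
      // The real BlockManager keeps cached partitions in a
      // LinkedHashMap[BlockId, Entry]; an Array stands in for Entry here.
      private val entries = new mutable.LinkedHashMap[BlockId, Array[Any]]()
      def get(blockId: BlockId): Option[Array[Any]] = entries.get(blockId)
      def put(blockId: BlockId, values: Array[Any]): Unit = entries(blockId) = values
    }

    class CacheManagerSketch(memoryStore: MemoryStore) {
      def getOrCompute(rddId: Int, partition: Int,
                       compute: () => Iterator[Any]): Iterator[Any] = {
        val blockId = BlockId(s"rdd_${rddId}_$partition") // step 2: derive the blockId
        memoryStore.get(blockId) match {
          case Some(cached) => cached.iterator            // step 3: computed before, reuse it
          case None =>
            val elements = compute().toArray              // step 4: materialize the records
            memoryStore.put(blockId, elements)            // cache them via the store
            elements.iterator
        }
      }
    }

In user code none of this machinery is visible: calling rdd.cache() and then an action such as rdd.count() is enough to trigger the whole flow.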


2. Read from cache



1. When computing a partition of an RDD, first check with BlockManager whether the partition is cached. If it is cached locally, call blockManager.getLocal() to read it from the local memoryStore; otherwise, call blockManager.getRemote() to read it from other nodes.

2. After a partition is cached on a node, that node's blockManager reports this information to the blockLocations map inside the driver's blockManagerMasterActor. When the RDD is needed again, blockManagerMaster.getLocations(blockId) is called to look up the partition's location; the driver consults blockLocations and sends the information to the task.

3. Once the task gets the cached partition's location information, it sends a GetBlock(blockId) request through connectionManager to the node holding the cached partition. The target node reads the partition from its blockManager's memoryStore and sends it back to the task. A simplified sketch of this read path follows the list.
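
Here is a minimal sketch of the read path, again with simplified stand-ins rather than Spark's real API: the local store is tried first, and only on a miss are the locations reported to the driver used to fetch the block from a remote node.

    // A minimal sketch (not Spark's real API) of the read path:
    // try the local memoryStore first, then fall back to remote nodes
    // listed in the driver's blockLocations map.
    case class BlockId(id: String)

    trait BlockStore {
      def getLocal(blockId: BlockId): Option[Iterator[Any]]
      def getRemote(blockId: BlockId, host: String): Option[Iterator[Any]]
    }

    // blockLocations mimics the map kept by the driver's
    // blockManagerMasterActor: blockId -> hosts caching the partition.
    def readCached(blockId: BlockId,
                   store: BlockStore,
                   blockLocations: Map[BlockId, Seq[String]]): Option[Iterator[Any]] =
      store.getLocal(blockId).orElse {
        // Not cached locally: equivalent to sending GetBlock(blockId)
        // through connectionManager to a node that holds the block.
        blockLocations.getOrElse(blockId, Seq.empty).iterator
          .map(host => store.getRemote(blockId, host))
          .collectFirst { case Some(it) => it }
      }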

Reference:
The contents of this article are from https://github.com/JerryLead/SparkInternals
