Each job consists of multiple stages, and the last stage produces the job's results.
DAGScheduler first submits the stages that have no parent stages, determining the number of tasks for each and submitting those tasks; stages that do have parent stages are submitted afterwards, once their parents have finished.
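As a concrete illustration (a minimal runnable example, assuming a local Spark setup), the word count job below has two stages once collect() is called: the map-side stage has no parent and is submitted first, and the final result stage produces the counts returned to the driver.

import org.apache.spark.{SparkConf, SparkContext}

object WordCountStages {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("wordcount").setMaster("local[2]"))

    val lines = sc.parallelize(Seq("a b a", "b c"))
    // Stage 0 (ShuffleMapStage, no parent stage): flatMap and map run before the shuffle.
    val pairs = lines.flatMap(_.split(" ")).map(word => (word, 1))
    // reduceByKey introduces a ShuffleDependency, so the DAG is cut into a new stage here.
    val counts = pairs.reduceByKey(_ + _)
    // collect() is the action: it triggers one job whose last stage (the result stage)
    // runs ResultTasks and sends the counts back to the driver.
    println(counts.collect().toSeq)

    sc.stop()
  }
}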
1. Job Submission
1. Generate the job's logical plan
Driver program calls transformation() to create the computing chain (a series of RDDs).
The compute() of each RDD defines how to compute its partitions.
getDependencies() defines the relationships between the partitions of an RDD and those of its parent RDDs (the resulting lineage can be printed with toDebugString; see the sketch after this list).
2. Generate the job's physical plan
Each action() triggers a job.
Create stages in DAGScheduler.runJob().
Determine whether the stage's tasks are ShuffleMapTasks or ResultTasks in submitStage(), then pack the tasks into a TaskSet and hand it to taskScheduler.
3. Allocate Tasks
After sparkDeploySchedulerBackend receives the TaskSet, DriverActor sends each serializedTask to the CoarseGrainedExecutorBackend actor on a worker node.
4. Receive Tasks
Executor wraps the task in a TaskRunner, then runs it on an idle thread from its thread pool.
Each CoarseGrainedExecutorBackend holds exactly one Executor object.
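One convenient way to see the logical plan from step 1 is RDD.toDebugString, which prints the lineage; the indentation in its output marks shuffle dependencies, which are exactly where DAGScheduler cuts stage boundaries. The snippet below assumes an existing SparkContext sc (e.g., the spark-shell) and an example input path.

val counts = sc.textFile("input.txt")   // example path, adjust as needed
  .flatMap(_.split(" "))
  .map(word => (word, 1))
  .reduceByKey(_ + _)

// Prints the RDD chain (the logical plan); the ShuffledRDD at the top and the
// indentation below it show where the shuffle, and hence the stage boundary, is.
println(counts.toDebugString)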
2. Task Execution
1. After the executor gets the serialized task, it deserializes it and runs the task to produce a directResult.
2. If the result is small, it is sent back to the driver directly. Otherwise, the result is stored in local memory and disk managed by BlockManager, and only an indirectResult (a reference to the stored block) is sent to the driver; the driver fetches the actual data through HTTP when it is needed (see the simplified size-check sketch after this list).
3. After the driver gets a task result, it tells taskScheduler that the task is finished and analyzes the result.
4. If the result comes from a ResultTask, resultHandler is called to compute the final value in the driver, e.g. for count() (see the count() sketch after this list).
If the result is the MapStatus of a ShuffleMapTask, it is stored in mapOutputTrackerMaster so reducers can look up the map output locations during the shuffle.
5. If the task is the last one in its stage, the next stage is submitted.
If the stage is the last one in the job, DAGScheduler is told that the job is finished.
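The size check in step 2 can be pictured with the simplified sketch below; DirectResult, IndirectResult, maxDirectResultSize and blockStore are illustrative names only, not Spark's actual classes or configuration.

import java.nio.ByteBuffer
import scala.collection.mutable

sealed trait TaskResult
case class DirectResult(bytes: ByteBuffer) extends TaskResult            // shipped to the driver as-is
case class IndirectResult(blockId: String, size: Int) extends TaskResult // driver fetches the block later

object ResultSketch {
  val maxDirectResultSize = 1 << 20                        // assumed 1 MB threshold, for illustration
  val blockStore = mutable.Map.empty[String, ByteBuffer]   // stand-in for the BlockManager

  def packResult(taskId: Long, serialized: ByteBuffer): TaskResult =
    if (serialized.limit() <= maxDirectResultSize) {
      DirectResult(serialized)                  // small result: send the bytes back directly
    } else {
      val blockId = s"taskresult_$taskId"
      blockStore(blockId) = serialized          // large result: keep it local, return only a reference
      IndirectResult(blockId, serialized.limit())
    }
}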
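Step 4's resultHandler can likewise be pictured with a toy version of count(): each ResultTask returns the number of records in its partition, and the driver-side handler sums the per-task counts (a sketch of the idea, not Spark's implementation).

object CountSketch {
  def main(args: Array[String]): Unit = {
    // Pretend each inner Seq is one partition processed by one ResultTask.
    val partitions: Seq[Seq[String]] = Seq(Seq("a", "b"), Seq("c"), Seq("d", "e", "f"))

    var total = 0L
    // Analogue of the driver-side resultHandler: called once per finished ResultTask.
    def resultHandler(taskIndex: Int, perTaskCount: Long): Unit = total += perTaskCount

    partitions.zipWithIndex.foreach { case (partition, i) =>
      resultHandler(i, partition.size.toLong)   // the "task result": record count of this partition
    }
    println(s"count() = $total")                // 6
  }
}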
Reference:
The contents of this article are from https://github.com/JerryLead/SparkInternals