1. Stragglers due to slow nodes.
Some tasks run much slower than the rest of their stage. They are easy to identify in the Summary Metrics table of the Web UI, where the maximum task duration sits far above the median.
Set --conf "spark.speculation=true" to mitigate this problem.
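The idea behind speculation can be sketched in plain Python. This is a simplified model, not Spark's scheduler code: it flags any task whose duration exceeds a multiple of the stage's median duration. The 1.5 factor is an assumption modeled on spark.speculation.multiplier (the actual default depends on the Spark version), and the function name and sample durations are made up for illustration.

```python
from statistics import median


def find_stragglers(durations_s, multiplier=1.5):
    """Return indices of tasks slower than multiplier x the median duration."""
    threshold = multiplier * median(durations_s)
    return [i for i, d in enumerate(durations_s) if d > threshold]


# Hypothetical task durations (seconds) for one stage.
durations = [12.0, 11.5, 13.2, 12.8, 95.0, 12.1]
stragglers = find_stragglers(durations)
print(stragglers)  # [4]
```

With speculation enabled, Spark would launch a second copy of a task like number 4 on another executor and keep whichever copy finishes first.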
2. Stragglers due to data skew
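Pick a different algorithm or restructure the data so that records are spread more evenly across partitions, for example by partitioning on a better-distributed key.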
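One way to restructure skewed data is key salting: append a small random suffix to each key, aggregate per salted key, then merge the partial results. The sketch below shows the idea in plain Python without Spark. The function name, the number of salts, and the sample keys are all hypothetical.

```python
import random
from collections import Counter


def salted_count(keys, num_salts=4, seed=0):
    """Count keys in two stages so no single bucket holds a whole hot key."""
    rng = random.Random(seed)
    # Stage 1: attach a random salt so a hot key spreads over num_salts buckets.
    partial = Counter((k, rng.randrange(num_salts)) for k in keys)
    # Stage 2: drop the salt and merge the partial counts per original key.
    total = Counter()
    for (k, _salt), n in partial.items():
        total[k] += n
    return partial, total


# Hypothetical skewed input: one key dominates.
keys = ["hot"] * 1000 + ["a", "b", "c"]
partial, total = salted_count(keys)
print(total["hot"])           # 1000
print(max(partial.values()))  # largest single bucket, well below 1000
```

The final counts match an unsalted count, but the work for the hot key is split across several buckets, which in Spark would map to several tasks instead of one.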
3. Tasks are slow due to garbage collection
Look at the "GC Time" column of the Tasks table in the Web UI.
Set "spark.executor.extraJavaOptions" to include: "-XX:+PrintGCDetails -XX:+PrintGCTimeStamps"
Then look at spark/work/app_id/[n]/stdout on the executors to read the GC log.
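To decide which tasks deserve a closer look, compare each task's GC time with its total duration. The sketch below does that for values copied by hand from the Tasks table. The 10% threshold is an assumed rule of thumb, and the field names and sample numbers are hypothetical.

```python
def gc_heavy_tasks(tasks, max_gc_fraction=0.10):
    """Return ids of tasks that spend more than max_gc_fraction of their time in GC."""
    return [
        t["id"]
        for t in tasks
        if t["duration_ms"] > 0 and t["gc_ms"] / t["duration_ms"] > max_gc_fraction
    ]


# Hypothetical rows from the Tasks table (milliseconds).
tasks = [
    {"id": 0, "duration_ms": 4000, "gc_ms": 120},
    {"id": 1, "duration_ms": 4200, "gc_ms": 1900},
    {"id": 2, "duration_ms": 3900, "gc_ms": 80},
]
print(gc_heavy_tasks(tasks))  # [1]
```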
4. Too many open files
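Hash-based shuffle: each map task generates a separate file for every reduce partition.
Sort-based shuffle: each map task generates a single file containing the records of all partitions.
conf.set("spark.shuffle.manager", "sort")
Sort-based shuffle has been the default since Spark 1.2, so this setting only matters on older releases.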
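A quick calculation shows why the shuffle implementation matters for open file counts. The sketch below is a rough model only: it assumes one file per map task per reduce partition for hash-based shuffle, and one data file plus one index file per map task for sort-based shuffle, ignoring optimizations such as file consolidation. The function names and task counts are hypothetical.

```python
def hash_shuffle_files(map_tasks, reduce_partitions):
    """One file per (map task, reduce partition) pair."""
    return map_tasks * reduce_partitions


def sort_shuffle_files(map_tasks):
    """One data file plus one index file per map task."""
    return 2 * map_tasks


# Hypothetical job: 1000 map tasks shuffling into 1000 reduce partitions.
print(hash_shuffle_files(1000, 1000))  # 1000000
print(sort_shuffle_files(1000))        # 2000
```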
5. Number of Tasks/partitions
Partition size has a large effect on how fast the stages of a Spark job execute. Partition size and task count are inversely related: the larger the partitions, the fewer the tasks. Spark performs best when the partition handed to each task falls in a middle range. If partitions are too small, the job pays a disproportionate cost in scheduling overhead. If partitions are too large, task execution slows down due to GC pressure and spilling to disk.
--conf "spark.sql.shuffle.partitions=20"
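A starting value for the partition count can be estimated from the amount of data being shuffled. The sketch below divides the total size by a target partition size and rounds up. The 128 MiB target is an assumed rule of thumb, not a Spark default, and the function name is hypothetical. Measure and adjust for the actual workload.

```python
import math


def suggest_partitions(total_bytes, target_partition_bytes=128 * 1024**2):
    """Estimate a partition count so each partition is about the target size."""
    return max(1, math.ceil(total_bytes / target_partition_bytes))


# Hypothetical shuffle of 50 GiB with the assumed 128 MiB target.
print(suggest_partitions(50 * 1024**3))  # 400
```

The result would then be passed as the value of spark.sql.shuffle.partitions, which applies to shuffles in Spark SQL and DataFrame operations.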