Friday, 11 September 2015

Some RDD methods are inefficient for large datasets

1. Avoid copying a large RDD to the driver node

If your RDD is so large that all of its elements won't fit in memory on the driver machine, don't call actions that copy everything back to the driver, such as the following (a sketch of alternatives follows the list):

  • countByKey
  • countByValue
  • collectAsMap
  • collect
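A rough sketch of alternatives (the SparkContext sc and the input/output paths are made up for illustration): inspect a bounded sample on the driver, or keep the computation distributed and write the result out instead of collecting it.

  // Instead of pulling the whole RDD onto the driver, bound or distribute the work.
  val rdd = sc.textFile("hdfs:///data/large-input.txt")   // hypothetical path
    .map(line => (line.split("\t")(0), 1L))

  // Instead of collect / collectAsMap, look at a small, bounded sample:
  val preview = rdd.take(10)                   // only 10 elements reach the driver
  val sample  = rdd.takeSample(false, 100)     // bounded random sample

  // Instead of countByKey / countByValue over a huge key space,
  // keep the counting on the executors and save the result:
  val counts = rdd.reduceByKey(_ + _)
  counts.saveAsTextFile("hdfs:///data/key-counts")   // hypothetical output path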

2. Avoid groupByKey

When calling groupByKey, all the key-value pairs are shuffled around the cluster. This is a lot of unnecessary data to transfer over the network.
Spark spills data to disk when there is more data shuffled onto a single executor machine than can fit in memory. However, it flushes the data to disk one key at a time, so if a single key has more key-value pairs than can fit in memory, an out-of-memory exception occurs.
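A common replacement is reduceByKey, which combines the values for each key on every partition before the shuffle, so only one partial result per key per partition crosses the network. A minimal word-count sketch, assuming an existing SparkContext sc and a made-up input path:

  val words = sc.textFile("hdfs:///data/words.txt")
    .flatMap(_.split(" "))
    .map(word => (word, 1))

  // groupByKey ships every (word, 1) pair across the network,
  // then sums on the receiving executor:
  val countsWithGroup = words.groupByKey().mapValues(_.sum)

  // reduceByKey sums within each partition first (map-side combine),
  // so far less data is shuffled for the same result:
  val countsWithReduce = words.reduceByKey(_ + _)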
Here are more functions to prefer over groupByKey (see the sketches after this list):
  • combineByKey can be used when you are combining elements but your return type differs from your input value type.
  • foldByKey merges the values for each key using an associative function and a neutral "zero value".
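These sketches use a made-up pair RDD of (String, Double) scores and an existing SparkContext sc:

  val scores = sc.parallelize(Seq(("a", 1.0), ("a", 3.0), ("b", 2.0)))

  // combineByKey: per-key average, where the combined type (sum, count)
  // differs from the input value type Double.
  val averages = scores
    .combineByKey(
      (v: Double) => (v, 1L),                                              // createCombiner
      (acc: (Double, Long), v: Double) => (acc._1 + v, acc._2 + 1L),       // mergeValue
      (a: (Double, Long), b: (Double, Long)) => (a._1 + b._1, a._2 + b._2) // mergeCombiners
    )
    .mapValues { case (sum, count) => sum / count }

  // foldByKey: associative merge with a neutral "zero value".
  val totals = scores.foldByKey(0.0)(_ + _)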



Reference:
http://databricks.gitbooks.io/databricks-spark-knowledge-base/content/best_practices/prefer_reducebykey_over_groupbykey.html
http://databricks.gitbooks.io/databricks-spark-knowledge-base/content/best_practices/dont_call_collect_on_a_very_large_rdd.html
