If your RDD is so large that all of its elements won't fit in memory on the driver machine, don't call:
countByKey
countByValue
collectAsMap
collect
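All of these actions pull the entire contents of the RDD back to the driver. Instead, aggregate on the cluster or pull back only a bounded piece. Here is a minimal sketch of the alternatives; the application name and HDFS paths are placeholders, not from the original post:

```scala
import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("driver-safe-actions"))

// Hypothetical large dataset; the path is a placeholder.
val events = sc.textFile("hdfs:///data/events")

// Aggregate on the cluster; only a single Long comes back to the driver.
val total = events.count()

// Inspect a bounded sample instead of the whole RDD.
val preview = events.take(10)

// If the full result is needed, write it out in parallel rather than collecting it.
events.saveAsTextFile("hdfs:///data/events-copy")

println(s"total=$total, sample=${preview.mkString(", ")}")
```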
2. Avoid copying a large RDD to the driver node
When calling groupByKey, all of the key-value pairs are shuffled around. This is a lot of unnecessary data to transfer over the network. Spark spills data to disk when there is more data shuffled onto a single executor machine than can fit in memory. However, it flushes the data to disk one key at a time, so if a single key has more key-value pairs than can fit in memory, an out-of-memory exception occurs.
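The classic word count illustrates the difference. This is a minimal sketch, assuming an existing SparkContext named sc and toy data: reduceByKey combines values within each partition before the shuffle, while groupByKey ships every individual pair across the network first.

```scala
// Toy data; in practice this would be a much larger RDD.
val words = sc.parallelize(Seq("a", "b", "a", "c", "b", "a")).map(w => (w, 1))

// Prefer this: partial sums are computed map-side, so only one
// (word, partialCount) pair per partition is shuffled.
val countsWithReduce = words.reduceByKey(_ + _)

// Avoid this: every (word, 1) pair for a given word is shuffled to a single
// executor before being summed.
val countsWithGroup = words.groupByKey().mapValues(_.sum)

countsWithReduce.collect().foreach(println)  // collect is fine here: the result is tiny
```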
Here are more functions to prefer over groupByKey:
combineByKey can be used when you are combining elements, but your return type differs from your input value type.
foldByKey merges the values for each key using an associative function and a neutral "zero value".
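A rough sketch of both, again assuming an existing SparkContext sc and made-up score data (the per-key average is just an illustrative use case):

```scala
val scores = sc.parallelize(Seq(("alice", 80), ("bob", 95), ("alice", 90), ("bob", 70)))

// combineByKey: input values are Ints, but the combined type is (sum, count),
// so the return type differs from the input value type.
val averages = scores
  .combineByKey(
    (v: Int) => (v, 1),                                         // createCombiner
    (acc: (Int, Int), v: Int) => (acc._1 + v, acc._2 + 1),      // mergeValue (within a partition)
    (a: (Int, Int), b: (Int, Int)) => (a._1 + b._1, a._2 + b._2) // mergeCombiners (across partitions)
  )
  .mapValues { case (sum, count) => sum.toDouble / count }

// foldByKey: merge the values for each key with an associative function
// and a neutral "zero value" (0 for addition).
val totals = scores.foldByKey(0)(_ + _)
```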
Reference:
http://databricks.gitbooks.io/databricks-spark-knowledge-base/content/best_practices/prefer_reducebykey_over_groupbykey.html
http://databricks.gitbooks.io/databricks-spark-knowledge-base/content/best_practices/dont_call_collect_on_a_very_large_rdd.html