One way to persist an RDD is to cache all or part of its data in the JVM heap. Spark's executors divide the JVM heap into two fractions: one fraction stores data that the Spark application has explicitly cached in memory; the remaining fraction is working heap space used for object allocation during RDD transformations. The ratio between these two fractions is controlled by the spark.storage.memoryFraction parameter, which lets Spark keep the total size of cached RDDs below the heap size multiplied by this parameter's value. Any unused portion of the RDD cache fraction is still available to the JVM, so GC analysis for a Spark application should cover memory usage in both fractions.

When a drop in performance caused by GC latency is observed, the first thing to check is whether the Spark application is using its limited memory space effectively. The less memory the cached RDDs occupy, the more heap space is left for program execution, which improves GC efficiency; conversely, excessive memory consumption by RDDs causes a significant performance loss because a large number of cached objects accumulate in the old generation.
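As a rough illustration, the Scala sketch below lowers the cache fraction and persists an RDD on the heap. The application name, the input path, and the 0.4 value are placeholders, and spark.storage.memoryFraction (default 0.6) applies to Spark's older static memory manager; newer releases replace it with unified memory management.

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.storage.StorageLevel

    object CacheFractionExample {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf()
          .setAppName("cache-fraction-example")           // placeholder name
          .set("spark.storage.memoryFraction", "0.4")     // shrink the RDD cache fraction (default 0.6)

        val sc = new SparkContext(conf)

        // Persist an RDD on the JVM heap as deserialized objects.
        val lines = sc.textFile("hdfs:///path/to/input")  // placeholder path
        lines.persist(StorageLevel.MEMORY_ONLY)

        println(lines.count())                            // first action materializes the cache
        sc.stop()
      }
    }

Lowering the fraction leaves more heap for task execution, which is the trade-off discussed above: less room for cached RDDs, but less GC pressure during transformations.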
When GC is observed to be too frequent or too long-running, it may indicate that the Spark process or application is not using memory efficiently. You can improve performance by explicitly unpersisting cached RDDs once they are no longer needed.
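For example, the following sketch (again with a placeholder input path) caches an RDD while several actions reuse it, then releases it so the cached blocks do not linger in the old generation:

    // sc is an existing SparkContext
    val parsed = sc.textFile("hdfs:///path/to/input").map(_.split(","))
    parsed.cache()                  // keep it in memory while several actions reuse it

    val rowCount = parsed.count()
    val width    = parsed.first().length

    parsed.unpersist()              // explicitly free the cached blocks once reuse is over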
Reference
https://databricks.com/blog/2015/05/28/tuning-java-garbage-collection-for-spark-applications.html