Wednesday 14 September 2016

Distributed Logging in Spark App



Be careful not to accidentally close over objects instantiated in your driver program, such as the log object below. When Spark ships the closure to the executors, it tries to serialize everything the closure references, including the logger.


import org.apache.log4j.Logger

// The logger is created in the driver program and captured by the closure below.
val logger = Logger.getLogger(getClass)

dstream.foreachRDD { rdd =>
  rdd.foreachPartition { records =>
    // Spark must serialize `logger` to ship this task, and log4j Loggers are not serializable.
    logger.info("message")
  }
}

You will see an error like this:

Exception in thread "main" org.apache.spark.SparkException: Task not serializable
Caused by: java.io.NotSerializableException: org.apache.log4j.Logger


The usual solution to this type of problem is to instantiate the objects you need inside the functions that run on the executors. Alternatively, you can define a factory object from which you create your log object. Because the field is marked @transient lazy val, it is never serialized with the closure; each executor re-creates the logger the first time it is accessed.
import org.apache.log4j.Logger

object Holder extends Serializable {
  @transient lazy val log = Logger.getLogger(getClass.getName)
}


sc.parallelize(List(1, 2, 3)).foreach { element =>
  Holder.log.info(element)
}
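
Applied back to the streaming example, a minimal sketch could look like this (it reuses the Holder object above and assumes the same dstream as in the first snippet; the log messages are just placeholders):

dstream.foreachRDD { rdd =>
  rdd.foreachPartition { records =>
    // Holder.log is initialized lazily on the executor, so nothing is serialized from the driver.
    Holder.log.info("processing a partition")
    records.foreach(record => Holder.log.info(record.toString))
  }
}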

Or use slf4j's LoggerFactory and instantiate the logger inside the partition function, so it is created directly on the executor:

import org.slf4j.LoggerFactory

dstream.foreachRDD { rdd =>
  rdd.foreachPartition { records =>
    // The logger is created per partition on the executor,
    // so nothing has to be serialized from the driver.
    val logger = LoggerFactory.getLogger(getClass)
    logger.info("message")
  }
}
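
If you log from many classes, a common variation (not from the referenced answer, just a sketch) is to put the @transient lazy val into a small trait and mix it into the classes whose instances get shipped to executors. The trait name LazySparkLogging and the RecordProcessor class below are made up for illustration:

import org.slf4j.{Logger, LoggerFactory}

// The logger field is @transient, so it is skipped during serialization,
// and lazy, so each executor builds its own instance on first use.
trait LazySparkLogging extends Serializable {
  @transient lazy val logger: Logger = LoggerFactory.getLogger(getClass)
}

class RecordProcessor extends LazySparkLogging {
  def process(record: String): Unit = logger.info(s"processing $record")
}

val processor = new RecordProcessor   // created in the driver
sc.parallelize(Seq("a", "b", "c")).foreach(processor.process)   // ships fine: the logger is transient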



Reference:
http://stackoverflow.com/questions/29208844/apache-spark-logging-within-scala
