Sunday 15 February 2015

Spark Streaming and DStream


1. Streaming

Spark Streaming provides a micro-batch based streaming model that brings the same rich and expressive capabilities of Spark to streaming data. Spark Streaming consists of two processes:
  • Fetching data; done by the streaming consumer (in our case, the Kafka consumer)
  • Processing the data; done by Spark
These two processes are connected by the timely delivery of collected data blocks from Spark Streaming to Spark.
At any point in time, Spark is processing the previous batch of data, while the Streaming Consumers are collecting data for the current interval. Once a batch has been processed by Spark, it can be cleaned up.
The time to process the data of a batch interval must be less than the batch interval itself; otherwise batches queue up faster than they can be processed and the application falls further and further behind.
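As a concrete sketch of this two-process setup, here is a minimal Spark Streaming program in Scala that wires a Kafka consumer to a StreamingContext with a 10-second batch interval. This is a hypothetical example, not code from this post: the application name, ZooKeeper quorum, consumer group and topic are placeholders, and it assumes Spark 1.x with the spark-streaming-kafka artifact on the classpath.

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.kafka.KafkaUtils

    val conf = new SparkConf().setAppName("streaming-example")
    // The batch interval: every 10 seconds the collected data blocks
    // are handed over to Spark as one batch.
    val ssc = new StreamingContext(conf, Seconds(10))

    // The fetching process: a Kafka receiver collecting data for the
    // current interval (ZooKeeper quorum, group and topic are placeholders).
    val kafkaStream = KafkaUtils.createStream(
      ssc, "zkhost:2181", "example-group", Map("example-topic" -> 1))

    // The processing process: Spark computes over the previous batch.
    kafkaStream.map(_._2).count().print()

    ssc.start()
    ssc.awaitTermination()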





2. DStream

A DStream is a collection of RDDs over time. Each RDD is composed of partitions of data distributed across the cluster of Spark workers.
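To make this "RDDs over time" view tangible, here is a small sketch, assuming an existing DStream[String] named lines (a hypothetical name): foreachRDD exposes the one RDD generated for each batch interval, together with its timestamp.

    // `lines` is assumed to be an existing DStream[String].
    // Each invocation receives the single RDD produced for that interval.
    lines.foreachRDD { (rdd, time) =>
      println(s"Batch at $time: " +
        s"${rdd.partitions.length} partitions, ${rdd.count()} records")
    }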



[Image: a DStream as a series of RDDs, one produced per batch interval (*)]
In this illustration, each coloured line represents a stream of data. When it is bounded by an RDD, it becomes a partition of that RDD. The RDD is composed of the data collected during each time interval, represented here by the blue box.
DStream.saveAsTextFiles will create a directory of files for the RDD produced at each interval (blue box); each part-file corresponds to the piece of the coloured line bounded by that RDD (the coloured line within the blue box).
In a distributed file system, like HDFS, the file system will abstract out the partitioning, presenting you with a single logical file, i.e. one file per RDD.
When you use the local file system, those part-files will be explicit, as you are seeing.
There's an RDD produced at each interval defined by 'batchDuration' (blue box). The DStream lasts for as long as your streaming program runs (the red line: hours, days, years, ...).
In local mode you still have several cores processing tasks in parallel. Each partition of the RDD will become one part-file. You can use rdd.coalesce(1) to get one file per RDD, but be warned that this prevents any parallelism in the save, so it's only applicable for small files.
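A minimal sketch of that coalesce approach, again assuming a hypothetical DStream[String] named lines and a placeholder output path: each interval's RDD is collapsed to a single partition before saving, so every batch produces exactly one part-file.

    // `lines` is assumed to be an existing DStream[String];
    // the output path below is a placeholder.
    lines.foreachRDD { (rdd, time) =>
      // One partition => one part-file per interval, at the cost of
      // losing parallelism in the save. Only suitable for small batches.
      rdd.coalesce(1).saveAsTextFile(s"/tmp/output-${time.milliseconds}")
    }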

(*) Credits for the image: Spark presentation at Devoxx 2014 by Andy Petrella and Gerard Maas 

Reference:

http://www.virdata.com/tuning-spark/
https://chimpler.wordpress.com/2014/07/01/implementing-a-real-time-data-pipeline-with-spark-streaming/
http://stackoverflow.com/questions/27367814/spark-streaming-saveastextfile-operation-what-are-part0000-files
https://github.com/chimpler/blog-spark-streaming-log-aggregation
