Tuesday 17 February 2015

DStream's sliding window in Spark Streaming

Spark Streaming lets you define “rules” that run over sliding “windows” of stream data and make decisions with the ease of SQL queries.

A Discretized Stream (DStream), the basic abstraction in Spark Streaming, is a continuous sequence of RDDs (of the same type) representing a continuous stream of data. DStreams can either be created from live data (such as data from TCP sockets, Kafka, Flume, etc.) or generated by transforming existing DStreams using operations such as `map`, `window` and `reduceByKeyAndWindow`. While a Spark Streaming program is running, each DStream periodically generates an RDD, either from live data or by transforming the RDD generated by a parent DStream.
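For example, here is a minimal sketch of both creation styles, assuming a hypothetical text source on localhost:9999 and a one-second batch interval:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.StreamingContext._

val conf = new SparkConf().setAppName("WindowExample").setMaster("local[2]")
val ssc = new StreamingContext(conf, Seconds(1))           // 1 s batch interval

// A DStream created from live data (a TCP socket).
val lines = ssc.socketTextStream("localhost", 9999)

// DStreams generated by transforming an existing DStream.
val wordCounts = lines
  .flatMap(_.split(" "))
  .map(word => (word, 1))
  .reduceByKeyAndWindow(_ + _, Seconds(30), Seconds(10))   // 30 s window, 10 s slide

wordCounts.print()
ssc.start()
ssc.awaitTermination()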

Internally, a DStream is characterized by a few basic properties (a minimal sketch follows the list):
- A list of other DStreams that the DStream depends on
- A time interval at which the DStream generates an RDD
- A function that is used to generate an RDD after each time interval
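To make these properties concrete, a bare-bones custom DStream could look like the sketch below; the class ConstantDStream and its constructor are hypothetical, while dependencies, slideDuration and compute are the actual abstract methods of DStream:

import scala.reflect.ClassTag

import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.{Duration, StreamingContext, Time}
import org.apache.spark.streaming.dstream.DStream

// Hypothetical sketch: a DStream that emits the same RDD at every interval.
class ConstantDStream[T: ClassTag](ssc: StreamingContext,
                                   interval: Duration,
                                   rdd: RDD[T]) extends DStream[T](ssc) {

  // 1. The list of other DStreams that this DStream depends on (none here).
  override def dependencies: List[DStream[_]] = Nil

  // 2. The time interval at which this DStream generates an RDD.
  override def slideDuration: Duration = interval

  // 3. The function used to generate an RDD after each time interval.
  override def compute(validTime: Time): Option[RDD[T]] = Some(rdd)
}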

/**
 * Return a new DStream in which each RDD contains all the elements seen in a
 * sliding window of time over this DStream.
 * @param windowDuration width of the window; must be a multiple of this DStream's
 *                       batching interval
 * @param slideDuration  sliding interval of the window (i.e., the interval after
 *                       which the new DStream will generate RDDs); must be a multiple
 *                       of this DStream's batching interval
 */
def window(windowDuration: Duration, slideDuration: Duration): DStream[T] = {
  new WindowedDStream(this, windowDuration, slideDuration)
}

val windowed_dstream = dstream.window(new Duration(sliding_window_length),
  new Duration(sliding_window_interval))

windowed_dstream.foreachRDD(rdd => {
  if (rdd.count > 0) {
    // rules
    ...
  }
})



The new DStream, windowed_dstream, aggregates the incoming stream into batches that contain all data received within the length of the sliding window (the first parameter, in milliseconds). The second parameter specifies how often we want to invoke computation on the collected data (also in milliseconds).

Because the window operation aggregates each window's data into a single batch of the new DStream, each invocation of foreachRDD receives a single RDD containing all the elements of the current window.
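That single RDD per window is what makes the “rules as SQL” idea from the opening paragraph practical. Below is a minimal sketch, assuming a Spark 1.2-era SQLContext, a windowed_dstream of a hypothetical case class Event, and a hypothetical error-counting rule:

import org.apache.spark.sql.SQLContext

case class Event(source: String, level: String, timestamp: Long)

windowed_dstream.foreachRDD(rdd => {
  if (rdd.count > 0) {
    val sqlContext = new SQLContext(rdd.sparkContext)
    import sqlContext.createSchemaRDD  // implicit RDD[Event] -> SchemaRDD

    // Expose the whole window as a table, then express the rule as SQL.
    rdd.registerTempTable("window_events")
    val errorsBySource = sqlContext.sql(
      "SELECT source, COUNT(*) AS error_count FROM window_events " +
      "WHERE level = 'ERROR' GROUP BY source")
    errorsBySource.collect().foreach(println)
  }
})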

DStreams generated by window-based operations are automatically persisted in memory, without the developer calling persist().
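If the default storage level is not appropriate (for example, for very large windows), it can still be overridden explicitly; a one-line sketch reusing windowed_dstream from above:

import org.apache.spark.storage.StorageLevel

// Assumption: keep window data serialized and allow spilling to disk,
// instead of the default deserialized in-memory persistence.
windowed_dstream.persist(StorageLevel.MEMORY_AND_DISK_SER)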

References:
- https://github.com/apache/spark/blob/master/streaming/src/main/scala/org/apache/spark/streaming/dstream/DStream.scala
- http://rishiverma.com/software/blog/complex-event-processing-using-spark-streaming-and-sparksql/
- https://groups.google.com/forum/#!topic/spark-users/GQoxJHAAtX4
