Thursday, 5 March 2015

Storm vs. Spark Streaming

Traditional streaming systems have an event-driven, record-at-a-time processing model:
  • Each node holds mutable state; for each record, it updates that state and sends out new records
  • The state is lost if the node dies, so making stateful stream processing fault-tolerant is challenging
1. Storm
  • Real-time, record-at-a-time streaming computation
  • Replays a record if a node fails to process (acknowledge) it
  • Processes each record at least once
  • May therefore update mutable state twice for the same record!
  • Mutable state can be lost on failure! (a bolt sketch follows this list)
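To make these failure modes concrete, here is a minimal sketch of a word-count bolt in Scala against Storm's 2015-era backtype.storm core API. The bolt and its "word" field are illustrative assumptions, not from the slides; the point is that the counts map is plain in-process mutable state.

import java.util.{Map => JMap}
import scala.collection.mutable
import backtype.storm.task.{OutputCollector, TopologyContext}
import backtype.storm.topology.OutputFieldsDeclarer
import backtype.storm.topology.base.BaseRichBolt
import backtype.storm.tuple.Tuple

class WordCountBolt extends BaseRichBolt {
  private var collector: OutputCollector = _
  private val counts = mutable.Map.empty[String, Long] // in-process mutable state

  override def prepare(conf: JMap[_, _], context: TopologyContext,
                       collector: OutputCollector): Unit = {
    this.collector = collector
  }

  override def execute(tuple: Tuple): Unit = {
    val word = tuple.getStringByField("word") // assumes the spout emits a "word" field
    counts(word) = counts.getOrElse(word, 0L) + 1L
    // Crash here, before the ack, and Storm replays the tuple: the same
    // word is counted twice (at-least-once semantics).
    collector.ack(tuple)
    // Crash at any point and the counts map itself is simply gone.
  }

  override def declareOutputFields(declarer: OutputFieldsDeclarer): Unit = ()
}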
2. Trident – uses transactions to update state
  • Micro-batch streaming computation
  • Processes each record exactly once
  • Per-state transactions to an external database are slow (a sketch follows this list)
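For contrast, a minimal Trident word count in Scala, modeled on the word-count example from the Trident tutorial (2015-era storm.trident packages). The persistentAggregate call is where exactly-once comes from: each micro-batch's counts are committed under a transaction id. MemoryMapState keeps state in memory for testing; swapping in a state factory backed by an external database is exactly where the per-transaction cost noted above shows up.

import backtype.storm.tuple.{Fields, Values}
import storm.trident.TridentTopology
import storm.trident.operation.builtin.Count
import storm.trident.operation.{BaseFunction, TridentCollector}
import storm.trident.testing.{FixedBatchSpout, MemoryMapState}
import storm.trident.tuple.TridentTuple

// Splits a sentence into words, emitting one tuple per word.
class Split extends BaseFunction {
  override def execute(tuple: TridentTuple, collector: TridentCollector): Unit =
    tuple.getString(0).split(" ").foreach(w => collector.emit(new Values(w)))
}

object TridentWordCount {
  def main(args: Array[String]): Unit = {
    // Test spout cycling through two fixed sentences, three tuples per batch.
    val spout = new FixedBatchSpout(new Fields("sentence"), 3,
      new Values("the cow jumped over the moon"),
      new Values("four score and seven years ago"))
    spout.setCycle(true)

    val topology = new TridentTopology()
    topology.newStream("spout1", spout)
      .each(new Fields("sentence"), new Split(), new Fields("word"))
      .groupBy(new Fields("word"))
      // Transactional, exactly-once state update per micro-batch; against a
      // real database this commit is the slow step.
      .persistentAggregate(new MemoryMapState.Factory(), new Count(), new Fields("count"))
    // Submit topology.build() with StormSubmitter, or run it in a LocalCluster.
  }
}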
3. Spark Streaming

Spark Streaming runs a streaming computation as a series of very small, deterministic batch jobs:

  • Chop up the live stream into batches of X seconds
  • Spark treats each batch of data as an RDD and processes it using RDD operations
  • The processed results of the RDD operations are returned in batches
  • Batch sizes can be as small as ½ second, for an end-to-end latency of about 1 second
  • Potential for combining batch processing and stream processing in the same system
  • All intermediate data are RDDs, so they can be recomputed if a worker node dies
  • Exactly-once semantics for all transformations (a word-count sketch follows this list)
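The canonical Spark Streaming word count (as it looked in the 1.x Scala API) makes the micro-batch model visible: the batch interval is set once on the StreamingContext, and everything else is ordinary RDD-style transformations. The host/port and local master are placeholder assumptions.

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object NetworkWordCount {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("NetworkWordCount").setMaster("local[2]")
    // Chop the live stream into 1-second micro-batches (the "X seconds" above).
    val ssc = new StreamingContext(conf, Seconds(1))

    // Each batch of lines arrives as an RDD and is processed with RDD operations.
    val lines  = ssc.socketTextStream("localhost", 9999) // placeholder source
    val counts = lines.flatMap(_.split(" "))
                      .map(word => (word, 1))
                      .reduceByKey(_ + _)
    counts.print() // processed results come back batch by batch

    ssc.start()
    ssc.awaitTermination()
  }
}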
Reference:
http://www.slideshare.net/ptgoetz/apache-storm-vs-spark-streaming
