Thursday, 5 March 2015

Storm vs. Spark Streaming

Traditional streaming systems have an event-driven, record-at-a-time processing model:
  • Each node holds mutable state; for each record, it updates that state and sends out new records
  • The state is lost if the node dies, so making stateful stream processing fault-tolerant is challenging
1. Storm
  • Real-time, record-at-a-time streaming computation
  • Replays a record if a node fails to process (acknowledge) it
  • Processes each record at least once
  • May therefore update mutable state twice for the same record!
  • Mutable state can be lost on failure! (a bolt sketch follows this list)
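To make these failure modes concrete, here is a minimal sketch of a word-count bolt in Scala against Storm's 2015-era backtype.storm core API. The bolt and its "word" field are illustrative assumptions, not from the slides; the point is that the counts map is plain in-process mutable state.

import java.util.{Map => JMap}
import scala.collection.mutable
import backtype.storm.task.{OutputCollector, TopologyContext}
import backtype.storm.topology.OutputFieldsDeclarer
import backtype.storm.topology.base.BaseRichBolt
import backtype.storm.tuple.Tuple

class WordCountBolt extends BaseRichBolt {
  private var collector: OutputCollector = _
  private val counts = mutable.Map.empty[String, Long] // in-process mutable state

  override def prepare(conf: JMap[_, _], context: TopologyContext,
                       collector: OutputCollector): Unit = {
    this.collector = collector
  }

  override def execute(tuple: Tuple): Unit = {
    val word = tuple.getStringByField("word") // assumes the spout emits a "word" field
    counts(word) = counts.getOrElse(word, 0L) + 1L
    // Crash here, before the ack, and Storm replays the tuple: the same
    // word is counted twice (at-least-once semantics).
    collector.ack(tuple)
    // Crash at any point and the counts map itself is simply gone.
  }

  override def declareOutputFields(declarer: OutputFieldsDeclarer): Unit = ()
}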
2. Trident – uses transactions to update state
  • Micro-batch streaming computation
  • Processes each record exactly once
  • Per-state transactions to an external database are slow (a sketch follows this list)
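For contrast, a minimal Trident word count in Scala, modeled on the word-count example from the Trident tutorial (2015-era storm.trident packages). The persistentAggregate call is where exactly-once comes from: each micro-batch's counts are committed under a transaction id. MemoryMapState keeps state in memory for testing; swapping in a state factory backed by an external database is exactly where the per-transaction cost noted above shows up.

import backtype.storm.tuple.{Fields, Values}
import storm.trident.TridentTopology
import storm.trident.operation.builtin.Count
import storm.trident.operation.{BaseFunction, TridentCollector}
import storm.trident.testing.{FixedBatchSpout, MemoryMapState}
import storm.trident.tuple.TridentTuple

// Splits a sentence into words, emitting one tuple per word.
class Split extends BaseFunction {
  override def execute(tuple: TridentTuple, collector: TridentCollector): Unit =
    tuple.getString(0).split(" ").foreach(w => collector.emit(new Values(w)))
}

object TridentWordCount {
  def main(args: Array[String]): Unit = {
    // Test spout cycling through two fixed sentences, three tuples per batch.
    val spout = new FixedBatchSpout(new Fields("sentence"), 3,
      new Values("the cow jumped over the moon"),
      new Values("four score and seven years ago"))
    spout.setCycle(true)

    val topology = new TridentTopology()
    topology.newStream("spout1", spout)
      .each(new Fields("sentence"), new Split(), new Fields("word"))
      .groupBy(new Fields("word"))
      // Transactional, exactly-once state update per micro-batch; against a
      // real database this commit is the slow step.
      .persistentAggregate(new MemoryMapState.Factory(), new Count(), new Fields("count"))
    // Submit topology.build() with StormSubmitter, or run it in a LocalCluster.
  }
}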
3. Spark Streaming

Spark Streaming runs a streaming computation as a series of very small, deterministic batch jobs:

  • Chop up the live stream into batches of X seconds
  • Spark treats each batch of data as an RDD and processes it using RDD operations
  • The processed results of the RDD operations are returned in batches
  • Batch sizes can be as small as ½ second, for an end-to-end latency of about 1 second
  • Potential for combining batch processing and stream processing in the same system
  • All intermediate data are RDDs, so they can be recomputed if a worker node dies
  • Exactly-once semantics for all transformations (a word-count sketch follows this list)
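The canonical Spark Streaming word count (as it looked in the 1.x Scala API) makes the micro-batch model visible: the batch interval is set once on the StreamingContext, and everything else is ordinary RDD-style transformations. The host/port and local master are placeholder assumptions.

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object NetworkWordCount {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("NetworkWordCount").setMaster("local[2]")
    // Chop the live stream into 1-second micro-batches (the "X seconds" above).
    val ssc = new StreamingContext(conf, Seconds(1))

    // Each batch of lines arrives as an RDD and is processed with RDD operations.
    val lines  = ssc.socketTextStream("localhost", 9999) // placeholder source
    val counts = lines.flatMap(_.split(" "))
                      .map(word => (word, 1))
                      .reduceByKey(_ + _)
    counts.print() // processed results come back batch by batch

    ssc.start()
    ssc.awaitTermination()
  }
}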
Reference:
http://www.slideshare.net/ptgoetz/apache-storm-vs-spark-streaming
