- Each node has mutable state; for each record, it updates that state and emits new records
- The state is lost if the node dies. Making stateful stream processing fault-tolerant is challenging
- Record-at-a-time streaming operation
- Replays a record if it is not processed by a node
- Processes each record at least once
- May therefore update mutable state twice for the same record!
- Mutable state can be lost due to failure!
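The duplicate-update problem above can be shown with a minimal sketch (plain Python, not Storm's actual API): a node keeps a mutable counter, and when a record's acknowledgment is lost, the source replays it, so the counter is incremented twice for one logical record.

```python
# Hypothetical record-at-a-time node with mutable per-key state.
class CountingNode:
    def __init__(self):
        self.counts = {}  # mutable state: key -> count

    def process(self, record):
        key = record["key"]
        self.counts[key] = self.counts.get(key, 0) + 1

node = CountingNode()
node.process({"key": "a"})
node.process({"key": "a"})  # replay of the SAME record after a lost ack
# At-least-once delivery: the state now over-counts by one.
assert node.counts["a"] == 2
```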
- Micro-batch streaming operation
- Processes each record exactly once
- But committing a transaction to an external database per state update is slow
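One common way to get exactly-once updates from at-least-once delivery is to tag each micro-batch with a transaction id and store the last applied id alongside the state, so a replayed batch is detected and skipped. The sketch below is hypothetical (an in-memory dict standing in for the external database), not any particular framework's API:

```python
# Hypothetical transactional state: store (last_txid, count) per key so
# that re-applying a replayed batch is a no-op instead of a double count.
class TransactionalState:
    def __init__(self):
        self.db = {}  # key -> (last_txid, count); one "transaction" per update

    def apply(self, txid, key, delta):
        last_txid, count = self.db.get(key, (None, 0))
        if txid == last_txid:
            return  # this batch was already applied: skip the replay
        self.db[key] = (txid, count + delta)

state = TransactionalState()
state.apply(txid=1, key="a", delta=3)
state.apply(txid=1, key="a", delta=3)  # replayed batch, ignored
state.apply(txid=2, key="a", delta=2)
assert state.db["a"] == (2, 5)  # exactly-once result despite the replay
```

The cost is the point the bullet makes: every state update becomes a round trip to the external store, which is slow.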
Run a streaming computation as a series of very small, deterministic batch jobs
- Chop up the live stream into batches of X seconds
- Spark treats each batch of data as RDDs and processes them using RDD operations
- Finally, the processed results of the RDD operations are returned in batches
- Batch sizes as low as ½ second, latency ~ 1 second
- Potential for combining batch processing and streaming processing in the same system
- All intermediate data are RDDs, so they can be recomputed if lost when a worker node dies
- Exactly-once semantics for all transformations
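The chop-into-batches idea above can be sketched in a few lines of plain Python (not Spark's API): timestamped records are grouped into batches of X seconds, and each batch is then processed with an ordinary, deterministic batch operation, here a simple count.

```python
# Minimal micro-batching sketch: group (timestamp, value) records into
# consecutive fixed-width batches, then process each batch as a unit.
def micro_batches(records, batch_seconds):
    """Group (timestamp, value) records into consecutive time batches."""
    batches = {}
    for ts, value in records:
        batches.setdefault(int(ts // batch_seconds), []).append(value)
    return [batches[i] for i in sorted(batches)]

stream = [(0.1, "a"), (0.4, "b"), (1.2, "a"), (1.9, "c"), (2.5, "a")]
# One deterministic batch result per 1-second window: [2, 2, 1]
results = [len(batch) for batch in micro_batches(stream, batch_seconds=1)]
assert results == [2, 2, 1]
```

Because each batch computation is deterministic, a lost batch result can simply be recomputed from its input, which is what gives the model its fault tolerance and exactly-once semantics.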
http://www.slideshare.net/ptgoetz/apache-storm-vs-spark-streaming