Alvin's Big Data Notebook: Introduction to Samza

Samza has been an Apache incubator project since September 2013.

Samza is a stream processing system that runs on top of Kafka and lets us react to messages as they arrive: joining, filtering, and counting them. It addresses the latency problem of batch processing and makes it possible to process data in near real time. Samza continuously computes results as data arrives, which makes sub-second response times possible.

1. Architecture

A stream is composed of immutable sequences of messages of a similar type or category. In order to scale the system to handle large-scale data, we break down each stream into partitions.
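To make the idea of partitioning concrete, here is a minimal sketch in plain Python. It is not Samza or Kafka code. The names `partition_for`, `append`, `page_views`, and `NUM_PARTITIONS` are made up for illustration, and CRC32 is only a stand-in for whatever hash the real streaming layer uses. The sketch shows that messages with the same key land in the same partition and keep their order there.

```python
import zlib
from collections import defaultdict

NUM_PARTITIONS = 4  # hypothetical partition count for this example


def partition_for(key, num_partitions=NUM_PARTITIONS):
    """Map a message key to a partition using a stable hash (CRC32 here)."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def append(stream, key, value):
    """Append a message to the partition chosen by its key.

    Returns (partition, offset), where offset is the message's position
    within that partition.
    """
    partition = partition_for(key)
    stream[partition].append((key, value))
    return partition, len(stream[partition]) - 1


# A "stream" is modelled as: partition id -> ordered list of messages.
page_views = defaultdict(list)
for user, page in [("alice", "/home"), ("bob", "/cart"), ("alice", "/checkout")]:
    append(page_views, user, page)
```

Ordering is only preserved inside one partition: the two messages for "alice" stay in sequence, but nothing is said about how they relate to the message for "bob".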

A job is the code that consumes and processes a set of input streams. In order to scale the throughput of the stream processor, jobs are broken into smaller units of execution called tasks. Each task consumes data from one or more partitions for each of the job’s input streams. Because there is no defined ordering of messages across partitions, tasks can operate independently of one another.

Samza assigns groups of tasks to be executed inside one or more containers – UNIX processes running a JVM that execute a set of Samza tasks for a single job. Samza’s container code is single-threaded (when one task is processing a message, no other task in the container is active), and it is responsible for managing the startup, execution, and shutdown of one or more tasks.
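The relationship between partitions, tasks, and containers can be sketched in a few lines of plain Python. This is a toy model and not the Samza API. `CountingTask`, `Container`, and `assign` are hypothetical names, and the round-robin grouping of tasks into containers is an assumption made for the example. The real container also interleaves messages from different tasks, while this sketch simply drains one partition after another on a single thread.

```python
from collections import defaultdict


class CountingTask:
    """Hypothetical task: counts the messages it sees from its partition."""

    def __init__(self, partition):
        self.partition = partition
        self.seen = 0

    def process(self, message):
        self.seen += 1


class Container:
    """Runs its tasks on one thread: one task handles one message at a time."""

    def __init__(self, tasks):
        self.tasks = {task.partition: task for task in tasks}

    def run(self, partitions):
        # partitions: partition id -> list of messages
        for partition_id, task in self.tasks.items():
            for message in partitions.get(partition_id, []):
                task.process(message)


def assign(num_partitions, num_containers):
    """Create one task per partition and spread the tasks round-robin
    across containers (an assumption made for this sketch)."""
    groups = defaultdict(list)
    for partition in range(num_partitions):
        groups[partition % num_containers].append(CountingTask(partition))
    return [Container(groups[i]) for i in range(num_containers)]


partitions = {0: ["a", "b"], 1: ["c"], 2: ["d", "e", "f"], 3: []}
containers = assign(num_partitions=4, num_containers=2)
for container in containers:
    container.run(partitions)

# Collect the per-partition message counts from every task.
counts = {p: t.seen for c in containers for p, t in c.tasks.items()}
```

Because each task only touches its own partition, adding containers changes where tasks run but not what each task computes.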

Samza’s architecture is composed of three layers:

A streaming layer: responsible for providing partitioned streams that are replicated and durable
An execution layer: responsible for scheduling and coordinating tasks across the machines
A processing layer: responsible for processing the input stream and applying transformations

2. Samza’s most distinctive feature is its approach to managing processor state.

Some stream processing tasks are stateless and operate on one record at a time, but other uses, such as counts, aggregations, or joins over a window in the stream, require state to be buffered in the system. Unlike other stream processing systems that read state remotely from a database, which can cause throughput problems and break isolation, Samza stores the state for each task locally on disk on the same machine that the Samza container is running on. This drastically improves read performance, which helps make stream processing with state much easier. If you are doing stateful processing, Samza is likely a good fit for your use case. All writes to the local data store are replicated to a durable changelog stream (typically Kafka). When a machine fails, the task can consume the changelog stream to restore the contents of the local data store to a consistent state.
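The local-state-plus-changelog idea can be illustrated with a small Python sketch. It is a simplified model and not Samza's key-value store API. `LocalStore`, `restore`, and `count_page_views` are hypothetical names, and a plain Python list stands in for the durable changelog stream. Reads are served from the local dictionary, every write is also appended to the changelog, and a replacement task rebuilds its state by replaying that log.

```python
class LocalStore:
    """Hypothetical task-local key-value store backed by a changelog."""

    def __init__(self, changelog):
        self.changelog = changelog  # stand-in for a durable changelog stream
        self.data = {}              # stand-in for the on-disk local store

    def put(self, key, value):
        self.data[key] = value
        self.changelog.append((key, value))  # replicate every write

    def get(self, key, default=None):
        return self.data.get(key, default)   # served locally, no remote call

    @classmethod
    def restore(cls, changelog):
        """Rebuild the local state by replaying the changelog in order."""
        store = cls(changelog)
        for key, value in changelog:
            store.data[key] = value  # later entries overwrite earlier ones
        return store


def count_page_views(store, users):
    """Stateful processing: keep a running count per user."""
    for user in users:
        store.put(user, store.get(user, 0) + 1)


changelog = []
store = LocalStore(changelog)
count_page_views(store, ["alice", "bob", "alice"])

# Simulate a machine failure: the local store is lost, the changelog survives.
del store
recovered = LocalStore.restore(changelog)
```

After the replay, `recovered` holds the same counts the failed task had, so processing can continue from where it stopped.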

3. Samza is tightly integrated with Apache Kafka.
Samza places strong requirements on its streams: they must be ordered, highly available, partitioned, and durable. These strong requirements fit perfectly with Kafka’s stream model, and allow Samza to push a lot of hard problems into the underlying stream layer. This heavy reliance on Kafka, and Samza’s conscious effort to integrate with Kafka’s full feature set, makes Samza a great fit if your organization is already running Kafka.

Reference:
http://getprismatic.com/story/1420685185990

Friday, 9 January 2015
