Monday, 10 April 2017

Kafka Streams

Lambda vs Kappa Architecture

  • A scalable high-latency batch system that can process historical data and a low-latency stream processing system that can't reprocess results.
  • Use Kafka to retain the full log of the data you want to be able to reprocess Retaining large amounts of data in kafka is a perfectly natural and economical thing to do and won't hurt performance.

The KStreams DSL is composed of two main abstractions; the KStream and KTable interfaces.
  • A KStream is a record stream where each key-value pair is an independent record. Later records in the stream don’t replace earlier records with matching keys. 
  • A KTable on the other hand is a “changelog” stream, meaning later records are considered updates to earlier records with the same key.
Local State Store:

Kafka streams provide an efficent way to model the application state.
The state store is partitioned the same way as the application's key space. As a result, all the data required to serve the queries that arrive at a particular application instance are available locally in the state store shards. Fault tolerance for this local state store is provided by kafka streams by logging all updates made to the state store, transparently, to a highly-available and durable kafka topic.
Kafka streams uses kafka like a commit log for its local, embedded database.

The aggregation is done on each instance with a local state. Then, the results are written to a new topic with a single partition, which will be read by a single application instance. This type of multi-phase processing is very familiar to those  mapreduce.

Interactive Queries allow us to treat the stream processing layer as a lightweight, embedded database and directly query the state of your stream processing application, without needing to materialize that state to external databases or storage.

Kafka steams simplifies the stream processing architecture by eliminating the need for a separate cluster. These embedded databases act as materialized views of logs.
Materialized views provide better application isolation because they are part of an application's state, also provide better performance.

Interactive Queries enables faster and more efficient use of the application state. There is no duplication of data between the store doing aggregation for stream processing and the store answering queries.

Discovering any instance's stores

In kafka, each instance may expose its endpoint information metadata to other instances of the same application. The IQ APIs allow a developer to obtain the metadata for a given store name and key, and do that across the instances of an application. Then, we can discover where the store is that holds a particular key by examining the metadata.

Event streams are ordered, while records in a table are always considered unordered.
Events, once occured, can never be modified. Instead an additional event is written to the stream, recording a cancellation of previous transaction.

Streams contain a stream of events and each event caused a change. A table contains a current state of the world, a state that is a result of many changes.

If we can capture all the changes that happen to the database table in a stream of events, we can have our stream processing job listen to this stream and update the cache based on database change events.

Stream-table join: one of the streams represents changes to a locally cached table. One stream is joining a stream with a table to enrich all events with info in the table. This is similar to joining a fact table with a dimension.
Stream-stream join: Streams have the same key and happened in the same time windows, also called a windowed-join. They are partitioned on the same keys, which are also the join keys.

Kafka Streams scales by allowing multiple threads of executions within one instance of the application., and supporting load balancing between distributed instances of the application. The number of tasks is determined by streams engine, and depends on the number of partitions in the topics. Each task is responsible for a subset of the partitions. The developer of the application can choose the number of threads each application instance will execute. Each task will independently process events from those partitions and maintain its own local state with relevant aggregates.

Kafka assigns all the partitions needed for one join to the same task, so this task can consume from all the relevant partitions and perform the join independently. Kafka streams requries that all topics that participate in a join operation will hae the same number of partitions and be partitioned based on the join key. Instead of shuffling, Kafka streams re-partitions by writing the events to a new topic with the new keys and partitions. It reduce dependencies between different parts of a pipeline.

Global KTable is more service concept, while KTable is more about streaming concept:

All partitions replicated to all nodes.
Supports N-way join
Blocks until initialized
Doesn't trigger processing(reference resources)

By default KStreams keeps state store backed up on a secondary node.

No comments:

Post a Comment