Monday, 10 April 2017

Kafka Streams

Lambda vs Kappa Architecture

  • A scalable high-latency batch system that can process historical data and a low-latency stream processing system that can't reprocess results.
  • Use Kafka to retain the full log of the data you want to be able to reprocess Retaining large amounts of data in kafka is a perfectly natural and economical thing to do and won't hurt performance.

The KStreams DSL is composed of two main abstractions; the KStream and KTable interfaces.
  • A KStream is a record stream where each key-value pair is an independent record. Later records in the stream don’t replace earlier records with matching keys. 
  • A KTable on the other hand is a “changelog” stream, meaning later records are considered updates to earlier records with the same key.
Local State Store:

Kafka streams provide an efficent way to model the application state.
The state store is partitioned the same way as the application's key space. As a result, all the data required to serve the queries that arrive at a particular application instance are available locally in the state store shards. Fault tolerance for this local state store is provided by kafka streams by logging all updates made to the state store, transparently, to a highly-available and durable kafka topic.
Kafka streams uses kafka like a commit log for its local, embedded database.

The aggregation is done on each instance with a local state. Then, the results are written to a new topic with a single partition, which will be read by a single application instance. This type of multi-phase processing is very familiar to those  mapreduce.

Interactive Queries allow us to treat the stream processing layer as a lightweight, embedded database and directly query the state of your stream processing application, without needing to materialize that state to external databases or storage.

Kafka steams simplifies the stream processing architecture by eliminating the need for a separate cluster. These embedded databases act as materialized views of logs.
Materialized views provide better application isolation because they are part of an application's state, also provide better performance.

Interactive Queries enables faster and more efficient use of the application state. There is no duplication of data between the store doing aggregation for stream processing and the store answering queries.

Discovering any instance's stores

In kafka, each instance may expose its endpoint information metadata to other instances of the same application. The IQ APIs allow a developer to obtain the metadata for a given store name and key, and do that across the instances of an application. Then, we can discover where the store is that holds a particular key by examining the metadata.

Event streams are ordered, while records in a table are always considered unordered.
Events, once occured, can never be modified. Instead an additional event is written to the stream, recording a cancellation of previous transaction.

Streams contain a stream of events and each event caused a change. A table contains a current state of the world, a state that is a result of many changes.

If we can capture all the changes that happen to the database table in a stream of events, we can have our stream processing job listen to this stream and update the cache based on database change events.

Stream-table join: one of the streams represents changes to a locally cached table. One stream is joining a stream with a table to enrich all events with info in the table. This is similar to joining a fact table with a dimension.
Stream-stream join: Streams have the same key and happened in the same time windows, also called a windowed-join. They are partitioned on the same keys, which are also the join keys.

Kafka Streams scales by allowing multiple threads of executions within one instance of the application., and supporting load balancing between distributed instances of the application. The number of tasks is determined by streams engine, and depends on the number of partitions in the topics. Each task is responsible for a subset of the partitions. The developer of the application can choose the number of threads each application instance will execute. Each task will independently process events from those partitions and maintain its own local state with relevant aggregates.

Kafka assigns all the partitions needed for one join to the same task, so this task can consume from all the relevant partitions and perform the join independently. Kafka streams requries that all topics that participate in a join operation will hae the same number of partitions and be partitioned based on the join key. Instead of shuffling, Kafka streams re-partitions by writing the events to a new topic with the new keys and partitions. It reduce dependencies between different parts of a pipeline.

Global KTable is more service concept, while KTable is more about streaming concept:

All partitions replicated to all nodes.
Supports N-way join
Blocks until initialized
Doesn't trigger processing(reference resources)

By default KStreams keeps state store backed up on a secondary node.


  1. Existing without the answers to the difficulties you’ve sorted out through this guide is a critical case, as well as the kind which could have badly affected my entire career if I had not discovered your website.
    full stack developer training in bangalore

  2. I wish to show thanks to you just for bailing me out of this particular trouble.As a result of checking through the net and meeting techniques that were not productive, I thought my life was done
    digital marketing training in tambaram

  3. Resources like the one you mentioned here will be very useful to me ! I will post a link to this page on my blog. I am sure my visitors will find that very useful
    Click here:
    python training in tambaram
    Click here:
    python training in annanagar

  4. Thanks for the informative article. This is one of the best resources I have found in quite some time. Nicely written and great info. I really cannot thank you enough for sharing.
    Blueprism training in Chennai

    Blueprism training in Bangalore

    Blueprism training in Pune

  5. Good Post! Thank you so much for sharing this pretty post, it was so good to read and useful to improve my knowledge as updated one, keep blogging.

    Data Science Training in Chennai
    Data science training in bangalore
    Data science online training
    Data science training in pune

  6. That was a great message in my carrier, and It's wonderful commands like mind relaxes with understand words of knowledge by information's.

    java training in tambaram | java training in velachery

    java training in omr | oracle training in chennai

  7. The knowledge of technology you have been sharing thorough this post is very much helpful to develop new idea. here by i also want to share this.
    python training in pune | python training institute in chennai | python training in Bangalore