Alvin's Big Data Notebook : Introduction to Kafka

1. Apache Kafka is a distributed publish-subscribe messaging system maintaining a feed of messages categorized into topics. A topic is a queue that can be consumed by multiple consumers. As a distributed service, Kafka has significant scalability and reliability advantages over traditional queuing and messaging systems such as ActiveMQ.

2. Kafka stores published messages at brokers in logical log in an appending approach. A message is only exposed to the consumers after it is flushed. Messages are stored in topics, which are split into partitions, which are replicated.Replicas of a partition are never read from or written to, just for backup.

3. Each message in a partition has a unique sequence number associated with it called an offset. Each pull request contains the offset of the message from which the consumption begins and an acceptable number of bytes to fetch. After receives a message, it computes the offset of the next message to consume and uses it in the next pull request. The brokers don't know the offset. A consumer can deliberately rewind back to an old offset and re-consume data in retention time(7 days).

4. Kafka only guarantees at-least-once delivery. When a consumer process crashes without a clean shutdown, another consumer taking over those partitions may get duplicate messages. We must add de-duplication logic in the consumer.

5. Message is partitioned and pushed by the producer, and the broker stores the messages in the same order as they arrive. The consumer can pull the messages at the maximum rate it can sustain and avoid being flooded by messages pushed faster than it can handle. The pull model also makes it easy to rewind a consumer.

6. Messages in the partitions are each assigned a unique(per partition) and sequential id called offset.
Consumers track their pointers via (offset, partition, topic) tuples.

7. Zookeeper used by Kalfka v0.8
Broker: general state info, leader election; Consumers: track message offsets.

8. Integrate Kafka with Hadoop

(1) Deploying Kafka on a separate cluster than Hadoop, but in the same LAN.

Kafka is a heavy disk I/O and memory user, and it can lead to resource contention with Hadoop tasks. At the same time it doesn’t benefit from being co-located on DataNodes since MapReduce tasks are not aware of data locality in Kafka.

(2) Alternative Integrators

SparkStreaming - Kafka reciever supports Kafka 0.8 and above
Camus - LinkedIn's Kafka=>HDFS pipeline, no documentations on configuration.
Kafka Hadoop Loader A different take on Hadoop loading functionality from what is included in the main distribution.
Kafka's built-in Hadoop consumer
Flume - Contains Kafka Source (consumer) and Sink (producer)

9. Consumer Groups

Each consumer is represented as a process and these processes are organized within groups called consumer groups.

A group ID is associated with each consumer. All the consumers with the same group ID act as a single logical consumer. Each message of the topic is delivered to one consumer from a consumer group (with the same group ID). Different consumer groups for a particular topic can process messages at their own pace as messages are not removed from the topics as soon as they are consumed.

10. Data Rentention

Each topic in Kafka has an associated retention time that can be controlled with the log.retention.minutes property in the broker configuration. When this time expires, Kafka deletes the expired data files for that particular topic. This is a very efficient operation as it's a file delete operation.

Alvin's Big Data Notebook

Saturday, 4 October 2014

Introduction to Kafka

No comments:

Post a Comment