Friday 27 January 2017

Kafka 0.10 Developer Training Notes

Kafka decouples data source and destination systems via a pub/sub architecture.

Data retention time in Kafka can be configured on a per-topic basis.

If no key is specified and the default partitioner is used, messages are distributed across the topic's partitions on a round-robin basis. All messages with the same key go to the same partition.
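
For example, with the Java producer (a minimal sketch; the broker address and topic name are placeholders):

Properties props = new Properties();
props.put("bootstrap.servers", "broker1:9092");  // placeholder broker address
props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
KafkaProducer<String, String> producer = new KafkaProducer<>(props);

// No key: the default partitioner spreads records across partitions round-robin.
producer.send(new ProducerRecord<>("mytopic", "a value with no key"));

// Keyed: every record with key "user-42" lands on the same partition.
producer.send(new ProducerRecord<>("mytopic", "user-42", "a value for user-42"));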

Typically, different systems will write to different topics.

Partitions:

Each partition is stored on the broker's disk as one or more commit log files.
Each message in the log is identified by its offset.

Data within a partition is stored in the order in which it was written.
Therefore, data read from a single partition is read in order for that partition.
There is, however, no total ordering when reading from multiple partitions.
All data from a given partition goes to a single consumer in the group.
All messages with the same key will therefore go to the same consumer.

The number of useful consumers in a consumer group is constrained by the number of partitions on the topic.

One broker is the leader for a given partition; the other brokers are followers.
Leaders for different partitions can be on different brokers.
All writes and reads go ONLY to and from the leader, never the followers.
Replicas exist solely to provide reliability in case of broker failure.
If a leader fails, the Kafka cluster elects a new leader from among the followers, using ZooKeeper.

In-Sync Replicas (ISRs) are replicas that are up to date with the leader. If the leader fails, the new leader is elected from the list of ISRs.


To seek to the beginning of all partitions that are being read by a consumer for a particular topic:

consumer.subscribe(Collections.singletonList("mytopic"));  // subscribe() takes a collection of topics
consumer.poll(0);                                          // triggers the initial partition assignment
consumer.seekToBeginning(consumer.assignment());           // rewind every assigned partition

assignment() returns the set of TopicPartitions currently assigned to this consumer.

Downsides of too many partitions:

Partition unavailability in the event of broker failure.
End-to-end latency: a broker uses a single thread to replicate data from another broker.
Memory requirements (buffers on clients are allocated per partition).

Producers:

KafkaProducer is thread-safe; sharing a single producer instance across threads will typically be faster than having multiple instances.

The send() call is asynchronous: it returns immediately after adding the message to a buffer of pending sends, which allows the producer to batch records for better performance.
send() returns a Future containing a RecordMetadata object.
An optional Callback is invoked when the send() has been acknowledged, as sketched below.
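
A minimal sketch of both styles, reusing the producer from the earlier example (record is assumed to be a ProducerRecord):

// Synchronous: block on the Future until the broker acknowledges the send.
RecordMetadata md = producer.send(record).get();
System.out.printf("partition=%d offset=%d%n", md.partition(), md.offset());

// Asynchronous: the callback fires once the send is acknowledged (or fails).
producer.send(record, (metadata, exception) -> {
    if (exception != null)
        exception.printStackTrace();
    else
        System.out.printf("acked: partition=%d offset=%d%n", metadata.partition(), metadata.offset());
});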


Consumers:

The consumer offset is the offset of the next message the consumer will read, not the last message that was read.

The advantages of pulling, rather than pushing data:

  • The ability to add more consumers to the system without reconfiguring the cluster.
  • The ability for a consumer to go offline and return later, resuming from where it left off.
  • The consumer can pull and process data at whatever speed it needs to.

If the number of consumers in a group changes, a partition rebalance occurs:

  • Partition ownership is moved around between the consumers to spread the load evenly.
  • Consumers cannot consume messages during the rebalance, so this results in a short pause in message consumption.

Consumer offsets are stored in a special Kafka topic (__consumer_offsets). Prior to 0.9, offsets were stored in ZooKeeper. In some cases, you may want to store offsets in a database table instead: read the saved value, then use seek() to move to the correct position, as sketched below.
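
A sketch of restoring externally stored offsets; offsetDao is a hypothetical data-access object backed by your database table:

consumer.subscribe(Collections.singletonList("mytopic"));
consumer.poll(0);  // join the group and receive a partition assignment
for (TopicPartition tp : consumer.assignment()) {
    long saved = offsetDao.loadOffset(tp.topic(), tp.partition());  // hypothetical lookup
    consumer.seek(tp, saved);  // the next poll() will read from the saved position
}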

consumer.poll(Long.MAX_VALUE)

Each call to poll() returns a batch of records. The parameter is the maximum time in ms that the consumer will block if no new records are available; if records are available, poll() returns immediately.
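
A typical poll loop (assuming a configured KafkaConsumer<String, String> named consumer, as sketched in the next section):

while (true) {
    ConsumerRecords<String, String> records = consumer.poll(1000);  // block up to 1000 ms
    for (ConsumerRecord<String, String> rec : records) {
        System.out.printf("partition=%d offset=%d key=%s value=%s%n",
                          rec.partition(), rec.offset(), rec.key(), rec.value());
    }
}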

Automatic Commit:

Consumer offsets are committed periodically (every 5 seconds by default) during the poll() call.
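
The relevant settings, shown with their defaults (a minimal sketch; the broker address and group id are placeholders):

Properties props = new Properties();
props.put("bootstrap.servers", "broker1:9092");
props.put("group.id", "my-group");
props.put("enable.auto.commit", "true");       // the default
props.put("auto.commit.interval.ms", "5000");  // commit every 5 seconds (the default)
props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);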

Note that KafkaConsumer is not thread-safe.


Avro Schema Management:

Serialization is the process of converting data in memory into a series of bytes, which is needed to transfer data across the network or store it on disk. Deserialization is the process of converting the stream of bytes back into the data object.

Avro has three ways of creating records:
- Generic: Write code to map each schema field to a field in your object.
- Reflection: Generate a schema from an existing Java class.
- Specific: Generate a Java class from your schema.
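
For example, Generic vs. Specific with a hypothetical Customer record (the Specific class would be generated from the schema, e.g. by the avro-maven-plugin):

Schema schema = new Schema.Parser().parse(
    "{\"type\": \"record\", \"name\": \"Customer\", \"namespace\": \"com.example\","
  + " \"fields\": [{\"name\": \"name\", \"type\": \"string\"}]}");

GenericRecord generic = new GenericData.Record(schema);  // Generic: fields resolved at runtime
generic.put("name", "Alice");

Customer specific = Customer.newBuilder()  // Specific: type-safe generated class
                            .setName("Alice")
                            .build();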

The namespace in the schema becomes the Java package name, which you import into your code.

Backward Compatibility: code with a new version of the schema can read data written with the old schema, filling in default values for fields that are not provided.

Forward Compatibility: code with the previous version of the schema can read data written with the new schema by ignoring the new fields.
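
For example, adding an optional field with a default to a hypothetical Customer schema keeps it compatible in both directions: new readers fill in the default when the field is missing from old data (backward), and old readers simply ignore the new field (forward):

{
  "type": "record",
  "name": "Customer",
  "namespace": "com.example",
  "fields": [
    {"name": "name",  "type": "string"},
    {"name": "email", "type": ["null", "string"], "default": null}
  ]
}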

Schema Registry allows you to submit a schema; in the background it returns a schema ID, which is then used in subsequent Produce and Consume requests. It stores schema information in a special Kafka topic.
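
With the Confluent Avro serializers, a producer only needs the registry URL; schema registration and ID lookup happen behind the scenes (a sketch; the addresses are placeholders):

Properties props = new Properties();
props.put("bootstrap.servers", "broker1:9092");
props.put("key.serializer", "io.confluent.kafka.serializers.KafkaAvroSerializer");
props.put("value.serializer", "io.confluent.kafka.serializers.KafkaAvroSerializer");
props.put("schema.registry.url", "http://schemaregistry:8081");  // placeholder registry URL
KafkaProducer<String, GenericRecord> producer = new KafkaProducer<>(props);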





