Java Interview Preparation - Kafka
An upstream system produces duplicate records. How can a consumer eliminate the duplicates and process each message only once?

In Kafka, consumers can handle duplicate records by implementing idempotency. Idempotency ensures that multiple identical requests have the same effect as a single request. Here's how you can achieve this:

Message Key: Ensure that each message has a unique key. This key can be used to check if a message has already been processed.

Idempotent Consumer: Implement an idempotent consumer. This means that the consumer keeps track of the message keys it has already processed. If it receives a message with a key it has seen before, it knows that it is a duplicate and can ignore it.

Database Support: If you're storing the processed data in a database, you can use features of the database to help. For example, you could use a unique constraint on the message key in the database. If a duplicate message is processed, inserting it into the database will fail because of the unique constraint.
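The idempotent-consumer idea above can be sketched in plain Java. This is a hypothetical in-memory version for illustration; in production the set of seen keys would live in a durable store (a database unique constraint, Redis, etc.) so it survives restarts.

```java
import java.util.HashSet;
import java.util.Set;

// Toy idempotent consumer: remember every message key we have processed
// and silently skip any key we have seen before.
public class IdempotentConsumer {
    private final Set<String> processedKeys = new HashSet<>();

    // Returns true if the message was processed, false if it was a duplicate.
    public boolean process(String messageKey, String payload) {
        if (!processedKeys.add(messageKey)) {
            return false; // duplicate: key already seen, ignore the message
        }
        // ... real business processing of payload would happen here ...
        return true;
    }

    public static void main(String[] args) {
        IdempotentConsumer consumer = new IdempotentConsumer();
        System.out.println(consumer.process("order-42", "first copy")); // true
        System.out.println(consumer.process("order-42", "duplicate"));  // false
    }
}
```

The same pattern works regardless of where the key set is stored; only the lookup and insert change.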

In Kafka does it have a concept of queue & topic?

In Apache Kafka, the primary abstraction for a stream of records is a topic. Kafka topics are similar to queues in other message passing systems, but with some key differences.

Topics: In Kafka, producers write data to topics. Topics are split into partitions, which is where Kafka gets its ability to scale horizontally. Each partition can be hosted on a different server, which means a single topic can be scaled across many servers to increase capacity. Consumers read from Kafka topics at their own pace and can also read from multiple partitions in parallel.

Queues: Kafka doesn't have queues in the traditional sense like other messaging systems such as RabbitMQ or ActiveMQ. Instead, Kafka provides consumer groups to enable many consumers to work together, consuming data in parallel. Kafka delivers each message in the topic to one consumer in every consumer group. This is similar to the concept of a queue in other messaging systems, but it's more flexible because each consumer group gets its own copy of the data.

So, in Kafka, you create topics. The concept of queues exists, but it's handled differently through the use of consumer groups.

Does Kafka support JMS?

No, you cannot use a JMS (Java Message Service) Listener with Apache Kafka directly because Kafka does not support the JMS API. Kafka has its own API for producing and consuming messages.

JMS is a Java API that allows applications to create, send, receive, and read messages in a loosely coupled, reliable, and asynchronous way. It's typically used with traditional messaging systems like ActiveMQ or RabbitMQ.

Kafka, on the other hand, is a distributed streaming platform designed to handle real-time data feeds with high throughput and fault tolerance. It has its own client APIs for different languages including Java.

Can you acknowledge messages in Kafka?

In Kafka, consumers can acknowledge messages by committing their offsets. Offsets are the position of a consumer in a partition. When a consumer reads a message from a partition, it can commit the offset to indicate that it has processed the message successfully.

There are two ways to commit offsets in Kafka:

Automatic Commit: In this mode, the consumer commits offsets automatically at regular intervals (controlled by auto.commit.interval.ms). This is the default behavior in Kafka. However, automatic commits can lead to message loss (if an offset is committed before processing finishes and the consumer then crashes) or to duplicate processing (if the consumer crashes after processing but before the next automatic commit).

Manual Commit: In this mode, the consumer commits offsets explicitly after processing each message. This gives you more control over when offsets are committed and can help prevent message loss. However, you need to handle retries and error scenarios yourself.
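The trade-off behind manual commits can be made concrete with a toy calculation (this is a simulation, not the real Kafka consumer API): if a consumer crashes after processing some messages but before committing, it resumes from the last committed offset, so the uncommitted messages are re-delivered. This is the at-least-once guarantee.

```java
// Toy model of commit semantics. A consumer processes `processed` messages
// but only commits an offset after every `commitEvery` messages. If it
// crashes, it resumes from the last committed offset, re-reading anything
// processed-but-uncommitted.
public class CommitDemo {
    public static int resumeOffset(int processed, int commitEvery) {
        // integer division drops the uncommitted tail
        return (processed / commitEvery) * commitEvery;
    }

    public static void main(String[] args) {
        // 5 messages processed, commits after every 2: offsets up to 4 are
        // committed, so the message at offset 4 is processed again on restart.
        System.out.println(resumeOffset(5, 2)); // 4
    }
}
```

Committing after every single message (commitEvery = 1) minimizes re-delivery but costs a commit round-trip per message; batching commits trades throughput against more duplicates after a crash.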

How long does Kafka retain messages?

In Kafka, the retention period for messages is controlled by the log.retention.hours and log.retention.bytes configuration settings.

log.retention.hours: This setting specifies the maximum time that a message will be retained in a topic. After this period, the log segments containing those messages become eligible for deletion. (The finer-grained variants log.retention.minutes and log.retention.ms take precedence if set.)

log.retention.bytes: This setting specifies the maximum size the log of a single partition may grow to before old log segments are deleted. Note that this is a per-partition limit, not a per-topic one, and the size of individual segment files is controlled separately by log.segment.bytes.

By default, Kafka retains messages for 7 days (log.retention.hours=168). You can change this broker-wide setting, or override it per topic with the retention.ms and retention.bytes topic configs, to retain messages for a longer or shorter period depending on your requirements.
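As a sketch, the broker-level settings might look like this in server.properties (the values here are illustrative, not recommendations):

```properties
# Keep messages for 7 days (the broker default)
log.retention.hours=168
# Optionally also cap each partition's log at ~1 GiB;
# the default of -1 disables the size-based limit
log.retention.bytes=1073741824
```

Whichever limit is reached first (time or size) triggers deletion of old segments for that partition.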

If we have multiple consumers in a consumer group, how does Kafka ensure that each message is processed by only one consumer?

When multiple consumers belong to the same consumer group, each message in a topic is delivered to one consumer in the group. Kafka ensures that each partition is only consumed by one consumer in the group at a time. This way, every message is consumed once and only once by the consumer group.

If you have more consumers than partitions in a topic, some consumers will be idle because each partition is only consumed by one consumer in the group. If you have more partitions than consumers, consumers will receive messages from multiple partitions.

Kafka maintains a mapping of which consumer is currently consuming from which partition. After processing messages, a consumer commits its offset back to Kafka. The committed offset marks the position of the next record the consumer should read in that partition, so after a restart or rebalance Kafka knows exactly where that consumer (or its replacement) should resume fetching.
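The partition-to-consumer mapping described above can be illustrated with a toy round-robin assignment (the real rebalance protocol is negotiated by the group coordinator and supports several assignor strategies, so treat this only as a sketch of the invariant: each partition has exactly one owner in the group).

```java
import java.util.ArrayList;
import java.util.List;

// Toy illustration of group assignment: every partition is owned by exactly
// one consumer; with more consumers than partitions, some consumers sit idle.
public class GroupAssignment {
    // Returns, for each partition index, the index of the consumer that owns it.
    public static List<Integer> assign(int partitions, int consumers) {
        List<Integer> owner = new ArrayList<>();
        for (int p = 0; p < partitions; p++) {
            owner.add(p % consumers); // simple round-robin ownership
        }
        return owner;
    }

    public static void main(String[] args) {
        System.out.println(assign(3, 2)); // [0, 1, 0] -> consumer 0 owns two partitions
        System.out.println(assign(2, 3)); // [0, 1]    -> consumer 2 is idle
    }
}
```

Note how the output for assign(2, 3) never mentions consumer 2: that is the "more consumers than partitions means idle consumers" rule from the paragraph above.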

How do we create partitions in Kafka?

In Kafka, partitions are created when you create a topic. You specify the number of partitions with the --partitions option of the kafka-topics command.

Here's an example of how to create a topic with 3 partitions:

kafka-topics --create --bootstrap-server localhost:9092 --replication-factor 1 --partitions 3 --topic my-topic

In this example, the kafka-topics command is used to create a new topic named my-topic with 3 partitions. (Older Kafka versions used --zookeeper localhost:2181 instead of --bootstrap-server; the ZooKeeper option was removed from this tool in Kafka 3.0.)

If you want to change the number of partitions for an existing topic, you can use the --alter option with the kafka-topics command:

kafka-topics --bootstrap-server localhost:9092 --alter --topic my-topic --partitions 6

This command increases the number of partitions of my-topic to 6. Note that you can only increase the number of partitions, never decrease them. Also, increasing the partition count doesn't rebalance existing data; it only affects new data, and it changes which partition keyed messages hash to, which can break per-key ordering across the change.

Is the number of partitions in a topic tied to the number of Kafka brokers?

Yes, the number of partitions in a Kafka topic is related to the number of Kafka servers (brokers) in a cluster, but they are not directly dependent on each other.

A Kafka partition is a unit of parallelism in Kafka, and each partition can be hosted on a different server. This means that the more partitions you have, the more you can take advantage of the parallel processing power of a Kafka cluster.

However, there are a few things to consider:

  1. More Partitions, More Throughput: More partitions can lead to higher throughput because you can have more consumers working in parallel (assuming you have enough consumer instances).

  2. Broker and Partition Balance: Kafka tries to spread each topic's partitions evenly across all brokers in the cluster. If you have more partitions than brokers, some brokers will have to host multiple partitions from the same topic.

  3. Overhead: Each partition has some overhead on the broker (in terms of open file handles, memory usage, etc.), so having a large number of partitions on a single broker can cause performance issues.

  4. Consumer Group Limit: The maximum number of consumers in a consumer group is bounded by the number of partitions. If you have more consumers in a consumer group than partitions, some consumers will be idle.

So, while the number of partitions doesn't directly depend on the number of Kafka servers, they should be planned together to ensure good performance and fault tolerance.

What is the purpose of message keys in Kafka?

Message keys in Kafka serve two main purposes:

  1. Partitioning: When producing messages to a topic, if a key is provided, Kafka uses a hash of the key to determine the partition within the topic where the message will be written. All messages with the same key will be written to the same partition. This ensures that all messages pertaining to the same key (for example, the same customer ID or the same country code) are stored in the same partition and hence are kept in order.

  2. Message Ordering: Within a single partition, messages are guaranteed to be stored in the order they were produced. This means that if you use keys and produce messages with the same key, you can rely on those messages being in order when consumed.
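The key-to-partition mapping can be sketched in a few lines. Kafka's default partitioner actually uses a murmur2 hash of the serialized key bytes; String.hashCode() is used here only as a stand-in, but the property being illustrated is the same: equal keys always map to the same partition.

```java
// Sketch of key-based partitioning (Kafka really uses murmur2, not hashCode).
public class KeyPartitioner {
    public static int partitionFor(String key, int numPartitions) {
        // mask off the sign bit so the result is always non-negative
        return (key.hashCode() & 0x7fffffff) % numPartitions;
    }

    public static void main(String[] args) {
        int p1 = partitionFor("customer-123", 6);
        int p2 = partitionFor("customer-123", 6);
        // Same key, same partition every time, so per-key ordering holds.
        System.out.println(p1 == p2); // true
    }
}
```

Messages produced without a key are instead spread across partitions (round-robin in older clients, sticky batching in newer ones), so they carry no ordering guarantee relative to each other.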

How can you achieve message ordering in Kafka?

Message ordering in Kafka is achieved through the concept of partitions. Within each partition, messages are ordered and appended to the partition in the order they are produced.

When a producer sends a message to a Kafka topic, the message is appended to the end of one of the topic's partitions. Each message within a partition is assigned a unique, sequential ID number called the offset. The offset for the first message in the partition is 0, and it increases by one for each message that is subsequently written to the partition.

When a consumer reads from a partition, it does so in order from the lowest offset to the highest, which means it reads the messages in the order they were added to the partition.

It's important to note that this ordering guarantee only applies within a single partition, not across multiple partitions in a topic. If strict ordering of all messages is required, you would need to use only one partition, which could limit throughput.
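The offset mechanics described above can be modeled with a minimal in-memory partition (a simulation for illustration, not Kafka's actual storage engine): appends assign sequential offsets starting at 0, and reading by ascending offset returns records in exactly the order they were produced.

```java
import java.util.ArrayList;
import java.util.List;

// Minimal model of a single partition: an append-only log where each record
// gets the next sequential offset, mirroring the ordering guarantee above.
public class PartitionLog {
    private final List<String> records = new ArrayList<>();

    // Appends a record and returns the offset it was assigned.
    public long append(String record) {
        records.add(record);
        return records.size() - 1;
    }

    public String read(long offset) {
        return records.get((int) offset);
    }

    public static void main(String[] args) {
        PartitionLog log = new PartitionLog();
        System.out.println(log.append("first"));  // 0
        System.out.println(log.append("second")); // 1
        System.out.println(log.read(0));          // first
    }
}
```

A topic with several partitions is, in effect, several independent logs like this one, which is precisely why ordering holds within a partition but not across them.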