Apache Kafka
What is Kafka?
Kafka is a distributed system consisting of servers and clients that communicate via a high-performance TCP network protocol. It can be deployed on bare-metal hardware, virtual machines, and containers in on-premise as well as cloud environments. Your interviewer will expect you to understand the core concepts of Kafka and when to use it in a system design.
Core Concepts of Kafka
- Topics, Partitions, and Offsets: Kafka organizes data into topics. A topic is a category or feed name to which records are published. Topics in Kafka are always multi-subscriber; that is, a topic can have zero, one, or many consumers that subscribe to the data written to it. For each topic, the Kafka cluster maintains a partitioned log. Each partition is an ordered, immutable sequence of records that is continually appended to—a structured commit log. The records in the partitions are each assigned a sequential ID number called the offset that uniquely identifies each record within the partition.
- Producers and Consumers: Producers are those client applications that publish (write) events to Kafka, and consumers are those that subscribe to (read and process) these events.
- Brokers: A Kafka cluster is composed of one or more servers, each of which is called a broker.
- ZooKeeper / KRaft: Historically, Kafka used ZooKeeper to coordinate the brokers and manage cluster metadata (controller election, topology). Newer Kafka versions replace ZooKeeper with the built-in KRaft consensus protocol, and ZooKeeper support is removed entirely in Kafka 4.0.
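The partitioned-log model described above can be sketched as a toy in-memory structure (this is illustrative only, not a real Kafka client; all names are made up for the example):

```python
# Toy model of a Kafka topic: a fixed number of partitions, each an
# append-only list whose index serves as the record's offset.
class Topic:
    def __init__(self, name, num_partitions=3):
        self.name = name
        self.partitions = [[] for _ in range(num_partitions)]

    def append(self, partition, record):
        """Append a record to a partition and return its offset."""
        log = self.partitions[partition]
        log.append(record)
        return len(log) - 1  # offsets are sequential within one partition

topic = Topic("orders")
print(topic.append(0, "order-1"))  # offset 0 in partition 0
print(topic.append(0, "order-2"))  # offset 1 in partition 0
print(topic.append(1, "order-3"))  # offset 0 in partition 1
```

Note that offsets are only meaningful within a single partition; there is no global ordering across partitions of a topic.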
Producers, Consumers, and Brokers
The core of Kafka's architecture is the interplay between producers, consumers, and brokers. Your interviewer will expect you to have a clear understanding of how these components work together.
- Producers: Producers are responsible for writing data to Kafka topics. They can write to a specific partition or let Kafka choose a partition for them. Producers can also be configured to receive acknowledgements from the broker when a message has been successfully written.
- Consumers: Consumers read data from Kafka topics. They subscribe to one or more topics and read messages in the order they were produced within each partition. Each consumer tracks which messages it has already read by using an offset.
- Brokers: Brokers are the servers that make up a Kafka cluster. They are responsible for storing the data and serving it to consumers. Each broker is the leader for some of its partitions and a follower for others.
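How does a producer "let Kafka choose a partition"? For keyed records, the default partitioner hashes the record key, so all records with the same key land in the same partition and stay ordered relative to each other. Kafka's client actually uses murmur2 hashing; the sketch below substitutes CRC32 from the standard library for determinism, so treat it as a simplified illustration:

```python
import zlib

def choose_partition(key, num_partitions):
    """Map a record key to a partition deterministically.
    Same key -> same partition -> per-key ordering is preserved."""
    if key is None:
        # Real producers spread keyless records via round-robin or
        # sticky batching; we just pick partition 0 for simplicity.
        return 0
    return zlib.crc32(key.encode()) % num_partitions

p = choose_partition("user-42", 6)
assert p == choose_partition("user-42", 6)  # stable across calls
```

One design consequence worth mentioning in an interview: changing the number of partitions changes this mapping, which is why Kafka does not reshuffle existing data when partitions are added.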
Consumer Groups
A consumer group is a group of consumers that work together to consume a topic. Each partition is consumed by exactly one consumer in the group. This allows you to scale out the consumption of a topic by adding more consumers to the group.
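Kafka ships several assignment strategies for dividing partitions among a group (range, round-robin, sticky); a minimal round-robin sketch conveys the core invariant, that each partition is owned by exactly one consumer in the group:

```python
def assign_partitions(partitions, consumers):
    """Round-robin assignment: each partition goes to exactly one
    consumer; consumers beyond the partition count sit idle."""
    assignment = {c: [] for c in consumers}
    for i, p in enumerate(partitions):
        assignment[consumers[i % len(consumers)]].append(p)
    return assignment

print(assign_partitions([0, 1, 2, 3], ["c1", "c2"]))
# c1 and c2 each own two partitions
```

This also explains the scaling limit interviewers often probe: a group with more consumers than partitions leaves the extra consumers idle, so the partition count caps a group's parallelism.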
Durability and Reliability
Kafka is designed to be a highly durable and reliable system. Your interviewer will want to know how Kafka achieves this.
- Replication: Kafka replicates the partitions of a topic across multiple brokers. This means that if a broker fails, the data is still available on the other brokers. The number of replicas is configurable per topic.
- Leader Election: For each partition, one broker is elected as the leader. All writes and reads for that partition go through the leader. If the leader fails, one of the followers is automatically elected as the new leader.
- Acknowledgements: Producers can be configured to wait for an acknowledgement from the broker before considering a write successful. There are three levels of acknowledgement:
  - acks=0: The producer does not wait for any acknowledgement. This provides the lowest latency but the weakest durability guarantees.
  - acks=1: The producer waits for an acknowledgement from the partition leader only. (This was the producer default before Kafka 3.0; newer clients default to acks=all.)
  - acks=all: The producer waits for acknowledgement from all in-sync replicas. This provides the strongest durability guarantees at the cost of the highest latency.
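The trade-off between the three acks levels can be captured in a toy decision function (real Kafka also involves `min.insync.replicas` and retries, which this sketch deliberately omits):

```python
def producer_ack(acks, leader_ok, replica_oks):
    """Toy model of when a write is reported successful under each
    acks setting. leader_ok: leader persisted the record;
    replica_oks: persistence outcome for each in-sync follower."""
    if acks == "0":
        return True                 # fire-and-forget: never waits
    if acks == "1":
        return leader_ok            # only the leader must persist it
    if acks == "all":
        return leader_ok and all(replica_oks)  # every in-sync replica
    raise ValueError(f"unknown acks setting: {acks}")

# acks=1 reports success even if followers later lose the record:
print(producer_ack("1", leader_ok=True, replica_oks=[False, False]))
```

This is exactly the failure window to call out in an interview: with acks=1, a leader that crashes after acknowledging but before followers replicate can lose the write.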
How to Use Kafka in a System Design Interview
When you're in a system design interview, you should be able to articulate why you would choose Kafka over other message queues and how you would use it in your architecture.
Here are some key points to mention:
- High Throughput: Kafka is designed for high throughput. It can handle millions of messages per second, which makes it a great choice for applications that generate a large volume of data.
- Scalability: Kafka is a distributed system that can be scaled out by adding more brokers to the cluster.
- Durability and Reliability: Kafka is highly durable and reliable. Messages are persisted to disk and replicated within the cluster to prevent data loss.
- Real-time Processing: Kafka is a great choice for real-time data pipelines. You can use it to build applications that react to events as they happen.
- Decoupling: Kafka is a great way to decouple your services. Producers can publish messages to Kafka without knowing who the consumers are, and consumers can process messages without knowing who the producers are.
By discussing these points, you'll demonstrate to your interviewer that you have a solid understanding of Kafka and how to use it to build scalable, reliable, and real-time systems.
Example System Design Problems
Here are a few examples of system design problems where you might use Kafka:
- Design a Distributed Job Scheduler: Kafka can be used as the message queue to decouple the job creation service from the worker nodes. When a new job is created, it's published to a Kafka topic, and the worker nodes consume the jobs from the topic.
- Design a Chat App: Kafka can be used to deliver messages in a chat application. When a user sends a message, it's published to a Kafka topic, and the recipient's client consumes the message from the topic.
- Design a Real-time Analytics System: Kafka is a great choice for building real-time data pipelines. You can use it to ingest a high volume of data from multiple sources and then process it in real-time using a stream processing framework like Apache Flink or Spark Streaming.
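To give a flavor of the stream-processing step in the analytics example, here is a minimal tumbling-window count in plain Python. A real pipeline would do this in Flink or Spark Streaming over events consumed from Kafka; the event shape `(timestamp, key)` is purely illustrative:

```python
from collections import Counter

def tumbling_window_counts(events, window_secs=60):
    """Bucket (timestamp, key) events into fixed, non-overlapping
    windows and count occurrences of each key per window."""
    windows = {}
    for ts, key in events:
        bucket = ts - ts % window_secs  # start time of the window
        windows.setdefault(bucket, Counter())[key] += 1
    return windows

events = [(0, "page_view"), (30, "click"), (75, "page_view")]
print(tumbling_window_counts(events))
# two windows: [0, 60) holds the first two events, [60, 120) the third
```

In a production design, each worker in a consumer group would run this aggregation over its assigned partitions, which is why keying events by the aggregation dimension (e.g., page ID) matters: it keeps all events for one key on one worker.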