Design a Chat App (WhatsApp)
WhatsApp System Design Requirements
Functional Requirements
One-on-One Chat
Users should be able to send and receive messages in one-on-one conversations.
Group Chat
Users should be able to create and participate in group conversations.
Delivery Receipts
Users should be able to see when a message has been delivered and when it has been read.
Offline Messaging
Users should be able to receive messages even when they are offline.
Non-Functional Requirements
High Availability
The service must be highly available, with minimal downtime.
Low Latency
Messages should be delivered in near real-time.
Scalability
The system must be able to handle millions of concurrent users and billions of messages per day.
Durability
Messages should never be lost.
CAP Theorem Trade-offs
Trade-off Explanation:
For a chat application, we prioritize Availability and Partition Tolerance (AP). It's more important for users to be able to send and receive messages than to have strict consistency. Eventual consistency is acceptable, as a slight delay in message delivery is a better user experience than the service being unavailable.
Scale Estimates
For sizing, we'll work from the scalability requirement above: millions of concurrent users and billions of messages per day. A few billion messages per day averages out to tens of thousands of messages per second, with peaks several times higher.
API Design
The API for a chat application is different from a typical request-response model. While we'll have some standard HTTP endpoints for user management, the core of the messaging functionality will rely on a persistent connection between the client and the server. Your interviewer will expect you to discuss the trade-offs of different real-time communication protocols.
For the purposes of this section, we'll define the HTTP-based endpoints for user and chat management. The real-time messaging itself is handled over a WebSocket or gRPC stream rather than a standard request-response endpoint.
- Register a new user
- Create a new one-on-one or group chat
- Get the message history for a chat
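To make these concrete, here's a minimal sketch of what the three endpoints could look like as Express handlers. The paths, payload fields, and placeholder ids are illustrative assumptions rather than a fixed contract.

```typescript
import express from "express";

const app = express();
app.use(express.json());

// Register a new user (illustrative path and payload).
app.post("/v1/users", (req, res) => {
  const { username, phoneNumber } = req.body;
  // ...persist the user, then return its id
  res.status(201).json({ userId: "u_123", username, phoneNumber });
});

// Create a new one-on-one or group chat.
app.post("/v1/chats", (req, res) => {
  const { memberIds, name } = req.body; // name only applies to group chats
  res.status(201).json({ chatId: "c_456", memberIds, name });
});

// Get the (paginated) message history for a chat.
app.get("/v1/chats/:chatId/messages", (req, res) => {
  const { before, limit = "50" } = req.query; // cursor-based pagination
  res.json({ chatId: req.params.chatId, messages: [], nextCursor: null });
});

app.listen(8080);
```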
Database Schema
The database schema for a chat application needs to be optimized for fast writes (sending messages) and efficient reads (loading chat history). We'll need to store users, chats, messages, and the relationships between them.
We'll use a users table for user information, a chats table to store metadata about each conversation, a chat_members table to link users to chats, and a messages table to store the actual message content. The messages table will be partitioned by chat_id to ensure that all messages for a given chat are stored together for efficient retrieval.
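As a rough sketch, the record shapes for these four tables might look like the following; the specific columns beyond the ids mentioned above are assumptions for illustration.

```typescript
// Illustrative record shapes for the four tables; column names are assumptions.
interface User {
  userId: string;
  username: string;
  phoneNumber: string;
  createdAt: string; // ISO-8601 timestamp
}

interface Chat {
  chatId: string;
  type: "one_on_one" | "group";
  name?: string; // only meaningful for group chats
  createdAt: string;
}

interface ChatMember {
  chatId: string; // links a user to a chat (composite key: chatId + userId)
  userId: string;
  joinedAt: string;
}

interface Message {
  chatId: string;       // partition key: all messages for a chat live together
  messageId: string;    // time-ordered id (e.g. timestamp + sequence) for range reads
  senderId: string;
  body: string;
  sentAt: string;
  deliveredAt?: string; // populated for delivery receipts
  readAt?: string;      // populated for read receipts
}
```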
High-Level Architecture
A chat application's architecture is fundamentally different from a standard web application. It's a distributed system that relies on persistent connections and real-time message passing. Your interviewer will expect you to discuss the trade-offs of different real-time communication protocols and how you would ensure reliable message delivery.
Core Services
- Chat Service: This is the heart of our system. It's a stateful service that maintains a persistent connection with each online user. It's responsible for receiving messages from users and fanning them out to the other participants in a chat.
- Presence Service: This service is responsible for tracking the online status of each user. It will expose an API that the Chat Service can use to determine if a user is online and which server they are connected to.
- Push Notification Service: This service is responsible for sending push notifications to users who are offline. When the Chat Service receives a message for an offline user, it will forward it to this service, which will then send a push notification to the user's device via APNs or FCM.
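As an illustration of the Presence Service, here's a minimal sketch backed by Redis, where each connection heartbeats a key with a TTL so stale entries expire if a client or server crashes. The key naming, TTL, and function names are assumptions.

```typescript
import Redis from "ioredis";

const redis = new Redis();
const HEARTBEAT_TTL_SECONDS = 60; // presence expires if no heartbeat arrives in time

// Called when a user connects and then periodically while the connection is open.
export async function heartbeat(userId: string, serverId: string): Promise<void> {
  // Record which server owns the connection, with a TTL so crashed servers age out.
  await redis.set(`presence:${userId}`, serverId, "EX", HEARTBEAT_TTL_SECONDS);
}

// Called by the Chat Service to choose between real-time delivery and a push notification.
export async function lookupPresence(userId: string): Promise<{ online: boolean; serverId?: string }> {
  const serverId = await redis.get(`presence:${userId}`);
  return serverId ? { online: true, serverId } : { online: false };
}
```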
Deep Dive: Delivering Messages
The core of a chat application is its ability to deliver messages in real-time. This requires a persistent connection between the client and the server. Your interviewer will expect you to discuss the different technologies that can be used to achieve this.
WebSockets
Why: WebSockets provide a full-duplex communication channel over a single TCP connection. This is the most common and efficient way to implement real-time messaging on the web.
How it works: The client establishes a WebSocket connection with the server. This connection remains open, allowing the server to push messages to the client as soon as they are received. This is much more efficient than long polling, as it avoids the overhead of constantly creating new HTTP requests.
Trade-offs:
- Pros: Low latency, efficient use of resources, full-duplex communication.
- Cons: Not supported by some older browsers and proxies.
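Here's a minimal sketch of the server side of this using the Node ws library: one socket per online user, with messages fanned out to recipients connected to the same server. The message shape and the query-parameter authentication shortcut are assumptions.

```typescript
import { WebSocketServer, WebSocket } from "ws";

const wss = new WebSocketServer({ port: 8081 });

// Map of userId -> open socket, so the fan-out path can find online recipients.
const connections = new Map<string, WebSocket>();

wss.on("connection", (socket, req) => {
  // In practice the user would be authenticated during the HTTP upgrade;
  // here we assume the client passes its id as a query parameter.
  const userId = new URL(req.url ?? "/", "http://localhost").searchParams.get("userId")!;
  connections.set(userId, socket);

  socket.on("message", (raw) => {
    const msg = JSON.parse(raw.toString()); // assumed shape: { chatId, recipientIds, body }
    for (const recipientId of msg.recipientIds) {
      // Only recipients connected to *this* server can be reached directly;
      // others are handled by the cross-server delivery mechanisms below.
      connections.get(recipientId)?.send(
        JSON.stringify({ chatId: msg.chatId, from: userId, body: msg.body })
      );
    }
  });

  socket.on("close", () => connections.delete(userId));
});
```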
Long Polling
Why: Long polling is a technique that simulates a server push by having the client make a request to the server that is held open until a message is available. This is a good fallback for clients that don't support WebSockets.
How it works: The client makes an HTTP request to the server. The server holds this request open until it has a message to send to the client. Once the message is sent, the client immediately makes another request.
Trade-offs:
- Pros: Supported by all browsers, simpler to implement than WebSockets.
- Cons: Higher latency than WebSockets, less efficient use of resources.
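A minimal browser-side sketch of the long-polling loop, assuming a hypothetical same-origin /v1/poll endpoint that the server holds open until messages arrive or a timeout elapses.

```typescript
// Client-side long-polling loop (browser, or any runtime with a global fetch).
async function pollForMessages(userId: string): Promise<void> {
  while (true) {
    try {
      // The server holds this request open until it has messages or ~30s pass.
      const res = await fetch(`/v1/poll?userId=${userId}&timeoutSeconds=30`);
      if (res.status === 200) {
        const messages = await res.json();
        for (const msg of messages) {
          console.log("new message", msg);
        }
      }
      // A 204 response means the server timed out with nothing to deliver.
    } catch {
      // Back off briefly on network errors before reconnecting.
      await new Promise((resolve) => setTimeout(resolve, 1000));
    }
    // Immediately issue the next request so there is always one outstanding poll.
  }
}
```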
gRPC Streaming
Why: gRPC is a high-performance RPC framework that supports streaming. It's a great choice for mobile clients, as it's more efficient than WebSockets in terms of battery and network usage.
How it works: gRPC uses HTTP/2 for transport, which allows for bidirectional streaming over a single connection. The client and server can both send a stream of messages to each other. gRPC also uses Protocol Buffers for serialization, which is more efficient than JSON.
Trade-offs:
- Pros: Highly efficient, great for mobile clients, supports bidirectional streaming.
- Cons: Not natively supported by browsers (requires a proxy like Envoy), more complex to set up than WebSockets.
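A rough sketch of a bidirectional-streaming handler with @grpc/grpc-js; the chat.proto file, the ChatService name, and the MessageStream method are illustrative assumptions.

```typescript
import * as grpc from "@grpc/grpc-js";
import * as protoLoader from "@grpc/proto-loader";

// chat.proto, the ChatService name, and the MessageStream method are all assumed.
const packageDefinition = protoLoader.loadSync("chat.proto");
const chatPackage = grpc.loadPackageDefinition(packageDefinition) as any;

const server = new grpc.Server();
server.addService(chatPackage.chat.ChatService.service, {
  // Bidirectional streaming: the client streams outgoing messages up,
  // and the server streams incoming messages and acks back down.
  MessageStream(call: grpc.ServerDuplexStream<any, any>) {
    call.on("data", (msg) => {
      // Persist and fan out to other participants (not shown), then ack.
      call.write({ messageId: msg.messageId, status: "RECEIVED" });
    });
    call.on("end", () => call.end());
  },
});

server.bindAsync("0.0.0.0:50051", grpc.ServerCredentials.createInsecure(), () => {
  server.start();
});
```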
Deep Dive: Ensuring Users Connect to the Right Service
In a distributed chat system, a user can be connected to any one of many chat servers. When another user sends them a message, we need a way to find out which server they are connected to so we can deliver the message. Your interviewer will want to know how you would solve this service discovery problem.
Redis Pub/Sub
Why: Redis Pub/Sub is a simple and effective way to broadcast messages to all chat servers.
How it works: When a chat server receives a message, it publishes the message to a Redis channel that corresponds to the recipient's user_id. All chat servers are subscribed to all channels. When a server receives a message on a channel, it checks whether the recipient is connected to it. If so, it delivers the message.
Trade-offs:
- Pros: Simple to implement, very fast.
- Cons: Not very scalable, as every server has to process every message.
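A sketch of this pattern with the ioredis client, using one channel per recipient and a pattern subscription so every server hears every channel; the channel naming is an assumption.

```typescript
import Redis from "ioredis";

// Sockets for users connected to *this* server (see the WebSocket sketch earlier).
const connections = new Map<string, { send(data: string): void }>();

const pub = new Redis(); // publishing connection
const sub = new Redis(); // a subscribed connection can't run other commands, so keep it separate

// Every server subscribes to every per-user channel via a pattern subscription --
// simple, but each server sees every message, which is the scalability con above.
sub.psubscribe("user:*");

sub.on("pmessage", (_pattern, channel, payload) => {
  const recipientId = channel.slice("user:".length);
  connections.get(recipientId)?.send(payload); // deliver only if the recipient is connected here
});

// Sending side: after persisting a message, publish it to the recipient's channel.
export async function fanOut(recipientId: string, message: object): Promise<void> {
  await pub.publish(`user:${recipientId}`, JSON.stringify(message));
}
```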
Delivery Queue
Why: A more scalable approach is to use a message queue to deliver messages to the correct server.
How it works: We can use a service discovery mechanism (like Zookeeper or a simple Redis hash) to keep track of which server each user is connected to. When a chat server receives a message, it looks up the recipient's server in the service discovery system and then enqueues the message in a queue that is specific to that server. Each server has a consumer that polls its queue for new messages.
Trade-offs:
- Pros: Highly scalable, as each server only has to process the messages that are intended for it.
- Cons: More complex to implement, and the service discovery system becomes a critical dependency: if it goes down, messages can't be routed.
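A sketch of the queue-per-server approach using a Redis hash for service discovery and a Redis list per server as the delivery queue; the key names and SERVER_ID environment variable are assumptions.

```typescript
import Redis from "ioredis";

const redis = new Redis();
const SERVER_ID = process.env.SERVER_ID ?? "chat-server-1"; // this server's identity

// When a user connects, record which server owns their connection.
export async function registerConnection(userId: string): Promise<void> {
  await redis.hset("user_servers", userId, SERVER_ID);
}

// Sending side: look up the recipient's server and enqueue onto that server's queue.
export async function enqueueForRecipient(userId: string, message: object): Promise<void> {
  const serverId = await redis.hget("user_servers", userId);
  if (!serverId) return; // user offline -> handled by the push-notification path
  await redis.lpush(`queue:${serverId}`, JSON.stringify({ userId, message }));
}

// Each server runs a consumer loop that only sees messages addressed to its own users.
export async function consumeLoop(): Promise<void> {
  const blocking = new Redis(); // BRPOP blocks, so use a dedicated connection
  while (true) {
    const popped = await blocking.brpop(`queue:${SERVER_ID}`, 0); // 0 = block indefinitely
    if (!popped) continue;
    const { userId, message } = JSON.parse(popped[1]);
    // Deliver over the recipient's local WebSocket connection (not shown).
  }
}
```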
Deep Dive: Message Storage and Offline Notifications
Store-and-Forward
A key concept in any reliable messaging system is "store-and-forward." This means that when a message is sent, it is first stored in a durable data store before being forwarded to the recipient. This ensures that if the recipient is offline, or if there is a network issue, the message will not be lost. Once the recipient comes back online, they can retrieve the message from the data store.
For our message store, we'll use a NoSQL database like DynamoDB, partitioned by chat_id for efficient retrieval of a chat's message history. All messages will be encrypted at rest to ensure user privacy.
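A sketch of the store-and-forward write with the AWS SDK v3 DynamoDB client, assuming a Messages table keyed by chat_id (partition) and message_id (sort); the table and attribute names are assumptions.

```typescript
import { DynamoDBClient, PutItemCommand } from "@aws-sdk/client-dynamodb";

const dynamo = new DynamoDBClient({});

// Persist the message before attempting delivery (store-and-forward).
export async function storeMessage(
  chatId: string,
  messageId: string,
  senderId: string,
  body: string
): Promise<void> {
  await dynamo.send(
    new PutItemCommand({
      TableName: "Messages",          // assumed table name
      Item: {
        chat_id: { S: chatId },       // partition key: keeps a chat's messages together
        message_id: { S: messageId }, // sort key: time-ordered id for history reads
        sender_id: { S: senderId },
        body: { S: body },            // encrypted at rest via the table's encryption settings
        sent_at: { S: new Date().toISOString() },
      },
    })
  );
}
```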
When a message is sent to an offline user, we need to send them a push notification to let them know they have a new message. We can use a message queue like SQS to handle the delivery of these notifications. When the chat service receives a message for an offline user, it will enqueue a job in the push notification queue. A separate worker service will then dequeue the job and send the push notification via APNs (for iOS) or FCM (for Android).
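A sketch of the offline-notification path with the AWS SDK v3 SQS client: the chat service enqueues a job, and a worker polls the queue and hands each job off to APNs or FCM (provider calls and device-token lookup omitted). The queue URL and job fields are assumptions.

```typescript
import {
  SQSClient,
  SendMessageCommand,
  ReceiveMessageCommand,
  DeleteMessageCommand,
} from "@aws-sdk/client-sqs";

const sqs = new SQSClient({});
const QUEUE_URL = process.env.PUSH_QUEUE_URL!; // assumed to point at the push-notification queue

// Chat service side: enqueue a job when the recipient is offline.
export async function enqueuePush(userId: string, chatId: string, preview: string): Promise<void> {
  await sqs.send(
    new SendMessageCommand({
      QueueUrl: QUEUE_URL,
      MessageBody: JSON.stringify({ userId, chatId, preview }),
    })
  );
}

// Worker side: long-poll the queue and send each notification via APNs/FCM (not shown).
export async function pushWorker(): Promise<void> {
  while (true) {
    const { Messages = [] } = await sqs.send(
      new ReceiveMessageCommand({ QueueUrl: QUEUE_URL, WaitTimeSeconds: 20, MaxNumberOfMessages: 10 })
    );
    for (const m of Messages) {
      const job = JSON.parse(m.Body!);
      // sendToApnsOrFcm(job) -- provider call omitted
      await sqs.send(new DeleteMessageCommand({ QueueUrl: QUEUE_URL, ReceiptHandle: m.ReceiptHandle! }));
    }
  }
}
```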
Complete Design
Now that we've covered all the major components individually, let's look at how everything fits together in our complete WhatsApp system design, following the end-to-end flow from a user sending a message to it being delivered to the recipient.
The complete architecture demonstrates how clients maintain a persistent connection to the Chat Service, which is responsible for fanning out messages to other users. The Presence Service tracks online status, and the Push Notification Service handles offline delivery. This design ensures a reliable, scalable, and real-time messaging experience.