Design Instagram

Medium

Design a simplified version of Instagram. The service should allow users to post photos and videos, follow other users, and view a timeline of posts from the users they follow. The system must be highly available and scalable to handle millions of users and a high volume of media uploads and views.

Variants: Facebook News Feed, Pinterest, TikTok

Instagram System Design Requirements

Functional Requirements

Post Photos and Videos

Users should be able to upload and share photos and videos with their followers.

Follow Users

Users should be able to follow other users to see their posts in their timeline.

View Timeline

Users should be able to view a reverse-chronological feed (newest first) of posts from the users they follow.

Like and Comment on Posts

Users should be able to like and comment on posts.

Non-Functional Requirements

High Availability

The service must be highly available, with minimal downtime for both posting and viewing content.

Low Latency

The timeline should load in under 200ms, and media should start streaming quickly.

Scalability

The system must scale to handle millions of concurrent users and a massive volume of media uploads.

Durability

Uploaded photos and videos must never be lost.

CAP Theorem Trade-offs

Trade-off Explanation:

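For a media-sharing platform like Instagram, we prioritize Availability and Partition Tolerance (AP): when a network partition occurs, the system keeps accepting uploads and serving reads rather than rejecting requests to preserve strict consistency. It's more important for users to be able to upload and view content than to have strict, real-time consistency, so we accept eventual consistency. A slight delay before a new post appears on a follower's timeline is an acceptable trade-off for a highly available system.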

Scale Estimates

User Base: 1B DAU (10^9)

Base assumption for system sizing

Media Uploads/Day: 100M uploads (10^8)

The number of new photos and videos our system must handle daily.

Timeline Reads/Second: 1M reads

Average Media Size: 1MB

Daily Data Ingestion: 100TB

Yearly Data Growth: 36.5PB
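
The derived figures above follow directly from the base assumptions. The short calculation below is a sanity check only. It assumes decimal units (1 TB = 10^6 MB) and uploads spread evenly across the day, and it counts one stored copy of each upload; replication and additional transcoded versions would multiply the totals.

```python
# Back-of-envelope sizing from the assumptions listed above.
# Assumed units: 1 TB = 10**6 MB, 1 PB = 10**3 TB.
UPLOADS_PER_DAY = 100_000_000   # 10^8 uploads per day
AVG_MEDIA_SIZE_MB = 1           # average media size
SECONDS_PER_DAY = 24 * 60 * 60

daily_ingest_tb = UPLOADS_PER_DAY * AVG_MEDIA_SIZE_MB / 10**6
yearly_growth_pb = daily_ingest_tb * 365 / 10**3
avg_uploads_per_sec = UPLOADS_PER_DAY / SECONDS_PER_DAY

print(f"Daily ingestion:     {daily_ingest_tb:.0f} TB")
print(f"Yearly growth:       {yearly_growth_pb:.1f} PB")
print(f"Average upload rate: {avg_uploads_per_sec:.0f} uploads/s")
```

The average of roughly 1,157 uploads per second is a floor for capacity planning; peak traffic is likely to be noticeably higher.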

API Design

The API for our Instagram-like service will need to handle media uploads, user interactions (following), and timeline generation. Your interviewer will expect you to define clear endpoints for these core features.

The POST /api/v1/posts endpoint will handle the creation of new posts, including the upload of media. The POST /api/v1/follow endpoint will manage user relationships. The GET /api/v1/timeline endpoint will be a critical, read-heavy endpoint that constructs and returns a user's personalized feed.

POST /api/v1/posts

Create a new post with an image or video

POST /api/v1/follow

Follow another user

GET /api/v1/timeline

Get the user's timeline feed
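
One way to make these endpoints concrete is to write down the payload shapes. The sketch below is illustrative only: the field names reuse the columns of the posts table where possible, while `next_cursor`, the ISO 8601 timestamp format, the example URL, and the idea that the caller's identity comes from an auth token are all assumptions that the text above does not specify.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List, Optional


@dataclass
class CreatePostRequest:
    """Metadata for POST /api/v1/posts (media bytes assumed to be sent alongside)."""
    media_type: str          # "IMAGE" or "VIDEO"
    caption: str = ""


@dataclass
class FollowRequest:
    """Body for POST /api/v1/follow; the follower is assumed to come from the auth token."""
    followee_id: str


@dataclass
class TimelineItem:
    post_id: str
    user_id: str
    media_type: str
    media_url: str
    caption: str
    created_at: str          # assumed wire format: ISO 8601 string


@dataclass
class TimelineResponse:
    """Body returned by GET /api/v1/timeline."""
    items: List[TimelineItem] = field(default_factory=list)
    next_cursor: Optional[str] = None   # hypothetical pagination token


response = TimelineResponse(
    items=[TimelineItem("p1", "u42", "IMAGE", "https://cdn.example.com/p1.jpg",
                        "Sunset", "2024-01-01T12:00:00Z")],
    next_cursor="p1",
)
print(json.dumps(asdict(response), indent=2))
```

Returning a cursor rather than a page number keeps pagination stable while new posts keep arriving at the head of the feed.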

Database Schema

The database schema for Instagram needs to efficiently manage users, posts (including media), and the social graph. Your interviewer will be interested in how you model these relationships for both fast writes (posting) and fast reads (timeline generation).

We'll use tables for users, posts, and follows. The posts table will store metadata about the media, while the actual media files will be stored in a blob store like S3. The follows table is crucial for constructing the timeline and represents the social graph.

users

user_id: string (Partition Key)
username: string
profile_picture_url: string
bio: string
created_at: timestamp

posts

post_id: string (Partition Key)
user_id: string
media_type: string (IMAGE or VIDEO)
media_url: string
caption: string
created_at: timestamp

follows

follower_id: string (Partition Key)
followee_id: string (Sort Key)
created_at: timestamp
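
The key choice on the follows table determines which social-graph queries are cheap. The in-memory sketch below is a stand-in, not a real database client: `FollowsTable` and its methods are hypothetical names. It shows that partitioning by follower_id makes "who does this user follow?" a single-partition read, while "who follows this user?" (needed by the push model in the Fan-out on Write section) needs a second, reverse index. That reverse index is an assumption added here, not part of the schema above.

```python
from collections import defaultdict
from typing import Dict, List, Set


class FollowsTable:
    """In-memory stand-in for the follows table (hypothetical helper)."""

    def __init__(self) -> None:
        # Primary layout: partition key follower_id -> followee_id values (sort key).
        self._by_follower: Dict[str, Set[str]] = defaultdict(set)
        # Assumed reverse index: followee_id -> follower_id values.
        self._by_followee: Dict[str, Set[str]] = defaultdict(set)

    def follow(self, follower_id: str, followee_id: str) -> None:
        self._by_follower[follower_id].add(followee_id)
        self._by_followee[followee_id].add(follower_id)

    def following(self, follower_id: str) -> List[str]:
        """Single-partition read on the primary layout (used by the pull model)."""
        return sorted(self._by_follower.get(follower_id, set()))

    def followers(self, followee_id: str) -> List[str]:
        """Read served by the reverse index (used by the push model)."""
        return sorted(self._by_followee.get(followee_id, set()))


table = FollowsTable()
table.follow("bob", "alice")
table.follow("carol", "alice")
table.follow("carol", "bob")
print(table.following("carol"))   # ['alice', 'bob']
print(table.followers("alice"))   # ['bob', 'carol']
```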

High-Level Architecture

At a high level, our Instagram-like service will be composed of several microservices. This architecture allows for independent scaling of different parts of the system, which is crucial for handling the different traffic patterns of media uploads, timeline generation, and user interactions.

Core Services

Media Service: This service is responsible for handling all media uploads. When a user uploads an image or video, this service will process it, generate different versions and formats, and store them in a blob store like S3. It will then notify other services that the media is ready to be used in a post.
Timeline Service: This is a read-heavy service responsible for generating the user's timeline. It will use a combination of push and pull models to construct the feed, ensuring that it is both up-to-date and performant.
User Service: This service manages user accounts, profiles, and the social graph. It handles follow and unfollow operations and provides user information to other services.

Deep Dive: Timeline Generation (The Celebrity Problem)

Generating a user's timeline is a core challenge in any social media system design. Your interviewer will expect you to discuss the trade-offs between different approaches, especially how to handle users with millions of followers (the "celebrity problem").

Fan-out on Write (Push Model)

Why: This approach pre-computes the timeline for each user, making reads very fast. When a user posts, the post is pushed to the timelines of all their followers.

How it works: When a user creates a post, a background job is triggered. This job retrieves the list of the user's followers and inserts the new post's ID into a timeline data structure (e.g., a Redis list) for each follower. When a user requests their timeline, we simply read this pre-computed list.

Trade-offs:

Pros: Extremely fast timeline reads, which is the most frequent operation for most users.
Cons: If a celebrity with 50 million followers posts, we have to perform 50 million writes, which can be slow and resource-intensive. This is the "celebrity problem." It can also be inefficient if many followers are inactive.
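
The push model can be sketched in a few lines. This is a minimal in-memory illustration, assuming a plain dictionary of deques in place of per-user Redis lists; `MAX_TIMELINE_LENGTH` and the sample users are made up for the example.

```python
from collections import defaultdict, deque
from typing import Deque, Dict, List

MAX_TIMELINE_LENGTH = 500  # assumed cap on stored post IDs per user

# author_id -> list of follower IDs (sample data)
followers: Dict[str, List[str]] = {"alice": ["bob", "carol"], "bob": ["carol"]}

# user_id -> pre-computed timeline of post IDs, newest first
timelines: Dict[str, Deque[str]] = defaultdict(
    lambda: deque(maxlen=MAX_TIMELINE_LENGTH)
)


def fan_out_on_write(author_id: str, post_id: str) -> int:
    """Push post_id onto every follower's timeline; return the number of writes."""
    writes = 0
    for follower_id in followers.get(author_id, []):
        # Newest first; the oldest entry is dropped once the cap is reached.
        timelines[follower_id].appendleft(post_id)
        writes += 1
    return writes


def read_timeline(user_id: str, limit: int = 20) -> List[str]:
    """Reading is just a slice of the pre-computed list."""
    return list(timelines[user_id])[:limit]


print(fan_out_on_write("alice", "p1"))  # 2 writes, one per follower
print(fan_out_on_write("bob", "p2"))    # 1 write
print(read_timeline("carol"))           # ['p2', 'p1']
```

The write count returned by `fan_out_on_write` is the cost that grows with follower count: for an author with 50 million followers the loop would run 50 million times.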

Fan-out on Read (Pull Model)

Why: This approach generates the timeline on-demand, which is simpler and avoids the celebrity problem.

How it works: When a user requests their timeline, we first get the list of people they follow. Then, we query for the recent posts from all of those people, merge them, and sort by time.

Trade-offs:

Pros: No "celebrity problem" on the write path. Simpler to implement.
Cons: Timeline generation can be slow for users who follow many people, because every request must fetch recent posts from each followed account and merge them by time.
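
The pull model amounts to a k-way merge at read time. The sketch below assumes each author's recent posts are already available newest first (here as in-memory lists with made-up data) and uses the standard library's `heapq.merge` to combine them.

```python
import heapq
from itertools import islice
from typing import Dict, List, Tuple

Post = Tuple[int, str]  # (created_at as epoch seconds, post_id)

# user_id -> accounts they follow (sample data)
following: Dict[str, List[str]] = {"carol": ["alice", "bob"]}

# author_id -> recent posts, newest first (sample data)
posts_by_author: Dict[str, List[Post]] = {
    "alice": [(300, "a2"), (100, "a1")],
    "bob": [(200, "b1")],
}


def fan_out_on_read(user_id: str, limit: int = 20) -> List[str]:
    """Build the timeline on demand by merging each followed author's posts."""
    per_author = [posts_by_author.get(a, []) for a in following.get(user_id, [])]
    # Inputs are sorted newest first, so merge in descending timestamp order.
    merged = heapq.merge(*per_author, key=lambda post: post[0], reverse=True)
    return [post_id for _, post_id in islice(merged, limit)]


print(fan_out_on_read("carol"))  # ['a2', 'b1', 'a1']
```

The number of per-author lists fetched and merged grows with the number of accounts followed, which is where the read latency comes from.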

Hybrid Approach: The Best of Both Worlds

In a real-world system like Instagram, a hybrid approach is the most effective solution. Your interviewer will be impressed if you suggest this, as it shows a deeper understanding of the trade-offs.

We can use the push model for the vast majority of users who have a reasonable number of followers. For celebrities, we switch to a pull model: their posts are not fanned out at write time. When a user requests their timeline, we merge their pre-computed feed with the latest posts from any celebrities they follow. This provides fast timelines for most users while avoiding the write-time bottleneck of the celebrity problem.
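
A rough sketch of the hybrid read path follows. The follower-count cut-off `CELEBRITY_THRESHOLD` and all sample data are assumptions for illustration; the text does not say where the line between regular users and celebrities should be drawn.

```python
import heapq
from itertools import islice
from typing import Dict, List, Tuple

Post = Tuple[int, str]  # (created_at as epoch seconds, post_id)

CELEBRITY_THRESHOLD = 1_000_000  # assumed cut-off, tune per system

follower_counts: Dict[str, int] = {"alice": 120, "star": 50_000_000}
following: Dict[str, List[str]] = {"carol": ["alice", "star"]}

# Filled at write time by the push model (regular authors only), newest first.
precomputed_feed: Dict[str, List[Post]] = {"carol": [(250, "a1")]}

# Celebrity posts are not fanned out; they are pulled at read time, newest first.
recent_posts: Dict[str, List[Post]] = {"star": [(300, "s2"), (200, "s1")]}


def is_celebrity(user_id: str) -> bool:
    return follower_counts.get(user_id, 0) >= CELEBRITY_THRESHOLD


def hybrid_timeline(user_id: str, limit: int = 20) -> List[str]:
    """Merge the pre-computed feed with posts pulled from followed celebrities."""
    celebrity_feeds = [
        recent_posts.get(author, [])
        for author in following.get(user_id, [])
        if is_celebrity(author)
    ]
    merged = heapq.merge(
        precomputed_feed.get(user_id, []),
        *celebrity_feeds,
        key=lambda post: post[0],
        reverse=True,
    )
    return [post_id for _, post_id in islice(merged, limit)]


print(hybrid_timeline("carol"))  # ['s2', 'a1', 's1']
```

Most users follow only a handful of celebrities, so the read-time merge stays small even though the pull path is used.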

Deep Dive: Media Transcoding

After a user uploads a photo or video, it's crucial to process it into various formats and sizes. This ensures that the media can be delivered efficiently to a wide range of devices and network conditions. Your interviewer will expect you to discuss how you would build a reliable and scalable transcoding pipeline.

The Role of DAGs in Media Processing

Media transcoding is not a single step, but a complex workflow of dependent tasks. For a video, this might include extracting audio, generating multiple video resolutions (e.g., 480p, 720p, 1080p), creating thumbnails, and then packaging everything for adaptive bitrate streaming. These tasks form a Directed Acyclic Graph (DAG), where some tasks must complete before others can begin.

Using a workflow orchestration engine that understands DAGs is essential for managing this complexity. It allows us to define the dependencies between tasks, handle retries for individual steps, and monitor the progress of the entire workflow. This is a key concept to bring up in your interview to show you've thought about the reliability of the system.
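
To make the dependency structure concrete, the sketch below models the video tasks named above as a DAG and uses Python's standard-library `graphlib.TopologicalSorter` to group them into waves that could run in parallel. The task names and the exact edges are illustrative assumptions; a real orchestrator would add retries, resource limits, and monitoring on top of this ordering.

```python
from graphlib import TopologicalSorter
from typing import Dict, List, Set

# task -> set of tasks that must finish first (illustrative edges)
transcode_dag: Dict[str, Set[str]] = {
    "extract_audio": set(),
    "transcode_480p": set(),
    "transcode_720p": set(),
    "transcode_1080p": set(),
    "thumbnails": set(),
    "package_abr": {
        "extract_audio", "transcode_480p", "transcode_720p", "transcode_1080p",
    },
    "notify_ready": {"package_abr", "thumbnails"},
}


def run_in_waves(dag: Dict[str, Set[str]]) -> List[List[str]]:
    """Return tasks grouped into waves; tasks within a wave have no mutual dependencies."""
    sorter = TopologicalSorter(dag)
    sorter.prepare()  # raises graphlib.CycleError if the graph has a cycle
    waves: List[List[str]] = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())
        waves.append(ready)      # a real engine would dispatch these to workers
        sorter.done(*ready)      # mark finished so dependents become ready
    return waves


for number, wave in enumerate(run_in_waves(transcode_dag), start=1):
    print(number, wave)
```

With these edges, the audio extraction, the three renditions, and the thumbnails form the first wave, packaging is the second, and the readiness notification is the last.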

Apache Airflow with Kubernetes Executor

Why: Airflow is a mature, battle-tested workflow orchestration platform that is excellent for managing complex DAGs like our transcoding pipeline.

How it works: We can define our transcoding workflow as an Airflow DAG. When a new media file is uploaded, it triggers a new DAG run. The Kubernetes executor allows Airflow to dynamically spin up a new pod for each task with the specific resources it needs (e.g., a GPU-enabled pod for video transcoding).

Trade-offs:

Pros: Powerful and flexible, with a large community and many pre-built integrations. Excellent for managing complex dependencies and scheduling.
Cons: Can have significant operational overhead. Airflow is designed around scheduled, batch-style DAG runs, so launching one run per uploaded file may not be the best fit for a high-volume, event-driven workflow.

Temporal Workflow Engine

Why: Temporal is a more modern, code-first workflow engine that is particularly well-suited for long-running, stateful workflows like media transcoding.

How it works: With Temporal, the workflow is written as ordinary code: a workflow function that calls activities, such as the individual transcoding steps. Temporal's durable execution model records the workflow's progress, so if a worker crashes during a long transcoding job, another worker can resume the workflow from the last recorded step instead of starting over.

Trade-offs:

Pros: More developer-friendly programming model. Excellent for long-running, stateful workflows. Built-in support for retries and error handling.
Cons: Newer technology with a smaller ecosystem than Airflow. Requires learning a new programming model.
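
The idea of resuming from recorded progress can be illustrated without any workflow engine. The sketch below is plain Python and does not use the Temporal SDK or mirror its API; `run_workflow`, the step functions, and the in-memory `history` dictionary (which a real engine would persist) are all hypothetical.

```python
from typing import Callable, Dict, List, Tuple

Step = Tuple[str, Callable[[], str]]


def run_workflow(steps: List[Step], history: Dict[str, str]) -> Dict[str, str]:
    """Run steps in order, skipping any step whose result is already in history."""
    for name, action in steps:
        if name in history:
            continue              # finished before the crash; do not redo it
        history[name] = action()  # recorded only after the step succeeds
    return history


calls: List[str] = []
crash_once = {"armed": True}


def extract_audio() -> str:
    calls.append("extract_audio")
    return "audio.aac"


def transcode_720p() -> str:
    calls.append("transcode_720p")
    if crash_once["armed"]:
        crash_once["armed"] = False
        raise RuntimeError("simulated worker crash")
    return "720p.mp4"


steps: List[Step] = [("extract_audio", extract_audio), ("transcode_720p", transcode_720p)]
history: Dict[str, str] = {}  # stand-in for durable, persisted state

try:
    run_workflow(steps, history)      # first attempt crashes in the second step
except RuntimeError:
    pass

run_workflow(steps, history)          # second attempt resumes from the history
print(calls)    # ['extract_audio', 'transcode_720p', 'transcode_720p']
print(history)  # {'extract_audio': 'audio.aac', 'transcode_720p': '720p.mp4'}
```

The audio extraction runs once even though the workflow was started twice; only the interrupted step is repeated.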

Complete Design

Now that we've covered all the major components individually, let's look at how everything fits together in our complete Instagram system design. This diagram shows the end-to-end flow from a user posting a photo to it appearing in their followers' timelines.

[Diagram: end-to-end flow from a user posting a photo to it appearing in their followers' timelines]

The complete architecture demonstrates how the Media Service, Timeline Service, and User Service work together to provide a scalable and reliable Instagram-like experience. The use of a hybrid timeline generation model and a dedicated media processing pipeline ensures that the system can handle millions of users and a high volume of media uploads and views.