Design Instagram
Instagram System Design Requirements
Functional Requirements
Post Photos and Videos
Users should be able to upload and share photos and videos with their followers.
Follow Users
Users should be able to follow other users to see their posts in their timeline.
View Timeline
Users should be able to view a reverse-chronological feed (newest first) of posts from the users they follow.
Like and Comment on Posts
Users should be able to like and comment on posts.
Non-Functional Requirements
High Availability
The service must be highly available, with minimal downtime for both posting and viewing content.
Low Latency
The timeline should load in under 200ms, and media should start streaming quickly.
Scalability
The system must scale to handle millions of concurrent users and a massive volume of media uploads.
Durability
Uploaded photos and videos must never be lost.
CAP Theorem Trade-offs
Trade-off Explanation:
For a media-sharing platform like Instagram, we prioritize Availability and Partition Tolerance (AP). It's more important for users to be able to upload and view content than to have strict, real-time consistency. A slight delay in a new post appearing on a follower's timeline is an acceptable trade-off for a highly available system.
Scale Estimates
API Design
The API for our Instagram-like service will need to handle media uploads, user interactions (following), and timeline generation. Your interviewer will expect you to define clear endpoints for these core features.
The POST /api/v1/posts endpoint will handle the creation of new posts, including the upload of media. The POST /api/v1/follow endpoint will manage user relationships. The GET /api/v1/timeline endpoint will be a critical, read-heavy endpoint that constructs and returns a user's personalized feed.
- POST /api/v1/posts: Create a new post with an image or video
- POST /api/v1/follow: Follow another user
- GET /api/v1/timeline: Get the user's timeline feed
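To make the request and response shapes concrete, here is a minimal in-memory sketch of the three endpoints. It is only an illustration: the class name, the field names (post_id, media_url, caption), the status codes, and the limit parameter are all assumptions, and a real service would sit behind an HTTP framework with authentication.

```python
from dataclasses import dataclass, field
from itertools import count


@dataclass
class InMemoryApi:
    """Toy stand-in for the three endpoints; every field name is an assumption."""
    posts: list = field(default_factory=list)
    follows: set = field(default_factory=set)  # (follower_id, followee_id) pairs
    _ids: count = field(default_factory=lambda: count(1))

    def create_post(self, user_id: int, media_url: str, caption: str = "") -> dict:
        # POST /api/v1/posts
        post = {"post_id": next(self._ids), "user_id": user_id,
                "media_url": media_url, "caption": caption}
        self.posts.append(post)
        return {"status": 201, "body": post}

    def follow(self, follower_id: int, followee_id: int) -> dict:
        # POST /api/v1/follow
        if follower_id == followee_id:
            return {"status": 400, "body": {"error": "cannot follow yourself"}}
        self.follows.add((follower_id, followee_id))
        return {"status": 200, "body": {"following": followee_id}}

    def timeline(self, user_id: int, limit: int = 20) -> dict:
        # GET /api/v1/timeline?limit=20
        followed = {f for (u, f) in self.follows if u == user_id}
        # Posts are appended in creation order, so reversing gives newest first.
        feed = [p for p in reversed(self.posts) if p["user_id"] in followed]
        return {"status": 200, "body": {"posts": feed[:limit]}}
```

In an interview, the point to draw out is that the timeline response is paginated (here through limit) and returns post metadata with media URLs rather than the media bytes themselves.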
Database Schema
The database schema for Instagram needs to efficiently manage users, posts (including media), and the social graph. Your interviewer will be interested in how you model these relationships for both fast writes (posting) and fast reads (timeline generation).
We'll use tables for users, posts, and follows. The posts table will store metadata about the media, while the actual media files will be stored in a blob store like S3. The follows table is crucial for constructing the timeline and represents the social graph.
users
posts
follows
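One possible shape for these three tables is sketched below using SQLite from Python's standard library, so it can be run as-is. The column names, types, and indexes are assumptions chosen for illustration; a production system would more likely use a sharded relational or wide-column store.

```python
import sqlite3

# Hypothetical columns; only the table names come from the design above.
SCHEMA = """
CREATE TABLE users (
    user_id    INTEGER PRIMARY KEY,
    username   TEXT NOT NULL UNIQUE,
    created_at TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP
);
CREATE TABLE posts (
    post_id    INTEGER PRIMARY KEY,
    user_id    INTEGER NOT NULL REFERENCES users(user_id),
    media_key  TEXT NOT NULL,   -- object key in the blob store, not the bytes
    caption    TEXT,
    created_at TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP
);
-- Supports "recent posts by this author", used when building timelines.
CREATE INDEX idx_posts_user_time ON posts(user_id, created_at DESC);
CREATE TABLE follows (
    follower_id INTEGER NOT NULL REFERENCES users(user_id),
    followee_id INTEGER NOT NULL REFERENCES users(user_id),
    created_at  TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP,
    PRIMARY KEY (follower_id, followee_id)
);
-- Supports "who follows this author", used by fan-out on write.
CREATE INDEX idx_follows_followee ON follows(followee_id);
"""


def open_db() -> sqlite3.Connection:
    """Create an in-memory database with the sketch schema applied."""
    conn = sqlite3.connect(":memory:")
    conn.execute("PRAGMA foreign_keys = ON")
    conn.executescript(SCHEMA)
    return conn
```

The composite primary key on follows answers "whom does this user follow" and prevents duplicate follow rows, while the extra index on followee_id answers the reverse question, "who follows this user".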
High-Level Architecture
At a high level, our Instagram-like service will be composed of several microservices. This architecture allows for independent scaling of different parts of the system, which is crucial for handling the different traffic patterns of media uploads, timeline generation, and user interactions.
Core Services
- Media Service: This service is responsible for handling all media uploads. When a user uploads an image or video, this service will process it, generate different versions and formats, and store them in a blob store like S3. It will then notify other services that the media is ready to be used in a post.
- Timeline Service: This is a read-heavy service responsible for generating the user's timeline. It will use a combination of push and pull models to construct the feed, ensuring that it is both up-to-date and performant.
- User Service: This service manages user accounts, profiles, and the social graph. It handles follow and unfollow operations and provides user information to other services.
Deep Dive: Timeline Generation (The Celebrity Problem)
Generating a user's timeline is a core challenge in any social media system design. Your interviewer will expect you to discuss the trade-offs between different approaches, especially how to handle users with millions of followers (the "celebrity problem").
Fan-out on Write (Push Model)
Why: This approach pre-computes the timeline for each user, making reads very fast. When a user posts, the post is pushed to the timelines of all their followers.
How it works: When a user creates a post, a background job is triggered. This job retrieves the list of the user's followers and inserts the new post's ID into a timeline data structure (e.g., a Redis list) for each follower. When a user requests their timeline, we simply read this pre-computed list.
Trade-offs:
- Pros: Extremely fast timeline reads, which is the most frequent operation for most users.
- Cons: If a celebrity with 50 million followers posts, we have to perform 50 million writes, which can be slow and resource-intensive. This is the "celebrity problem." It can also be inefficient if many followers are inactive.
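The push model can be sketched in a few lines. This is a minimal in-memory illustration, not a production design: the dictionaries stand in for the follower store and the per-user Redis lists, and TIMELINE_LIMIT is an assumed cap. With Redis, the equivalent of the bounded deque would be roughly an LPUSH followed by an LTRIM on each follower's list.

```python
from collections import defaultdict, deque

TIMELINE_LIMIT = 500  # assumed cap on post IDs kept per user

followers = defaultdict(set)  # author_id -> set of follower ids
timelines = defaultdict(lambda: deque(maxlen=TIMELINE_LIMIT))  # user_id -> post ids


def follow(follower_id, author_id):
    followers[author_id].add(follower_id)


def fan_out_post(author_id, post_id):
    """Push post_id onto every follower's timeline; return the number of writes."""
    writes = 0
    for follower_id in followers[author_id]:
        # Newest first; once the deque is full, the oldest entry falls off the end.
        timelines[follower_id].appendleft(post_id)
        writes += 1
    return writes


def read_timeline(user_id, limit=20):
    """Reading is a slice of the precomputed list, with no joins or sorting."""
    return list(timelines[user_id])[:limit]
```

The return value of fan_out_post makes the cost visible: the number of writes equals the author's follower count, which is exactly what becomes a problem for a 50-million-follower account.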
Fan-out on Read (Pull Model)
Why: This approach generates the timeline on-demand, which is simpler and avoids the celebrity problem.
How it works: When a user requests their timeline, we first get the list of people they follow. Then, we query for the recent posts from all of those people, merge them, and sort by time.
Trade-offs:
- Pros: No "celebrity problem" on the write path. Simpler to implement.
- Cons: Timeline generation can be slow for users who follow many people, as every request requires a multi-get query across all followed accounts and a merge of the results.
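The pull model's read path can be sketched with a k-way merge. The example assumes each author's posts are already stored newest first as (timestamp, post_id) tuples; the function name and data layout are hypothetical.

```python
import heapq
from itertools import islice


def pull_timeline(following, posts_by_author, limit=20):
    """Merge the per-author post lists of everyone in `following`, newest first.

    posts_by_author maps author_id -> list of (timestamp, post_id),
    each list already sorted newest first (an assumption of this sketch).
    """
    streams = [posts_by_author.get(author, []) for author in following]
    # heapq.merge is lazy, so only about `limit` items are pulled from the streams.
    merged = heapq.merge(*streams, reverse=True)
    return [post_id for _, post_id in islice(merged, limit)]
```

The work per request grows with the number of accounts followed, since one stream per followed author has to be fetched and merged, which is the read-side cost described above.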
Hybrid Approach: The Best of Both Worlds
In a real-world system like Instagram, a hybrid approach is the most effective solution. Your interviewer will be impressed if you suggest this, as it shows a deeper understanding of the trade-offs.
We can use the push model for the vast majority of users who have a reasonable number of followers. For celebrities, we can switch to a pull model. When a regular user requests their timeline, we merge their pre-computed feed with the latest posts from any celebrities they follow. This provides fast timelines for most users while avoiding the write-time bottleneck of the celebrity problem.
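A minimal sketch of the hybrid model follows. The class, its method names, and the CELEBRITY_THRESHOLD value are assumptions for illustration; the sketch also assumes timestamps increase with each publish call, so inserting at the front keeps every list newest first.

```python
import heapq

CELEBRITY_THRESHOLD = 10_000  # assumed follower count at which fan-out is skipped


class HybridFeed:
    def __init__(self, threshold=CELEBRITY_THRESHOLD):
        self.threshold = threshold
        self.followers = {}  # author -> set of users following them
        self.following = {}  # user -> set of authors they follow
        self.posts = {}      # author -> [(ts, post_id)], newest first
        self.pushed = {}     # user -> [(ts, post_id)], newest first (precomputed feed)

    def follow(self, user, author):
        self.followers.setdefault(author, set()).add(user)
        self.following.setdefault(user, set()).add(author)

    def is_celebrity(self, author):
        return len(self.followers.get(author, ())) >= self.threshold

    def publish(self, author, ts, post_id):
        """Store the post; fan out unless the author is a celebrity. Returns writes."""
        self.posts.setdefault(author, []).insert(0, (ts, post_id))
        if self.is_celebrity(author):
            return 0  # followers pull this author's posts at read time
        fans = self.followers.get(author, ())
        for user in fans:
            self.pushed.setdefault(user, []).insert(0, (ts, post_id))
        return len(fans)

    def timeline(self, user, limit=20):
        """Merge the precomputed feed with posts pulled from followed celebrities."""
        celeb_streams = [self.posts.get(a, [])
                         for a in self.following.get(user, ())
                         if self.is_celebrity(a)]
        merged = heapq.merge(self.pushed.get(user, []), *celeb_streams, reverse=True)
        # dict.fromkeys drops duplicates while keeping order, e.g. posts pushed
        # before an author crossed the threshold.
        unique = dict.fromkeys(post_id for _, post_id in merged)
        return list(unique)[:limit]
```

The read path only pays the pull cost for the handful of celebrities a user follows, while the write path never fans out to more than threshold followers.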
Deep Dive: Media Transcoding
After a user uploads a photo or video, it's crucial to process it into various formats and sizes. This ensures that the media can be delivered efficiently to a wide range of devices and network conditions. Your interviewer will expect you to discuss how you would build a reliable and scalable transcoding pipeline.
The Role of DAGs in Media Processing
Media transcoding is not a single step, but a complex workflow of dependent tasks. For a video, this might include extracting audio, generating multiple video resolutions (e.g., 480p, 720p, 1080p), creating thumbnails, and then packaging everything for adaptive bitrate streaming. These tasks form a Directed Acyclic Graph (DAG), where some tasks must complete before others can begin.
Using a workflow orchestration engine that understands DAGs is essential for managing this complexity. It allows us to define the dependencies between tasks, handle retries for individual steps, and monitor the progress of the entire workflow. This is a key concept to bring up in your interview to show you've thought about the reliability of the system.
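To show what "understanding a DAG" buys us, the sketch below models the video workflow as a dependency map and uses Python's standard graphlib module to work out which tasks can run in parallel. The task names are hypothetical, and the initial probe step is an assumption added so that every branch has a common starting point.

```python
from graphlib import TopologicalSorter

# task -> set of tasks that must finish first (hypothetical task names)
TRANSCODE_DAG = {
    "probe": set(),
    "extract_audio": {"probe"},
    "transcode_480p": {"probe"},
    "transcode_720p": {"probe"},
    "transcode_1080p": {"probe"},
    "thumbnails": {"probe"},
    "package_abr": {"extract_audio", "transcode_480p",
                    "transcode_720p", "transcode_1080p"},
}


def execution_waves(dag):
    """Group tasks into waves; every task in a wave has all dependencies met,
    so the tasks within one wave could run in parallel."""
    sorter = TopologicalSorter(dag)
    sorter.prepare()  # raises graphlib.CycleError if the graph has a cycle
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())
        waves.append(ready)
        sorter.done(*ready)  # mark the wave finished to unlock dependents
    return waves
```

For this graph, execution_waves(TRANSCODE_DAG) yields three waves: the probe step, then the audio, thumbnail, and per-resolution tasks together, then packaging. An orchestration engine adds retries, persistence, and monitoring on top of this ordering.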
Apache Airflow with Kubernetes Executor
Why: Airflow is a mature, battle-tested workflow orchestration platform that is excellent for managing complex DAGs like our transcoding pipeline.
How it works: We can define our transcoding workflow as an Airflow DAG. When a new media file is uploaded, it triggers a new DAG run. The Kubernetes executor allows Airflow to dynamically spin up a new pod for each task with the specific resources it needs (e.g., a GPU-enabled pod for video transcoding).
Trade-offs:
- Pros: Powerful and flexible, with a large community and many pre-built integrations. Excellent for managing complex dependencies and scheduling.
- Cons: Can have significant operational overhead. The scheduling-based model may not be the best fit for event-driven workflows.
Temporal Workflow Engine
Why: Temporal is a more modern, code-first workflow engine that is particularly well-suited for long-running, stateful workflows like media transcoding.
How it works: With Temporal, the workflow is written as ordinary code (a workflow function) that invokes the individual steps as activities. Temporal's durable execution model persists the workflow's state across failures, so if a worker crashes during a long transcoding job, another worker can resume the workflow from its last recorded step instead of starting over.
Trade-offs:
- Pros: More developer-friendly programming model. Excellent for long-running, stateful workflows. Built-in support for retries and error handling.
- Cons: Newer technology with a smaller ecosystem than Airflow. Requires learning a new programming model.
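The resume-after-crash behaviour can be illustrated without any workflow engine. The sketch below is a toy in plain Python and is not Temporal's API: run_workflow and the history dict are hypothetical, and the dict stands in for what would be a durable store in practice. Each completed step is recorded, and recorded steps are skipped when the workflow runs again.

```python
def run_workflow(steps, history):
    """Run (name, fn) steps in order, recording each result in `history`.

    Steps already present in `history` are skipped, so calling this again
    after a crash resumes at the first unfinished step.
    """
    for name, fn in steps:
        if name in history:
            continue  # completed in an earlier attempt
        # If fn raises, nothing is recorded for this step and earlier
        # results stay in history, so a later call retries from here.
        history[name] = fn()
    return history
```

If the second of three steps raises, a later call with the same history skips the first step and retries the second. This only works cleanly when steps are safe to retry, which is why idempotent transcoding tasks matter regardless of the engine chosen.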
Complete Design
Now that we've covered all the major components individually, let's look at how everything fits together in our complete Instagram system design. This diagram shows the end-to-end flow from a user posting a photo to it appearing in their followers' timelines.
The complete architecture demonstrates how the Media Service, Timeline Service, and User Service work together to provide a scalable and reliable Instagram-like experience. The use of a hybrid timeline generation model and a dedicated media processing pipeline ensures that the system can handle millions of users and a high volume of media uploads and views.