
Design YouTube

Difficulty: Medium
YouTube is one of the world's largest video-sharing platforms, serving billions of users globally. This system design explores the architecture needed to handle video upload and transcoding at massive scale, along with efficient streaming. For this question, we'll ignore features like ML ranking, likes, and view counts and focus on the core functionality.

Variants: Netflix, Vimeo, Twitch, Hulu, Dailymotion, TikTok

YouTube System Design Requirements

Functional Requirements

1. Video Upload & Storage

Users should be able to upload videos of various formats and resolutions. The system should process and store videos efficiently with multiple quality options.

2. Video Streaming & Playback

Users should be able to stream videos smoothly with adaptive bitrate streaming based on network conditions and device capabilities.

Non-Functional Requirements

1. Scalability

The system should handle billions of videos and support millions of concurrent users with horizontal scaling capabilities.

2. Performance & Latency

Video streaming should have minimal buffering with <2 second start time. Search results should return within 100ms for optimal user experience.

3. Availability & Reliability

The system should maintain 99.9% uptime with robust failover mechanisms and data replication across multiple regions.

CAP Theorem Trade-offs

Chosen: Availability and Partition Tolerance (over Consistency)
Trade-off Explanation:

YouTube prioritizes availability and partition tolerance over strict consistency. Users can tolerate slight delays in seeing new uploads or updated view counts, but the service must remain available globally and handle network partitions between data centers gracefully.

Scale Estimates

- User Base: 100M users (10^8). Base assumption for system sizing.
- Videos Uploaded/Day: 10M videos (10^7)
- Storage per Video: 1 GB (10^9 bytes)
- Daily Storage Growth: 10M videos × 1 GB = 10 PB (10^16 bytes)
- Yearly Storage Growth: ~3.7 EB (3.7 × 10^18 bytes)
- Total Storage Required: multiple exabytes within the first year, before replication. Massive storage needed for video files and backups.
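
If you want to sanity-check these numbers in the interview, the math is simple enough to run in a few lines. Here's a quick back-of-the-envelope calculation (plain Python, nothing YouTube-specific):

```python
# Back-of-the-envelope check of the storage estimates above.
users = 100_000_000           # 10^8
uploads_per_day = 10_000_000  # 10^7
bytes_per_video = 10**9       # 1 GB per stored video

daily_growth = uploads_per_day * bytes_per_video  # 10^16 bytes = 10 PB/day
yearly_growth = daily_growth * 365                # ~3.7 * 10^18 bytes = ~3.7 EB/year

print(f"daily growth:  {daily_growth / 1e15:.0f} PB")
print(f"yearly growth: {yearly_growth / 1e18:.1f} EB")
```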

High-Level Architecture

Clients

While the client is sometimes overlooked, it's important to correctly identify the platforms we are building for. For this system design, we will assume we need to support iOS, Android, and web clients.


Video Metadata Server + Metadata Database Design


For our database, we're choosing DynamoDB because we need a highly available and scalable solution. While traditional SQL databases have also become incredibly scalable, DynamoDB's architecture is well-suited for the massive throughput required by a platform like YouTube. Ultimately, both SQL and NoSQL can work, and the choice often comes down to team familiarity and specific access patterns. There's no right answer, but we're going to use DynamoDB.

Database Schema

users

- user_id: string (Partition Key)
- username: string
- email: string
- created_at: timestamp

videos

- video_id: string (Partition Key)
- upload_date: timestamp (Sort Key)
- uploader_id: string
- title: string
- description: string
- duration_seconds: int
- category_id: string
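
To make the schema concrete, here's a rough sketch of what creating the videos table could look like with boto3. The table and attribute names mirror the schema above; the on-demand billing mode and the choice to store timestamps as ISO-8601 strings are assumptions for the sketch:

```python
# Sketch: create the videos table from the schema above with boto3 (assumed settings).
import boto3

dynamodb = boto3.client("dynamodb")

dynamodb.create_table(
    TableName="videos",
    AttributeDefinitions=[
        {"AttributeName": "video_id", "AttributeType": "S"},
        {"AttributeName": "upload_date", "AttributeType": "S"},  # ISO-8601 timestamp string
    ],
    KeySchema=[
        {"AttributeName": "video_id", "KeyType": "HASH"},     # partition key
        {"AttributeName": "upload_date", "KeyType": "RANGE"}, # sort key
    ],
    BillingMode="PAY_PER_REQUEST",  # on-demand capacity; an assumption, not a requirement
)
```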

Video Upload

When handling large file uploads, such as videos, it's crucial to avoid uploading directly to the application server. This approach can quickly overwhelm the server's resources, leading to slow performance and potential crashes. Instead, the best practice is to upload directly to a cloud-based blob storage provider. This offloads the heavy lifting of file storage to a dedicated, highly scalable service, ensuring your application remains responsive.
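
One common way to let clients upload straight to blob storage is to have the API server hand out short-lived presigned URLs. Here's a minimal sketch assuming S3; the bucket name, key layout, and expiry are illustrative, and for large files you'd combine this with the resumable upload approaches discussed later:

```python
# Sketch: the API server issues a presigned URL so video bytes go straight to S3
# and never pass through the application server. Names here are assumptions.
import uuid

import boto3

s3 = boto3.client("s3")

def create_upload_url(user_id: str, filename: str) -> dict:
    video_id = str(uuid.uuid4())
    key = f"raw-uploads/{user_id}/{video_id}/{filename}"
    url = s3.generate_presigned_url(
        "put_object",
        Params={"Bucket": "raw-videos", "Key": key},
        ExpiresIn=3600,  # the client has one hour to start the upload
    )
    return {"video_id": video_id, "upload_url": url}
```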

Blob Storage

Blob (Binary Large Object) storage is a type of cloud storage designed for unstructured data like videos, images, and backups. The most popular provider is Amazon S3, but other major players include Google Cloud Storage and Azure Blob Storage. These services provide high durability, availability, and scalability, making them ideal for storing large amounts of data. In any system design interview that involves uploading binary files, your first thought should always be to use a blob storage provider.

Video Processing and Transcoding


Once a video is successfully uploaded, we need to process and transcode it into multiple formats and resolutions to ensure optimal streaming across different devices and network conditions. This involves creating adaptive bitrate versions (360p, 720p, 1080p, 4K), generating thumbnails, extracting metadata, and potentially applying content moderation filters.

The transcoding pipeline is one of the most resource-intensive parts of the YouTube system, often requiring hours of compute time for longer videos. We need a solution that can handle massive scale, provide reliability through retries, and efficiently manage compute resources.
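
Under the hood, each transcoding task is usually just a wrapper around a tool like ffmpeg. As a rough sketch (paths, bitrates, and segment length are assumptions, not YouTube's actual settings), producing a single 720p HLS rendition might look like this:

```python
# Sketch: one transcoding task, assuming ffmpeg is installed on the worker.
# Produces a 720p H.264/AAC HLS rendition with ~6-second segments.
import os
import subprocess

def transcode_to_hls_720p(src: str, out_dir: str) -> None:
    os.makedirs(out_dir, exist_ok=True)
    subprocess.run(
        [
            "ffmpeg", "-i", src,
            "-vf", "scale=-2:720",             # resize to 720p, keep aspect ratio
            "-c:v", "libx264", "-b:v", "2800k",
            "-c:a", "aac", "-b:a", "128k",
            "-f", "hls",
            "-hls_time", "6",                  # segment length in seconds
            "-hls_playlist_type", "vod",
            "-hls_segment_filename", f"{out_dir}/%05d.ts",
            f"{out_dir}/playlist.m3u8",
        ],
        check=True,
    )
```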

Adaptive Bitrate Streaming

The core of modern video streaming relies on Adaptive Bitrate (ABR) streaming, which dynamically adjusts video quality based on the viewer's network conditions and device capabilities. Instead of serving a single video file, we generate multiple versions of each video at different resolutions and bitrates, then package them into streaming formats that clients can intelligently switch between.

The two dominant streaming protocols are HLS (HTTP Live Streaming) and MPEG-DASH (Dynamic Adaptive Streaming over HTTP). HLS was developed by Apple and is the primary format for iOS devices, Safari, and Apple TV, using .m3u8 playlist files to describe video segments. MPEG-DASH is an open standard supported by most other platforms including Android, Chrome, Firefox, and smart TVs, using XML-based MPD (Media Presentation Description) files.

During transcoding, we generate multiple renditions of each video - typically 360p, 480p, 720p, 1080p, and sometimes 4K - each with optimized bitrates for different network conditions. We also create audio-only tracks for bandwidth-constrained scenarios. All of these segments get uploaded to S3 with a structured naming convention like videos/{video_id}/{resolution}/{segment_number}.ts for HLS or .m4s for DASH.

The manifest file is the key piece that ties everything together. For HLS, the master .m3u8 playlist contains references to individual quality playlists, while MPEG-DASH uses a single .mpd file that describes all available representations. These manifest files are generated after transcoding completes and uploaded to S3, allowing clients to discover available quality options and seamlessly switch between them during playback based on network performance.
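
To make the manifest idea concrete, here's a small sketch that generates an HLS master playlist for a few renditions. The bitrates and the per-rendition playlist paths are assumptions that follow the S3 naming convention above:

```python
# Sketch: build an HLS master playlist after all renditions finish transcoding.
RENDITIONS = [
    # (folder name, width x height, peak bandwidth in bits/sec) -- assumed values
    ("360p", "640x360", 800_000),
    ("720p", "1280x720", 2_800_000),
    ("1080p", "1920x1080", 5_000_000),
]

def build_master_playlist(video_id: str) -> str:
    lines = ["#EXTM3U", "#EXT-X-VERSION:3"]
    for name, resolution, bandwidth in RENDITIONS:
        lines.append(f"#EXT-X-STREAM-INF:BANDWIDTH={bandwidth},RESOLUTION={resolution}")
        # Relative URI to that rendition's media playlist in blob storage.
        lines.append(f"{name}/playlist.m3u8")
    return "\n".join(lines) + "\n"

print(build_master_playlist("abc123"))
```

The client downloads this master playlist first, then picks a rendition playlist based on measured bandwidth and switches between them segment by segment.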

Now that we understand the complexity of adaptive bitrate streaming, it becomes clear that orchestrating this entire process requires sophisticated workflow management. We're not just transcoding a single video file - we need to coordinate dozens of parallel tasks including audio extraction, multiple video rendition generation, thumbnail creation, manifest file generation, and metadata updates. Each of these tasks has dependencies, retry requirements, and resource constraints that must be carefully managed. This is where DAG (Directed Acyclic Graph) orchestrators become essential for managing the intricate relationships between all these transcoding tasks.

1. Apache Airflow with Kubernetes Executor

Apache Airflow provides a robust workflow orchestration platform that's perfect for managing complex transcoding pipelines. With Airflow, you define your transcoding workflow as a Directed Acyclic Graph (DAG) where each task represents a step in the process - extracting audio, transcoding to different resolutions, generating thumbnails, and updating metadata.

The Kubernetes executor is particularly powerful for transcoding workloads because it can dynamically spin up pods with the exact CPU and memory requirements for each transcoding task. When a 4K video needs processing, Airflow can launch a high-memory pod with GPU acceleration, while smaller videos get processed on standard compute nodes. This provides excellent resource efficiency since you're not keeping expensive transcoding workers idle.

Airflow's built-in retry mechanisms and failure handling are crucial for video processing, where individual tasks might fail due to corrupted video segments or resource constraints. The web UI provides excellent visibility into pipeline progress, making it easy to debug failed transcoding jobs and monitor overall system health. You can also set up alerting for when transcoding queues get backed up or when certain video formats consistently fail.

The main trade-off is complexity - Airflow requires significant operational overhead to run properly, including setting up the scheduler, web server, and managing the underlying Kubernetes cluster. However, if you're already using Kubernetes for other workloads, this integration feels natural and provides powerful scaling capabilities.
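
As a rough illustration (task names and the placeholder callables are assumptions, not a production pipeline), a transcoding DAG in Airflow 2.x might look something like this:

```python
# Sketch of a transcoding DAG. In practice each task would launch a Kubernetes
# pod sized for the rendition it produces; here the callables just print.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def transcode(resolution: str, **context):
    # Placeholder for running ffmpeg against the uploaded source file.
    print(f"transcoding to {resolution}")


with DAG(
    dag_id="video_transcoding",
    start_date=datetime(2024, 1, 1),
    schedule=None,  # triggered per upload rather than on a schedule
    catchup=False,
) as dag:
    extract_audio = PythonOperator(
        task_id="extract_audio",
        python_callable=lambda: print("extracting audio track"),
    )
    renditions = [
        PythonOperator(
            task_id=f"transcode_{res}",
            python_callable=transcode,
            op_kwargs={"resolution": res},
        )
        for res in ["360p", "720p", "1080p"]
    ]
    build_manifest = PythonOperator(
        task_id="build_manifest",
        python_callable=lambda: print("writing master playlist"),
    )

    # Fan out to the parallel renditions, then fan back in for the manifest.
    extract_audio >> renditions
    renditions >> build_manifest
```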

2. Temporal Workflow Engine

Temporal offers a more modern approach to workflow orchestration that's particularly well-suited for long-running, stateful processes like video transcoding. Unlike traditional queue-based systems, Temporal workflows are written as regular code that can maintain state across failures, making complex transcoding logic much easier to reason about and debug.

The key advantage of Temporal is its handling of long-running workflows. Video transcoding can take hours for large files, and Temporal's durable execution ensures that if a worker crashes or gets restarted, the workflow picks up exactly where it left off without losing progress. This is incredibly valuable for expensive transcoding operations where you don't want to restart from scratch due to infrastructure hiccups.

Temporal's activity retry policies are more sophisticated than traditional queue systems, allowing you to define exponential backoff strategies, maximum retry limits, and even custom retry logic based on the type of failure. For transcoding, this means you can aggressively retry temporary network issues while immediately failing on corrupted video files that will never succeed.

The workflow visibility and debugging experience is excellent - you can see exactly which step of transcoding failed, inspect the input parameters, and even replay workflows with different configurations. This makes troubleshooting much faster than digging through logs from multiple different services.

The main consideration with Temporal is that it's a newer technology with a smaller ecosystem compared to Airflow. You'll need to invest in learning the Temporal programming model, and there are fewer pre-built integrations with existing tools. However, the programming model is much more intuitive for developers, especially when dealing with complex conditional logic in transcoding workflows.
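
For comparison, here's a minimal sketch of the same idea using the Temporal Python SDK (temporalio). The workflow, activity, timeouts, and retry settings are illustrative assumptions:

```python
# Sketch: a durable transcoding workflow with Temporal. If a worker dies mid-way,
# the workflow resumes from its history instead of starting over.
from datetime import timedelta

from temporalio import activity, workflow
from temporalio.common import RetryPolicy


@activity.defn
async def transcode_rendition(video_id: str, resolution: str) -> str:
    # Placeholder: invoke ffmpeg / a transcoding service and return the S3 key.
    return f"videos/{video_id}/{resolution}/playlist.m3u8"


@workflow.defn
class TranscodeVideoWorkflow:
    @workflow.run
    async def run(self, video_id: str) -> list[str]:
        retry = RetryPolicy(maximum_attempts=5, backoff_coefficient=2.0)
        outputs = []
        for resolution in ["360p", "720p", "1080p"]:
            # Each completed rendition is recorded durably before the next starts.
            key = await workflow.execute_activity(
                transcode_rendition,
                args=[video_id, resolution],
                start_to_close_timeout=timedelta(hours=2),
                retry_policy=retry,
            )
            outputs.append(key)
        return outputs
```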

Which Solution Should You Choose?

For most established organizations already running Kubernetes, Airflow provides a battle-tested solution with extensive community support and integrations. It's particularly strong if you need complex scheduling requirements or integration with existing data pipeline tools.

Temporal shines when you have complex transcoding logic with lots of conditional branching, need strong durability guarantees, or want a more developer-friendly approach to workflow management. It's especially compelling for teams that value code-first configuration over UI-based workflow definition.

Both solutions can handle YouTube-scale transcoding workloads effectively - the choice often comes down to your team's existing expertise and operational preferences.

Resumable Uploads

When users are uploading large video files (which can be several gigabytes), we need a robust system that can handle network interruptions, connection drops, and other failures. Nobody wants to restart a 2-hour upload because their WiFi hiccupped at 99% completion!

There are two main approaches we can take here, and both have their merits depending on your infrastructure setup.

1. Using Amazon S3 Multipart Upload

If you've chosen S3 as your blob storage solution (which is pretty common), then you're in luck - AWS has already solved this problem for you with multipart uploads. This is honestly the route I'd recommend for most real-world scenarios.

Here's how S3 multipart upload works: The client first calls S3's "CreateMultipartUpload" API to get an upload ID, then breaks the video file into chunks (typically 5MB to 5GB each). The client uploads 3-5 chunks in parallel for better throughput, and each successful chunk upload returns an ETag (basically a receipt). Once all chunks are uploaded, you call "CompleteMultipartUpload" with all the ETags to finalize the process.

This approach rocks because if a chunk fails, you just retry that specific chunk rather than the entire upload. Multiple chunks uploading simultaneously means faster uploads, and S3 handles all the stitching and validation automatically. You're not reinventing the wheel, which makes it cost effective. The beauty is that other cloud providers like Google Cloud Storage and Azure Blob Storage have similar mechanisms, so this pattern works across different platforms.
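
Here's a minimal boto3 sketch of that flow. The bucket, key, and 10MB part size are assumptions (S3 requires parts of at least 5MB, except the last one), and a real client would upload several parts in parallel rather than sequentially:

```python
# Sketch: S3 multipart upload -- create, upload parts, then complete with the ETags.
import boto3

s3 = boto3.client("s3")
BUCKET, KEY = "raw-videos", "uploads/abc123/movie.mp4"
PART_SIZE = 10 * 1024 * 1024  # 10 MB per part (assumed)

def multipart_upload(path: str) -> None:
    upload = s3.create_multipart_upload(Bucket=BUCKET, Key=KEY)
    upload_id = upload["UploadId"]
    parts = []
    with open(path, "rb") as f:
        part_number = 1
        while chunk := f.read(PART_SIZE):
            # Each part can be retried independently if it fails.
            resp = s3.upload_part(
                Bucket=BUCKET, Key=KEY, UploadId=upload_id,
                PartNumber=part_number, Body=chunk,
            )
            parts.append({"ETag": resp["ETag"], "PartNumber": part_number})
            part_number += 1
    # S3 stitches the parts together server-side once every ETag is submitted.
    s3.complete_multipart_upload(
        Bucket=BUCKET, Key=KEY, UploadId=upload_id,
        MultipartUpload={"Parts": parts},
    )
```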

2. Implementing Your Own Chunking Logic

Now, if you're not using S3's built-in multipart upload, or if your interviewer wants to see you implement this manually (some people love the nitty-gritty details), here's how you'd build your own chunking system.

The manual chunking process starts with the client splitting the video file into chunks (say 10MB each), calculating the total chunks needed, and generating a unique upload_id. During the upload process, each chunk gets uploaded to S3 with a structured key like uploads/{upload_id}/chunk_{number}, and the chunk metadata gets updated in DynamoDB upon successful upload. The client can upload multiple chunks in parallel for better performance.

For resume logic, if an upload fails, the client simply queries DynamoDB to see which chunks are missing and only re-uploads those failed chunks - no need to restart from scratch! Once all chunks are uploaded, you mark the upload as complete and trigger the video processing pipeline.
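
A sketch of that resume check might look like this, assuming a DynamoDB table called upload_chunks keyed by upload_id (partition key) and chunk_number (sort key) that records each finished chunk:

```python
# Sketch: figure out which chunks still need uploading after a failure.
import boto3
from boto3.dynamodb.conditions import Key

chunks_table = boto3.resource("dynamodb").Table("upload_chunks")

def missing_chunks(upload_id: str, total_chunks: int) -> list[int]:
    # Result pagination is omitted for brevity.
    resp = chunks_table.query(KeyConditionExpression=Key("upload_id").eq(upload_id))
    uploaded = {int(item["chunk_number"]) for item in resp["Items"]}
    # Only these chunks need to be re-uploaded; everything else is already in S3.
    return [n for n in range(total_chunks) if n not in uploaded]
```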

Which Solution Should You Choose?

In a real interview, I'd typically just explain S3 multipart upload and call it a day - it's battle-tested, efficient, and shows you understand existing solutions. But if your interviewer wants to dig deeper into the implementation details, the manual chunking approach shows you can think through the data modeling and edge cases.

Both approaches solve the core problem: users can resume uploads, network failures don't ruin your day, and large files get uploaded efficiently. The key is picking the right tool for your specific constraints and requirements.

Don't Upload the Entire File at Once

One critical mistake that many developers make is attempting to upload large video files as a single monolithic request. This approach is fundamentally flawed because most web servers and load balancers have request timeout limits of 30-60 seconds, while a 2GB video upload could easily take 10+ minutes on a slow connection.

Additionally, uploading entire files at once can consume massive amounts of server memory when multiple users upload simultaneously, and creates an all-or-nothing scenario where a network hiccup at 99% completion forces users to restart completely. Mobile devices are particularly susceptible to these issues due to network interruptions and memory constraints.

The chunked approaches we discussed above solve all of these problems by breaking uploads into manageable pieces that can be retried independently, providing a much better user experience and more reliable system.

CDN

A common follow-up question in system design interviews is, "How can we make this faster for users all over the world?" When you're serving huge files like videos, the distance between your user and your server really matters. The best way to solve this is with a Content Delivery Network (CDN).

Content Delivery Networks (CDNs)

A Content Delivery Network is a geographically distributed network of proxy servers that cache and serve content from locations closest to end users. Instead of every video request traveling back to your origin servers (which might be in a single AWS region), CDNs maintain copies of your content at edge locations around the world, dramatically reducing latency and bandwidth costs.

The major CDN providers include AWS CloudFront, Google Cloud CDN, Cloudflare, and Fastly. CloudFront integrates seamlessly with S3 and offers over 400 edge locations globally, making it a natural choice if you're already using AWS infrastructure. Cloudflare provides excellent DDoS protection and analytics alongside CDN services, while Fastly offers real-time configuration changes and advanced caching controls that are particularly useful for live streaming scenarios.

For our YouTube system, the CDN sits in front of the S3 bucket that holds the transcoded output. Most large CDNs, CloudFront included, work as pull-through caches: the first request for a segment at a given edge location is fetched from the origin and then served from cache for subsequent viewers. For content we expect to be popular, we can pre-warm edge caches by requesting the new segments, thumbnails, and manifest files through the CDN shortly after transcoding completes (typically within 5-10 minutes), while less popular videos are simply cached on demand when first requested.

The CDN also handles cache invalidation when we need to update content - for example, if we regenerate thumbnails or fix metadata issues. Most CDNs offer TTL (Time To Live) settings that automatically expire cached content after a specified period, typically 24-48 hours for video content, ensuring that updates eventually propagate to all edge locations without manual intervention.
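
One practical detail: since the CDN fronts S3, you can control edge TTLs by attaching Cache-Control headers when transcoded segments are written. A small sketch, with the bucket name and 24-hour TTL as assumptions:

```python
# Sketch: write an HLS segment to S3 with cache headers the CDN will honor.
import boto3

s3 = boto3.client("s3")

def publish_segment(video_id: str, resolution: str, seq: int, data: bytes) -> None:
    s3.put_object(
        Bucket="transcoded-videos",
        Key=f"videos/{video_id}/{resolution}/{seq:05d}.ts",
        Body=data,
        ContentType="video/MP2T",              # HLS transport-stream segment
        CacheControl="public, max-age=86400",  # edge and browser TTL of 24 hours
    )
```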


Complete Design

Now that we've covered all the major components individually, let's look at how everything fits together in our complete YouTube system design. This diagram shows the end-to-end flow from video upload through transcoding to global content delivery.

[Diagram: complete end-to-end architecture, from resumable upload through transcoding and blob storage to CDN delivery]

The complete architecture demonstrates how users upload videos through resumable upload mechanisms to our blob storage, which triggers our transcoding pipeline (orchestrated by either Airflow or Temporal) to generate multiple quality versions and adaptive bitrate streams. The processed content is then distributed globally through our CDN network, enabling low-latency streaming to billions of users worldwide. Each component we've discussed - from chunked uploads to HLS/DASH manifest generation to edge caching - plays a crucial role in delivering the seamless video experience that users expect from a platform like YouTube.