Design Dropbox

Difficulty: Medium
Dropbox is a file hosting service that offers cloud storage, file synchronization, personal cloud, and client software. This system design will focus on the core functionality of uploading, downloading, and syncing files across multiple devices.

Variants: Google Drive, OneDrive, iCloud, Box

Dropbox System Design Requirements

Functional Requirements

1. File Upload & Download: Users should be able to upload and download files to their Dropbox account.
2. File Sharing: Users should be able to share files and view files shared with them.
3. File Sync: Users should be able to sync files with their OS file system.

Non-Functional Requirements

1. High Availability: The system should be highly available, prioritizing availability over consistency.
2. Security: User data and files should be secure.
3. Large File Support: The system should support uploads of large files (> 50 GB).
4. Low Latency: File access and sync should be low latency.

CAP Theorem Trade-offs

Trade-off Explanation:

Dropbox prioritizes availability and partition tolerance. It's more important for users to be able to access their files, even if it means there's a slight delay in syncing the absolute latest version across all devices.

Scale Estimates

  • User Base: 500M users (5 × 10^8), the base assumption for system sizing.

  • Total Storage: 500 PB (5 × 10^17 bytes), assuming 1 GB of storage per user on average.

  • Files Uploaded/Day: 1B files (10^9).

  • Avg File Size: 500 KB (5 × 10^5 bytes).

  • Daily Upload Volume: 500 TB (5 × 10^14 bytes), i.e. 1B files × 500 KB.
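
As a quick sanity check, the arithmetic behind these estimates:

```python
# Back-of-envelope check of the scale estimates above.
users = 500_000_000            # 5 * 10^8 users
storage_per_user = 10**9       # assume 1 GB per user on average
files_per_day = 1_000_000_000  # 10^9 file uploads per day
avg_file_size = 500_000        # 5 * 10^5 bytes (500 KB)

total_storage = users * storage_per_user      # 5 * 10^17 bytes
daily_upload = files_per_day * avg_file_size  # 5 * 10^14 bytes

print(f"Total storage: {total_storage / 10**15:.0f} PB")       # 500 PB
print(f"Daily upload volume: {daily_upload / 10**12:.0f} TB")  # 500 TB
```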
API Design

POST /v1/metadata

Creates a metadata record for a new file.

Request Body

```json
{
  "fileName": "string",
  "size": "number",
  "mimeType": "string"
}
```

Response Body

```json
{
  "fileId": "string",
  "uploadUrl": "https://mybucket.s3.amazonaws.com/..."
}
```

PUT {uploadUrl}

Uploads the file directly to the S3 presigned URL.

Request Body

Raw binary file data.

Response

Status: 200 OK

GET /v1/url?fileName={fileName}

Retrieves a presigned download URL for a file. Since GET requests should not carry a body, the file name is passed as a query parameter.

Response Body

```json
{
  "downloadUrl": "https://mybucket.s3.amazonaws.com/..."
}
```
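
To make this flow concrete, here is a minimal client-side sketch using Python's requests library. The api.example.com host and the example file are assumptions for illustration; the endpoints are the ones defined above.

```python
import requests

API = "https://api.example.com"  # hypothetical host for the metadata service

# Step 1: create a metadata record and receive a presigned upload URL.
resp = requests.post(f"{API}/v1/metadata", json={
    "fileName": "report.pdf",
    "size": 1_048_576,
    "mimeType": "application/pdf",
})
resp.raise_for_status()
upload_url = resp.json()["uploadUrl"]

# Step 2: PUT the raw bytes directly to S3 via the presigned URL;
# the file data never passes through our application servers.
with open("report.pdf", "rb") as f:
    requests.put(upload_url, data=f).raise_for_status()

# Later, to download: ask for a presigned download URL, then GET it.
resp = requests.get(f"{API}/v1/url", params={"fileName": "report.pdf"})
download_url = resp.json()["downloadUrl"]
file_bytes = requests.get(download_url).content
```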

File Metadata Service + Database Schema

The File Metadata Service is the central nervous system for all file-related operations. It's responsible for handling file uploads, downloads, and managing the metadata associated with each file. When a client wants to upload a file, it first communicates with this service to create a metadata record and get an upload URL. Similarly, for downloads, the client asks the service for a secure download link.
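On the service side, generating those URLs is typically a thin wrapper over the object store's SDK. A sketch with boto3, assuming the bucket name and a 5-minute expiry:

```python
import uuid
import boto3

s3 = boto3.client("s3")
BUCKET = "mybucket"  # assumed bucket name

def create_upload_url(file_name: str) -> dict:
    """Create a metadata record (DB write omitted) and a presigned PUT URL."""
    file_id = str(uuid.uuid4())
    upload_url = s3.generate_presigned_url(
        "put_object",
        Params={"Bucket": BUCKET, "Key": file_id},
        ExpiresIn=300,  # URL is valid for 5 minutes
    )
    return {"fileId": file_id, "uploadUrl": upload_url}

def create_download_url(file_key: str) -> str:
    """Return a short-lived presigned GET URL for an existing object."""
    return s3.generate_presigned_url(
        "get_object",
        Params={"Bucket": BUCKET, "Key": file_key},
        ExpiresIn=300,
    )
```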


For the database itself, we'll use a relational database like PostgreSQL. While NoSQL databases like DynamoDB are great for certain use cases, a relational database is a solid choice here because it gives us strong consistency and the ability to perform complex queries, which can be useful for features like file sharing and permissions.

FileMetadata

| Column | Type | Constraints |
| --- | --- | --- |
| id | uuid | Primary Key |
| userId | uuid | |
| filePath | text | |
| lastModified | timestamp | |
| mimeType | string | |
| size | bigint | |
| s3Url | string | |
| encryptionKey | string | |
| uploadComplete | boolean | |

Shares

| Column | Type | Constraints |
| --- | --- | --- |
| fileId | uuid | Foreign Key |
| userId | uuid | Foreign Key |
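
As a sketch, the same schema in PostgreSQL DDL, created via psycopg2 (table and column names are snake_case equivalents of the schema above; the connection string is an assumption):

```python
import psycopg2

DDL = """
CREATE TABLE file_metadata (
    id              uuid PRIMARY KEY,
    user_id         uuid NOT NULL,
    file_path       text NOT NULL,
    last_modified   timestamp NOT NULL,
    mime_type       text,
    size            bigint,
    s3_url          text,
    encryption_key  text,
    upload_complete boolean DEFAULT false
);

CREATE TABLE shares (
    file_id uuid REFERENCES file_metadata (id),
    user_id uuid NOT NULL,
    PRIMARY KEY (file_id, user_id)
);
"""

# Run the DDL inside a single transaction.
with psycopg2.connect("dbname=dropbox") as conn:
    with conn.cursor() as cur:
        cur.execute(DDL)
```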

Security and Compression

When you're dealing with user files, security is a top priority. Here’s a straightforward approach to keeping things safe:

  • Encryption at Rest: Before we even save a file to S3, we should encrypt it. A common and strong method is AES-256. We can generate a unique encryption key for each file, use it to encrypt the data, and then store this key right alongside the file's metadata. This way, even if someone gets access to the raw file in S3, it's just scrambled data without the key (see the sketch after this list).

  • Secure Downloads: We should never expose direct links to our S3 files. Instead, when a user wants to download something, we generate a temporary, presigned URL. These links are set to expire after a short time, like a few minutes, which means they can't be passed around and used later.

  • CDN Security: We can also add a layer of security at the CDN level. The CDN can be configured to check for a specific signature on every request it receives. This signature is a hash (like HMAC-SHA256) created from the request details and a secret key. If the signature doesn't match, the CDN rejects the request, preventing unauthorized access. For example, a signature could be generated as: HMAC-SHA256(key = "MySuperSecretKey123!", message = "/bucket_name/myfile.txt+10min+GET").
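
A minimal sketch of both ideas in Python, using the cryptography package for AES-256-GCM and the standard library for the HMAC signature. The secret key, the signed-string format, and key storage details are illustrative assumptions:

```python
import hmac
import hashlib
import os
from cryptography.hazmat.primitives.ciphers.aead import AESGCM

# --- Encryption at rest: one AES-256 key per file ---
def encrypt_file(data: bytes) -> tuple[bytes, bytes]:
    key = AESGCM.generate_key(bit_length=256)  # unique per-file key
    nonce = os.urandom(12)                     # standard 96-bit GCM nonce
    ciphertext = AESGCM(key).encrypt(nonce, data, None)
    # Store `key` alongside the file's metadata; keep the nonce with the blob.
    return key, nonce + ciphertext

# --- CDN request signing: HMAC-SHA256 over the request details ---
SECRET = b"MySuperSecretKey123!"  # shared secret between origin and CDN

def sign_request(path: str, expiry: str, method: str) -> str:
    message = f"{path}+{expiry}+{method}".encode()
    return hmac.new(SECRET, message, hashlib.sha256).hexdigest()

signature = sign_request("/bucket_name/myfile.txt", "10min", "GET")
# The CDN recomputes this signature and rejects the request on mismatch.
```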

Resumable Uploads

Uploading large files is tricky. A flaky internet connection could mean a user has to restart a multi-gigabyte upload from scratch, which is a terrible experience. To solve this, we need a way to make uploads resumable.

The standard approach is to break large files into smaller, more manageable chunks. This way, if an upload fails, the client only needs to re-upload the specific chunks that were lost, rather than the entire file.

1. Leverage Cloud Provider Features

Most cloud storage providers, like Amazon S3, have built-in features to handle this. S3's "Multipart Upload" is a great example. The process is simple: you tell S3 you're starting a multipart upload, and it gives you an uploadId. Then, you can upload each chunk of your file in parallel, and once you're done, you tell S3 to complete the upload. It's robust, efficient, and saves you from having to build this logic yourself.
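
A sketch of that flow with boto3 (the bucket, key, and 10 MB part size are assumptions):

```python
import boto3

s3 = boto3.client("s3")
BUCKET, KEY = "mybucket", "large-file.bin"  # assumed names
CHUNK_SIZE = 10 * 1024 * 1024               # 10 MB parts (S3 minimum is 5 MB)

# 1. Start the multipart upload and get an uploadId.
upload = s3.create_multipart_upload(Bucket=BUCKET, Key=KEY)
upload_id = upload["UploadId"]

# 2. Upload each chunk; S3 returns an ETag per part.
parts = []
with open("large-file.bin", "rb") as f:
    part_number = 1
    while chunk := f.read(CHUNK_SIZE):
        resp = s3.upload_part(
            Bucket=BUCKET, Key=KEY, UploadId=upload_id,
            PartNumber=part_number, Body=chunk,
        )
        parts.append({"ETag": resp["ETag"], "PartNumber": part_number})
        part_number += 1
        # If a part fails, only that part needs to be retried.

# 3. Tell S3 to assemble the parts into the final object.
s3.complete_multipart_upload(
    Bucket=BUCKET, Key=KEY, UploadId=upload_id,
    MultipartUpload={"Parts": parts},
)
```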

2. Build a Custom Chunking Service

If you want more control or are using a storage provider without this feature, you can build your own chunking service. This involves creating a system where the client first notifies the server that it wants to upload a file. The server creates a record for the file and its chunks in a database (like DynamoDB). The client then uploads each chunk individually, and the server tracks their progress. If a chunk fails, the client can query the server to see which chunks are missing and re-upload only those.
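
If you did build it yourself, the server's core job is bookkeeping: which chunks of which file have arrived. A minimal in-memory sketch of that tracking logic (a real system would persist this state in a database like DynamoDB, as noted above):

```python
class ChunkTracker:
    """Tracks which chunks of each file have been received."""

    def __init__(self):
        self.files = {}  # fileId -> {"total": int, "received": set of indices}

    def start_upload(self, file_id: str, total_chunks: int):
        self.files[file_id] = {"total": total_chunks, "received": set()}

    def record_chunk(self, file_id: str, chunk_index: int):
        self.files[file_id]["received"].add(chunk_index)

    def missing_chunks(self, file_id: str) -> list[int]:
        """The client calls this after a failure to resume the upload."""
        entry = self.files[file_id]
        return [i for i in range(entry["total"]) if i not in entry["received"]]

# Usage: the client restarts, asks what is missing, re-uploads only those.
tracker = ChunkTracker()
tracker.start_upload("file-123", total_chunks=5)
for i in (0, 1, 3):  # chunks 2 and 4 were lost mid-upload
    tracker.record_chunk("file-123", i)
print(tracker.missing_chunks("file-123"))  # [2, 4]
```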

Which is better?

For most use cases, using the cloud provider's built-in solution is the way to go. It's a solved problem, and it's better to focus your engineering efforts on your core product. However, understanding how to build a custom solution is a great way to demonstrate your understanding of the underlying concepts in an interview.

Avoid Single-File Uploads

Whatever you do, avoid uploading large files in a single request. Most web servers will time out after 30-60 seconds, and a large upload can easily take longer. This approach also consumes a lot of server memory and is very fragile. Chunking is always the better option for large files.

CDN

A common follow-up question in system design interviews is, "How can we make this faster for users all over the world?" When you're serving huge files like documents and videos, the distance between your user and your server really matters. The best way to solve this is with a Content Delivery Network (CDN).

Content Delivery Networks (CDNs)

A CDN is basically a network of servers spread across the globe that keep a copy of your files. When a user in Japan wants to download a file, instead of fetching it from a server in Virginia, they can grab it from a server in Tokyo. This drastically cuts down on loading time.

For a system like Dropbox, you'd put a CDN in front of the user-uploaded files in S3. The CDN can pull a file from S3 the first time it's requested and cache it at the edge, or you can proactively push popular content out to edge locations. Popular services like AWS CloudFront, Cloudflare, and Google Cloud CDN are the usual suspects here. They handle all the complexity of caching and routing traffic to the nearest server for you.


Complete Design

This section provides a summary of the complete system design for Dropbox, integrating all the components discussed. We've covered the core functional and non-functional requirements, the API design, the database schema, and the strategies for security, resumable uploads, and content delivery.
