Design a URL Shortener (TinyURL)
TinyURL System Design Requirements
Functional Requirements
URL Shortening
The service must accept a long URL and return a much shorter, unique URL.
URL Redirection
When a user accesses a short URL, they must be redirected to the original long URL.
Custom Aliases
Users should be able to specify a custom alias for their short URL, if available.
Link Expiration
Users should be able to set an expiration date for their short URLs.
Non-Functional Requirements
High Availability
The service must be highly available, with minimal downtime.
Low Latency
URL redirection should happen in under 100ms.
Scalability
The system must be able to handle billions of URLs and redirects per month.
Durability
Once a short URL is created, it should not be lost.
CAP Theorem Trade-offs
Trade-off Explanation:
For a URL shortener, we prioritize Availability and Partition Tolerance (AP). It's more important to be able to create and redirect URLs than to have strict consistency. Eventual consistency is acceptable, as a slight delay in a new short URL propagating through the system is not a critical issue.
Scale Estimates
API Design
The API for our URL shortener will be simple and focused on two main operations: creating a short URL and redirecting a short URL to its original long URL.
The creation endpoint will be a POST request that takes a long URL and returns a short URL. The redirect endpoint will be a GET request that takes a short code and returns a redirect (301 or 302, discussed below) to the original long URL.
- Create a new short URL
- Redirect to the original long URL
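To make the shape of these two endpoints concrete, here is a minimal sketch using Flask. The paths (`POST /urls`, `GET /{short_code}`), the payload fields, and the in-memory store are illustrative assumptions, not part of the design itself.

```python
import itertools

from flask import Flask, abort, jsonify, redirect, request

app = Flask(__name__)
url_store: dict[str, str] = {}  # stand-in for the real data store
_counter = itertools.count(1)   # stand-in for the distributed counter


def generate_short_code() -> str:
    # Placeholder; the real design converts a distributed counter value to base62.
    return format(next(_counter), "x")


@app.post("/urls")
def create_short_url():
    long_url = request.get_json()["long_url"]
    short_code = generate_short_code()
    url_store[short_code] = long_url
    return jsonify({"short_url": f"https://tiny.example/{short_code}"}), 201


@app.get("/<short_code>")
def redirect_to_long_url(short_code: str):
    long_url = url_store.get(short_code)
    if long_url is None:
        abort(404)
    # A 302 keeps every click on our servers so analytics can be recorded (see below).
    return redirect(long_url, code=302)
```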
Database Schema
Our database schema will be simple, with a single table to store the mapping between the short code and the original long URL. We'll also include fields for analytics and link expiration.
The `urls` table will be the core of our system, with the `short_code` as the primary key for fast lookups. We'll also include a `long_url` field to store the original URL, a `created_at` timestamp, and an optional `expires_at` timestamp for links that should expire.
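As an illustration, the schema described above could be expressed as a SQLAlchemy model along these lines; the column lengths and types are assumptions.

```python
from sqlalchemy import Column, DateTime, String, Text
from sqlalchemy.orm import declarative_base

Base = declarative_base()


class Url(Base):
    __tablename__ = "urls"

    short_code = Column(String(10), primary_key=True)  # primary key for fast lookups
    long_url = Column(Text, nullable=False)             # the original URL
    created_at = Column(DateTime, nullable=False)
    expires_at = Column(DateTime, nullable=True)         # optional link expiration
```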
High-Level Architecture
Deep Dive: Hash Generation
The core of our URL shortener is the hash generation strategy. We need a way to generate a unique, short code for each long URL.
Base62 Encoding + Counter
Why: Base62 encoding is a great choice for generating short codes because it's simple, efficient, and produces human-readable, URL-safe strings. It uses a 62-character set (`a-z`, `A-Z`, `0-9`) to represent a large number in a compact format.
How it works: We can use a distributed counter to generate a unique, incrementing integer for each new URL. We then convert this integer to a base62 string. For example, with the alphabet ordered `a-z`, `A-Z`, `0-9`, the integer `1000` would be `qi` in base62. This approach guarantees that each short code is unique and that we can generate a massive number of them.
Distributed Counter Implementation: A distributed counter is essential for ensuring that we don't generate the same ID for two different URLs. Here are a few ways we could implement this:
- Redis: We can use Redis's `INCR` command, which is an atomic operation that increments a counter and returns the new value. This is very fast and simple to implement. To ensure persistence, we can use Redis's AOF (Append-Only File) or RDB (Redis Database) snapshotting features. To make it distributed and safe, we can use Redis Sentinel for high availability or Redis Cluster for sharding the counter across multiple nodes (see the sketch after this list).
- Zookeeper: Zookeeper is a distributed coordination service that can be used to implement a reliable distributed counter. It provides strong consistency guarantees, but it's more complex to set up and manage than Redis.
- Database Sequence: Most relational databases provide a way to generate unique, incrementing numbers using sequences. This is a simple and reliable option, but it can be a bottleneck if not designed carefully.
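As a concrete example of the Redis option, here is a sketch using the redis-py client. The key name and connection details are placeholders, and it reuses the base62 encoder from the earlier sketch.

```python
import redis

r = redis.Redis(host="localhost", port=6379)


def next_short_code() -> str:
    # INCR is atomic, so concurrent app servers never receive the same ID.
    new_id = r.incr("url:counter")
    return base62_encode(new_id)  # base62 encoder from the previous sketch
```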
Trade-offs:
- Pros: Guarantees uniqueness, simple to implement, and the resulting short codes are a predictable length.
- Cons: Requires a centralized counter, which can be a single point of failure if not designed carefully. The choice of counter implementation (Redis, Zookeeper, or a database) will depend on the specific requirements of the system in terms of performance, scalability, and operational overhead.
MD5 Hash
Why: Using a cryptographic hash function like MD5 is another popular approach. It takes the long URL as input and produces a 128-bit hash. We can then take the first 6-8 characters of the hash as our short code.
How it works: When a user submits a long URL, we compute its MD5 hash and take the first 7 characters. We then check if this short code already exists in our database. If it does, we can take the next 7 characters, or apply a different hashing strategy, until we find a unique code.
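A sketch of this collision-handling loop, assuming a hypothetical `exists()` check against the database:

```python
import hashlib


def md5_short_code(long_url: str, exists) -> str:
    # Hash the long URL and take a 7-character window of the hex digest.
    digest = hashlib.md5(long_url.encode()).hexdigest()  # 32 hex characters
    for start in range(0, len(digest) - 7):
        candidate = digest[start:start + 7]
        if not exists(candidate):  # exists() is a stand-in for a database lookup
            return candidate
    # Every window collided (extremely unlikely): re-hash with a salt appended.
    return md5_short_code(long_url + "#", exists)
```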
Trade-offs:
- Pros: Stateless and doesn't require a centralized counter. The same long URL will always produce the same hash, which can be useful for preventing duplicate entries.
- Cons: Hash collisions are possible, so we need to have a strategy for handling them. The resulting short codes are not guaranteed to be unique, so we need to check for existence in the database before using them.
Deep Dive: Database vs. Key-Value Store
The choice of data store is another critical decision. We need a solution that can handle a high volume of reads and writes with low latency.
Relational Database (e.g., PostgreSQL)
Why: A traditional relational database is a solid choice for a URL shortener, especially at a smaller scale. It's easy to set up, provides strong consistency guarantees, and is familiar to most developers.
How it works: We would have a single `urls` table with the `short_code` as the primary key. We can add an index to the `long_url` column to quickly check for duplicates.
Trade-offs:
- Pros: Strong consistency, easy to set up and manage.
- Cons: Can be a bottleneck at very high traffic volumes. Scaling a relational database horizontally can be complex.
Key-Value Store (e.g., Redis, DynamoDB)
Why: A key-value store is a better choice for a large-scale URL shortener. These systems are designed for high-throughput reads and writes and can be easily scaled horizontally.
How it works: We would use the `short_code` as the key and the `long_url` as the value. This allows for extremely fast lookups. We can also use a separate table or a secondary index to store the mapping from `long_url` to `short_code` to prevent duplicate entries.
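For illustration, the read and write paths against a key-value store might look like the following sketch using redis-py; DynamoDB or another store would follow the same pattern with different client calls, and the key prefixes here are assumptions.

```python
import redis

kv = redis.Redis(host="localhost", port=6379)


def save_mapping(short_code: str, long_url: str) -> None:
    kv.set(f"url:{short_code}", long_url)  # short_code -> long_url for redirects
    kv.set(f"rev:{long_url}", short_code)  # reverse mapping to detect duplicates


def lookup(short_code: str) -> str | None:
    value = kv.get(f"url:{short_code}")
    return value.decode() if value else None
```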
Trade-offs:
- Pros: Highly scalable, low-latency reads and writes.
- Cons: Weaker consistency guarantees than a relational database. May require more operational overhead to manage.
301 vs. 302 Redirects
When a user accesses a short URL, we need to redirect them to the original long URL. There are two types of redirects we can use:
- 301 (Permanent) Redirect: This tells the browser that the redirect is permanent. The browser will cache the redirect and subsequent requests for the short URL will not hit our servers. This is good for performance, but it means we can't collect analytics on subsequent clicks.
- 302 (Temporary) Redirect: This tells the browser that the redirect is temporary. The browser will not cache the redirect, so every request for the short URL will hit our servers. This allows us to collect analytics on every click, but it comes at the cost of higher server load.
For a URL shortener, a 302 redirect is almost always the better choice, as it allows us to collect valuable analytics data.
Deep Dive: Analytics
A key feature of any URL shortener is the ability to track analytics for each link. This allows users to see how many people are clicking on their links, where they are coming from, and what devices they are using. To do this, we need to capture information about each click and process it in a scalable way.
Real-time Analytics with Kafka and Druid
Why: This approach provides a robust, real-time analytics pipeline that can handle a massive volume of clicks and provide low-latency queries.
How it works:
- Capture: When a user clicks on a short link, our redirect service is responsible for two things: performing the 302 redirect to the original long URL, and asynchronously sending a detailed event message to a Kafka topic. This message is a JSON payload containing the `short_code`, `timestamp`, `user_agent`, `ip_address`, and `referrer`. Using Kafka decouples the analytics processing from the user-facing redirect, ensuring that the redirect is as fast as possible (see the producer sketch after this list).
- Process with Flink: An Apache Flink job consumes the raw click events from the Kafka topic. The Flink pipeline can then perform several transformations in real-time:
  - Enrichment: It can enrich the data by, for example, performing a geo-lookup on the IP address to get the user's country and city, or parsing the user agent string to identify the browser and device type.
  - Aggregation: Flink can perform windowed aggregations on the stream of events. For example, we can use a tumbling window of 1 minute to count the number of clicks for each short code, and also aggregate by country, city, or device type.
- Store and Query: The aggregated data from Flink is then written to a real-time analytics database like Apache Druid. Druid is a columnar store that is optimized for OLAP (Online Analytical Processing) queries and can provide low-latency results for complex analytical queries. We can then build a dashboard on top of Druid to visualize the analytics data, allowing users to see click counts, geographic distribution, and other metrics in near real-time.
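As a sketch of the capture step, the redirect service could publish click events with the kafka-python client; the topic name and broker address here are assumptions.

```python
import json
import time

from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda event: json.dumps(event).encode("utf-8"),
)


def record_click(short_code: str, user_agent: str, ip_address: str, referrer: str) -> None:
    event = {
        "short_code": short_code,
        "timestamp": int(time.time() * 1000),
        "user_agent": user_agent,
        "ip_address": ip_address,
        "referrer": referrer,
    }
    # send() is asynchronous, so the redirect path is not blocked on the broker.
    producer.send("click-events", event)
```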
Trade-offs:
- Pros: Highly scalable, real-time analytics, and can handle complex queries.
- Cons: The most complex solution to set up and maintain, with multiple moving parts.
Batch Processing with S3 and Athena
Why: This is a simpler, more cost-effective approach that is well-suited for use cases where real-time analytics are not a strict requirement.
How it works:
- Capture: Similar to the real-time approach, the redirect service sends a message to a Kafka topic.
- Store: A simple Kafka consumer reads the messages and writes them to a durable, long-term storage solution like Amazon S3. The data can be stored in a columnar format like Parquet for efficient querying.
- Query: We can use a serverless query engine like Amazon Athena to run ad-hoc SQL queries on the data in S3. This allows us to perform complex analytical queries without having to manage a separate database.
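For example, an ad-hoc Athena query could be issued with boto3 along these lines; the database, table, and result bucket names are placeholders.

```python
import boto3

athena = boto3.client("athena", region_name="us-east-1")

# Top 10 short codes by clicks for a given day, assuming a partitioned "clicks" table.
response = athena.start_query_execution(
    QueryString="""
        SELECT short_code, COUNT(*) AS clicks
        FROM clicks
        WHERE dt = '2024-01-01'
        GROUP BY short_code
        ORDER BY clicks DESC
        LIMIT 10
    """,
    QueryExecutionContext={"Database": "url_shortener"},
    ResultConfiguration={"OutputLocation": "s3://example-athena-results/"},
)
query_id = response["QueryExecutionId"]  # poll get_query_execution() for completion
```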
Trade-offs:
- Pros: Simpler and more cost-effective than a real-time pipeline. Highly scalable and durable.
- Cons: Higher latency for queries. Not suitable for real-time dashboards.
Complete Design
Here is the complete system design, bringing together all the components we've discussed. The client interacts with a load balancer, which distributes requests to a set of web servers. The web servers handle the logic for creating short URLs and redirecting users. A distributed counter service generates unique IDs, and a key-value store holds the mapping between short codes and long URLs. For analytics, click data is sent to a messaging queue and processed by a stream processing system, with the results stored in a data warehouse for querying.