Design Top K Songs on Spotify
Spotify Top K Songs Requirements
Functional Requirements
Track Song Plays
The system must be able to ingest a high-throughput stream of song play events.
Top K Queries
Users must be able to query the top K most played songs.
Time Windows
The Top K query must support different time windows (e.g., last hour, last day, last week, all time).
Filtering
The Top K query should optionally support filtering by dimensions like genre or region.
Accuracy
The results must be highly accurate (near real-time and precise counts).
Non-Functional Requirements
High Availability
The system must be highly available to both ingest events and serve queries. A few minutes of downtime in the query service is acceptable, but the ingestion pipeline must be extremely resilient.
High Scalability
The system must scale to handle billions of events per day and a high volume of concurrent queries.
Low Latency
Top K queries should return results in under 200ms.
Fault Tolerance
The system should be resilient to failures in individual components.
Durability
No data should be lost in the event ingestion pipeline.
CAP Theorem Trade-offs
Trade-off Explanation:
For a real-time analytics platform like this, we prioritize Availability and Partition Tolerance (AP). It's more important to keep accepting song play data and serving potentially slightly stale Top-K results than to have the system become unavailable.
Scale Estimates
API Design
- Get top K songs for a specific time period
- Record a song play event
The API for our system needs to be simple and efficient, serving two primary functions: ingesting song play events and serving Top K queries. The ingestion endpoint will be a high-throughput endpoint that takes in song play events from clients. It's designed to be lightweight and asynchronous, simply forwarding the events to our Kafka pipeline without performing any heavy processing itself.
The query endpoint will allow users to retrieve the Top K songs for various time periods. It will be a GET request with parameters for the time window (e.g., 1h, 1d) and the value of K. This endpoint will query our low-latency data store (Redis) to provide near real-time results. We'll also include optional parameters for filtering by genre or region to allow for more granular queries.
Database Schema
Our database schema is designed to support both the real-time ingestion of song play events and the efficient querying of song metadata and aggregated Top K results. We'll use a combination of tables to store raw events, song metadata, and the pre-computed Top K results for different time windows.
The song_views table will store the raw play events, partitioned by song_id and sorted by timestamp to allow for efficient time-based queries. The songs table will store metadata for each song, such as title, artist, and genre. Finally, the top_k_cache table will store the pre-computed Top K results, with a composite key of the time period and rank, allowing for extremely fast lookups. This denormalized approach is crucial for meeting our low-latency query requirements.
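As a rough sketch, the row shapes might look like the following; the column names, and the choice of song_id as the partition key with timestamp as the clustering column for song_views, are illustrative assumptions.

```java
import java.time.Instant;

// Illustrative row shapes only; column names and key choices are assumptions.
public class TopKSchema {
    // song_views: raw play events, partitioned by songId and clustered by timestamp.
    record SongView(String songId, Instant timestamp, String userId, String region) {}

    // songs: static metadata per song.
    record Song(String songId, String title, String artist, String genre) {}

    // top_k_cache: pre-computed results keyed by (windowId, rank) for fast lookups.
    record TopKEntry(String windowId, int rank, String songId, long playCount) {}
}
```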
High-Level Architecture
Deep Dive: Core Logic
Here are a couple of ways we can implement the core logic for tracking the Top-K songs.
Apache Flink Streaming
This is a robust, scalable, and precise approach. We can use Apache Flink to build a streaming pipeline that processes song view events in real-time.
Architecture: We'll implement a Kappa Architecture. A stream of events from Kafka is fed into Flink. Flink processes these events in windows (e.g., 1 hour, 1 day) and aggregates the counts for each song.
Aggregation: Flink's keyBy() operator partitions the stream by songId. We can then apply a windowing function (like TumblingEventTimeWindows) to group events into time buckets. Inside the window, a ProcessWindowFunction can count occurrences of each song.
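To make the pipeline concrete, here's a condensed sketch using the DataStream API. The Kafka source is replaced with fromElements and a minimal PlayEvent class so the example is self-contained; in production you'd plug in a KafkaSource and a real sink.

```java
import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.api.common.functions.AggregateFunction;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.windowing.ProcessWindowFunction;
import org.apache.flink.streaming.api.windowing.assigners.TumblingEventTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;
import org.apache.flink.streaming.api.windowing.windows.TimeWindow;
import org.apache.flink.util.Collector;

// Sketch only: fromElements stands in for a Kafka source to keep the example runnable.
public class TopKJob {
    public static class PlayEvent {
        public String songId;
        public long timestampMillis;
        public PlayEvent() {}
        public PlayEvent(String songId, long timestampMillis) {
            this.songId = songId;
            this.timestampMillis = timestampMillis;
        }
    }

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        DataStream<PlayEvent> plays = env
                .fromElements(new PlayEvent("songA", 1_000L),
                              new PlayEvent("songB", 2_000L),
                              new PlayEvent("songA", 3_000L))
                .assignTimestampsAndWatermarks(
                        WatermarkStrategy.<PlayEvent>forMonotonousTimestamps()
                                .withTimestampAssigner((e, ts) -> e.timestampMillis));

        // Key by songId, count plays per one-hour tumbling window, emit (songId, count).
        plays.keyBy(e -> e.songId)
             .window(TumblingEventTimeWindows.of(Time.hours(1)))
             .aggregate(new CountAgg(), new AttachKey())
             .print(); // replace with a Redis/Elasticsearch/file sink

        env.execute("top-k-songs");
    }

    // Incremental counter so the window never buffers individual events.
    public static class CountAgg implements AggregateFunction<PlayEvent, Long, Long> {
        public Long createAccumulator() { return 0L; }
        public Long add(PlayEvent e, Long acc) { return acc + 1; }
        public Long getResult(Long acc) { return acc; }
        public Long merge(Long a, Long b) { return a + b; }
    }

    // Attaches the window key (songId) to the final count.
    public static class AttachKey
            extends ProcessWindowFunction<Long, Tuple2<String, Long>, String, TimeWindow> {
        public void process(String songId, Context ctx, Iterable<Long> counts,
                            Collector<Tuple2<String, Long>> out) {
            out.collect(Tuple2.of(songId, counts.iterator().next()));
        }
    }
}
```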
Top-K Tracking: To efficiently track the top K items within each Flink operator instance, we can use a local min-heap of size K. For each new song count, we compare it to the minimum element in the heap. If it's larger, we remove the minimum and insert the new element. This keeps the memory footprint low.
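Here's the bounded min-heap idea in isolation (outside Flink for clarity), applied to a map of final per-song counts; the input shape is an assumption:

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.PriorityQueue;

// Sketch: keep only the K largest (songId, count) pairs using a size-K min-heap.
public class TopKHeap {
    record SongCount(String songId, long count) {}

    public static List<SongCount> topK(Map<String, Long> counts, int k) {
        PriorityQueue<SongCount> heap =
                new PriorityQueue<>(Comparator.comparingLong(SongCount::count));
        for (Map.Entry<String, Long> e : counts.entrySet()) {
            heap.offer(new SongCount(e.getKey(), e.getValue()));
            if (heap.size() > k) {
                heap.poll(); // evict the current minimum, keeping the heap at size K
            }
        }
        List<SongCount> result = new ArrayList<>(heap);
        result.sort(Comparator.comparingLong(SongCount::count).reversed());
        return result;
    }

    public static void main(String[] args) {
        // Expected order: b (12 plays), then c (7 plays).
        System.out.println(topK(Map.of("a", 5L, "b", 12L, "c", 7L, "d", 3L), 2));
    }
}
```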
State Management: Flink's managed state will store the song counts. For durability and fault tolerance, Flink will periodically snapshot its state to a distributed file system like HDFS or S3.
Lambda vs. Kappa: While a Lambda architecture could use batch processing to reconcile data, it adds complexity and isn't ideal for our real-time requirement. The Kappa architecture, with a single streaming pipeline, is a better fit here.
Count-Min Sketch
This is a probabilistic approach that uses a compact data structure to estimate item frequencies. It's less precise than Flink but uses significantly less memory. It's worth noting that this approach has been popularized by other system design courses, and as a result, many interviewers may expect you to be familiar with it. While it may not be the best option for a production system requiring high accuracy, it's a good topic to have in your toolkit for interviews.
How it works: A Count-Min Sketch is a 2D array of counters. For each incoming song view, we hash the songId with multiple hash functions, each corresponding to a row in the array. We increment the counter at the hashed index in each row.
Querying: To get the estimated count for a song, we hash its songId with the same hash functions and take the minimum of the counter values at the resulting indices. This minimum value is the estimated frequency.
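Here's a minimal implementation; the depth and width parameters and the hashing scheme are illustrative choices rather than tuned values:

```java
import java.util.Random;

// Minimal Count-Min Sketch: `depth` hash functions over a table of `width` counters.
public class CountMinSketch {
    private final int depth;
    private final int width;
    private final long[][] table;
    private final int[] seeds;

    public CountMinSketch(int depth, int width) {
        this.depth = depth;
        this.width = width;
        this.table = new long[depth][width];
        this.seeds = new Random(42).ints(depth).toArray(); // one seed per row
    }

    private int index(String item, int row) {
        int h = item.hashCode() ^ seeds[row];        // mix in the per-row seed
        return Math.floorMod(h * 0x9E3779B1, width); // spread and map into [0, width)
    }

    public void add(String item) {
        for (int row = 0; row < depth; row++) {
            table[row][index(item, row)]++;
        }
    }

    public long estimate(String item) {
        long min = Long.MAX_VALUE;
        for (int row = 0; row < depth; row++) {
            min = Math.min(min, table[row][index(item, row)]);
        }
        return min; // may overestimate due to collisions, never underestimates
    }

    public static void main(String[] args) {
        CountMinSketch cms = new CountMinSketch(4, 1 << 16);
        for (int i = 0; i < 1000; i++) cms.add("songA");
        cms.add("songB");
        System.out.println(cms.estimate("songA")); // ~1000
    }
}
```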
Trade-offs: This approach is very memory-efficient and fast, but it can overestimate counts due to hash collisions. It's a good choice when some inaccuracy is acceptable. However, for a precise system like Spotify's top charts, a deterministic approach like Flink is likely necessary.
Deep Dive: Sink Options
Once Flink has processed the data, we need to sink the results somewhere for fast querying.
Sink to Redis
Why: Redis is an excellent choice for serving real-time Top-K queries due to its in-memory nature and specialized data structures. The Sorted Set is particularly well-suited for this use case, as it maintains a collection of items ordered by a score.
How it works: We can use the Flink Redis Connector to sink our aggregated song counts into a Redis Sorted Set. The songId would be the member, and the view_count would be the score. To get the top K songs, we can simply use the ZREVRANGE command, which returns the top K items in O(log(N) + K) time, where N is the number of members in the set.
Trade-offs:
- Pros: Extremely fast reads and writes, perfect for low-latency Top-K queries.
- Cons: Redis is an in-memory store, so it can be expensive at very large scales. While Redis has persistence options, it's typically used as a cache or for data that can be rebuilt from a source of truth.
Sink to Elasticsearch
Why: Elasticsearch is a powerful search and analytics engine that can be a great sink for our Top-K data, especially if we need to expose it through a more complex search API or an internal analytics dashboard.
How it works: We can use the Flink Elasticsearch Connector to index our song count data. Each document would contain the songId, view_count, timestamp, and any other relevant metadata like genre or region. We can then use Elasticsearch's powerful aggregation and sorting capabilities to query the top K songs.
Trade-offs:
- Pros: Highly flexible querying, full-text search capabilities, great for building rich UIs and dashboards.
- Cons: Higher operational complexity than Redis. While fast, it may not match the raw speed of Redis for simple Top-K queries.
Sink to a Columnar Database
Why: If the business requirements extend beyond simple Top-K queries to more complex, ad-hoc analytical queries, a columnar database like Apache Pinot, ClickHouse, or Druid is the best choice.
How it works: These databases are optimized for OLAP (Online Analytical Processing) workloads. We would sink our Flink data into one of these systems, and they would handle the aggregation and querying. They are designed to be highly scalable and can handle massive datasets with ease.
Trade-offs:
- Pros: The best option for large-scale, complex analytical queries. Highly scalable and performant for the right use case.
- Cons: The most complex solution to set up and maintain. Overkill if the only requirement is to serve simple Top-K queries.
File or S3 Sink for Snapshots
In addition to a real-time sink, it's a good practice to also sink snapshots of the data to a durable, long-term storage solution like Amazon S3.
- Use Case: This is useful for offline analytics, ad-hoc querying with tools like Athena or Spark, and for backup/archival purposes.
- Implementation: Use Flink's FileSink or StreamingFileSink to export snapshots of the Top-K results every minute or hour. The data can be stored in a format like Parquet or Avro for efficient querying.
Bot/Fraudulent Traffic Filtering
A crucial aspect of a view counting system is to filter out inorganic traffic. This can be done at various stages:
- Upstream: A dedicated service can pre-filter events before they even reach Kafka.
- In Flink: We can add a filter() step in our Flink pipeline to discard events based on certain heuristics (e.g., too many views from a single IP address in a short time, unusual user agent strings), as sketched below.
- Post-processing: We can have a separate batch job that analyzes historical data to identify fraudulent patterns and retroactively adjust the counts.
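Here's what a simple in-Flink heuristic filter might look like; the threshold, bot list, and event fields are all illustrative assumptions:

```java
import java.util.Set;

// Sketch of crude bot heuristics; thresholds and field names are illustrative.
public class BotFilter {
    private static final Set<String> KNOWN_BOT_AGENTS = Set.of("curl", "python-requests");

    // Returns true for events we keep; plugged in as plays.filter(BotFilter::isOrganic).
    public static boolean isOrganic(PlayEvent e) {
        return e.playDurationMs >= 30_000               // ignore sub-30-second plays
                && !KNOWN_BOT_AGENTS.contains(e.userAgent);
    }

    // Minimal event shape assumed for this sketch.
    public static class PlayEvent {
        public long playDurationMs;
        public String userAgent;
    }
}
```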
Top-K by Different Dimensions
The problem asks for Top-K songs, but we might also want to get Top-K by genre, location, or user. We can achieve this by modifying our Flink job. Instead of just keying by songId, we can create a composite key like (genre, songId) or (location, songId). This will give us a separate Top-K list for each dimension.
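In the Flink job this is just a different key selector. A sketch, reusing the CountAgg and AttachKey classes from the pipeline sketch above and assuming the event carries a genre field:

```java
// Sketch: key by (genre, songId) instead of songId alone, so each genre
// accumulates its own per-song counts and therefore its own Top-K list.
plays.keyBy(e -> e.genre + "|" + e.songId)
     .window(TumblingEventTimeWindows.of(Time.hours(1)))
     .aggregate(new CountAgg(), new AttachKey());
```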
Complete Design
Here is the complete, end-to-end design of our system for tracking the top K songs on Spotify. This diagram illustrates how data flows from the client, through our ingestion and processing pipeline, and finally to the various data stores that serve our API. It brings together the concepts of real-time stream processing with Flink, the use of Redis for fast caching, and the option to sink data to other systems for further analysis.