Design Top K Songs on Spotify
Spotify Top K Songs Requirements
Functional Requirements
Track Song Plays
The system must be able to ingest a high-throughput stream of song play events.
Top K Queries
Users must be able to query the top K most played songs.
Time Windows
The Top K query must support different time windows (e.g., last hour, last day, last week, all time).
Filtering
The Top K query should optionally support filtering by dimensions like genre or region.
Accuracy
The results must be highly accurate (near real-time and precise counts).
Non-Functional Requirements
High Availability
The system must be highly available to both ingest events and serve queries. A few minutes of downtime in the query service is acceptable, but the ingestion pipeline must be extremely resilient.
High Scalability
The system must scale to handle billions of events per day and a high volume of concurrent queries.
Low Latency
Top K queries should return results in under 200ms.
Fault Tolerance
The system should be resilient to failures in individual components.
Durability
No data should be lost in the event ingestion pipeline.
CAP Theorem Trade-offs
Trade-off Explanation:
For a real-time analytics platform like this, we prioritize Availability and Partition Tolerance (AP). It's more important to keep accepting song play data and serving potentially slightly stale Top-K results than to have the system become unavailable.
Scale Estimates
API Design
- Get top K songs for a specific time period
- Record a song play event
The API for our system needs to be simple and efficient, serving two primary functions: ingesting song play events and serving Top K queries. The ingestion endpoint will be a high-throughput endpoint that takes in song play events from clients. It's designed to be lightweight and asynchronous, simply forwarding the events to our Kafka pipeline without performing any heavy processing itself.
The query endpoint will allow users to retrieve the Top K songs for various time periods. It will be a GET request with parameters for the time window (e.g., 1h, 1d) and the value of K. This endpoint will query our low-latency data store (Redis) to provide near real-time results. We'll also include optional parameters for filtering by genre or region to allow for more granular queries.
Database Schema
Our database schema is designed to support both the real-time ingestion of song play events and the efficient querying of song metadata and aggregated Top K results. We'll use a combination of tables to store raw events, song metadata, and the pre-computed Top K results for different time windows.
The song_views table will store the raw play events, partitioned by song_id and sorted by timestamp to allow for efficient time-based queries. The songs table will store metadata for each song, such as title, artist, and genre. Finally, the top_k_cache table will store the pre-computed Top K results, with a composite key of the time period and rank, allowing for extremely fast lookups. This denormalized approach is crucial for meeting our low-latency query requirements.
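As a rough sketch, the row shapes might look like the following; the column names, and the choice of song_id as the partition key with timestamp as the clustering column for song_views, are illustrative assumptions.

```java
import java.time.Instant;

// Illustrative row shapes only; column names and key choices are assumptions.
public class TopKSchema {
    // song_views: raw play events, partitioned by songId and clustered by timestamp.
    record SongView(String songId, Instant timestamp, String userId, String region) {}

    // songs: static metadata per song.
    record Song(String songId, String title, String artist, String genre) {}

    // top_k_cache: pre-computed results keyed by (windowId, rank) for fast lookups.
    record TopKEntry(String windowId, int rank, String songId, long playCount) {}
}
```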
High-Level Architecture
Deep Dive: Core Logic
Here are a couple of ways we can implement the core logic for tracking the Top-K songs.
Apache Flink Streaming
This is a robust, scalable, and precise approach. We can use Apache Flink to build a streaming pipeline that processes song view events in real-time.
Architecture: We'll implement a Kappa Architecture. A stream of events from Kafka is fed into Flink. Flink processes these events in windows (e.g., 1 hour, 1 day) and aggregates the counts for each song.
Aggregation: Flink's keyBy() operator partitions the stream by songId. We can then apply a windowing function (like TumblingEventTimeWindows) to group events into time buckets. Inside the window, a ProcessWindowFunction can count occurrences of each song.
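To make the pipeline concrete, here's a condensed sketch using the DataStream API. The Kafka source is replaced with fromElements and a minimal PlayEvent class so the example is self-contained; in production you'd plug in a KafkaSource and a real sink.

```java
import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.api.common.functions.AggregateFunction;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.windowing.ProcessWindowFunction;
import org.apache.flink.streaming.api.windowing.assigners.TumblingEventTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;
import org.apache.flink.streaming.api.windowing.windows.TimeWindow;
import org.apache.flink.util.Collector;

// Sketch only: fromElements stands in for a Kafka source to keep the example runnable.
public class TopKJob {
    public static class PlayEvent {
        public String songId;
        public long timestampMillis;
        public PlayEvent() {}
        public PlayEvent(String songId, long timestampMillis) {
            this.songId = songId;
            this.timestampMillis = timestampMillis;
        }
    }

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        DataStream<PlayEvent> plays = env
                .fromElements(new PlayEvent("songA", 1_000L),
                              new PlayEvent("songB", 2_000L),
                              new PlayEvent("songA", 3_000L))
                .assignTimestampsAndWatermarks(
                        WatermarkStrategy.<PlayEvent>forMonotonousTimestamps()
                                .withTimestampAssigner((e, ts) -> e.timestampMillis));

        // Key by songId, count plays per one-hour tumbling window, emit (songId, count).
        plays.keyBy(e -> e.songId)
             .window(TumblingEventTimeWindows.of(Time.hours(1)))
             .aggregate(new CountAgg(), new AttachKey())
             .print(); // replace with a Redis/Elasticsearch/file sink

        env.execute("top-k-songs");
    }

    // Incremental counter so the window never buffers individual events.
    public static class CountAgg implements AggregateFunction<PlayEvent, Long, Long> {
        public Long createAccumulator() { return 0L; }
        public Long add(PlayEvent e, Long acc) { return acc + 1; }
        public Long getResult(Long acc) { return acc; }
        public Long merge(Long a, Long b) { return a + b; }
    }

    // Attaches the window key (songId) to the final count.
    public static class AttachKey
            extends ProcessWindowFunction<Long, Tuple2<String, Long>, String, TimeWindow> {
        public void process(String songId, Context ctx, Iterable<Long> counts,
                            Collector<Tuple2<String, Long>> out) {
            out.collect(Tuple2.of(songId, counts.iterator().next()));
        }
    }
}
```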
Top-K Tracking: To efficiently track the top K items within each Flink operator instance, we can use a local min-heap of size K. For each new song count, we compare it to the minimum element in the heap. If it's larger, we remove the minimum and insert the new element. This keeps the memory footprint low.
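Here's the bounded min-heap idea in isolation (outside Flink for clarity), applied to a map of final per-song counts; the input shape is an assumption:

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.PriorityQueue;

// Sketch: keep only the K largest (songId, count) pairs using a size-K min-heap.
public class TopKHeap {
    record SongCount(String songId, long count) {}

    public static List<SongCount> topK(Map<String, Long> counts, int k) {
        PriorityQueue<SongCount> heap =
                new PriorityQueue<>(Comparator.comparingLong(SongCount::count));
        for (Map.Entry<String, Long> e : counts.entrySet()) {
            heap.offer(new SongCount(e.getKey(), e.getValue()));
            if (heap.size() > k) {
                heap.poll(); // evict the current minimum, keeping the heap at size K
            }
        }
        List<SongCount> result = new ArrayList<>(heap);
        result.sort(Comparator.comparingLong(SongCount::count).reversed());
        return result;
    }

    public static void main(String[] args) {
        // Expected order: b (12 plays), then c (7 plays).
        System.out.println(topK(Map.of("a", 5L, "b", 12L, "c", 7L, "d", 3L), 2));
    }
}
```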
State Management: Flink's managed state will store the song counts. For durability and fault tolerance, Flink will periodically snapshot its state to a distributed file system like HDFS or S3.
Lambda vs. Kappa: While a Lambda architecture could use batch processing to reconcile data, it adds complexity and isn't ideal for our real-time requirement. The Kappa architecture, with a single streaming pipeline, is a better fit here.
Count-Min Sketch
This is a probabilistic approach that uses a compact data structure to estimate item frequencies. It's less precise than Flink but uses significantly less memory. It's worth noting that this approach has been popularized by other system design courses, and as a result, many interviewers may expect you to be familiar with it. While it may not be the best option for a production system requiring high accuracy, it's a good topic to have in your toolkit for interviews.
How it works: A Count-Min Sketch is a 2D array of counters. For each incoming song view, we hash the songId with multiple hash functions, each corresponding to a row in the array. We increment the counter at the hashed index in each row.
Querying: To get the estimated count for a song, we hash its songId with the same hash functions and take the minimum of the counter values at the resulting indices. This minimum value is the estimated frequency.
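Here's a minimal implementation; the depth and width parameters and the hashing scheme are illustrative choices rather than tuned values:

```java
import java.util.Random;

// Minimal Count-Min Sketch: `depth` hash functions over a table of `width` counters.
public class CountMinSketch {
    private final int depth;
    private final int width;
    private final long[][] table;
    private final int[] seeds;

    public CountMinSketch(int depth, int width) {
        this.depth = depth;
        this.width = width;
        this.table = new long[depth][width];
        this.seeds = new Random(42).ints(depth).toArray(); // one seed per row
    }

    private int index(String item, int row) {
        int h = item.hashCode() ^ seeds[row];        // mix in the per-row seed
        return Math.floorMod(h * 0x9E3779B1, width); // spread and map into [0, width)
    }

    public void add(String item) {
        for (int row = 0; row < depth; row++) {
            table[row][index(item, row)]++;
        }
    }

    public long estimate(String item) {
        long min = Long.MAX_VALUE;
        for (int row = 0; row < depth; row++) {
            min = Math.min(min, table[row][index(item, row)]);
        }
        return min; // may overestimate due to collisions, never underestimates
    }

    public static void main(String[] args) {
        CountMinSketch cms = new CountMinSketch(4, 1 << 16);
        for (int i = 0; i < 1000; i++) cms.add("songA");
        cms.add("songB");
        System.out.println(cms.estimate("songA")); // ~1000
    }
}
```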
Trade-offs: This approach is very memory-efficient and fast, but it can overestimate counts due to hash collisions. It's a good choice when some inaccuracy is acceptable. However, for a precise system like Spotify's top charts, a deterministic approach like Flink is likely necessary.
Deep Dive: Sink Options
Once Flink has processed the data, we need to sink the results somewhere for fast querying.
Sink to Redis
Why: Redis is an excellent choice for serving real-time Top-K queries due to its in-memory nature and specialized data structures. The Sorted Set is particularly well-suited for this use case, as it maintains a collection of items ordered by a score.
How it works: We can use the Flink Redis Connector to sink our aggregated song counts into a Redis Sorted Set. The songId would be the member, and the view_count would be the score. To get the top K songs, we can simply use the ZREVRANGE command, which returns the top K items in O(log(N) + K) time, where N is the number of members in the set.
Trade-offs:
- Pros: Extremely fast reads and writes, perfect for low-latency Top-K queries.
- Cons: Redis is an in-memory store, so it can be expensive at very large scales. While Redis has persistence options, it's typically used as a cache or for data that can be rebuilt from a source of truth.
Sink to Elasticsearch
Why: Elasticsearch is a powerful search and analytics engine that can be a great sink for our Top-K data, especially if we need to expose it through a more complex search API or an internal analytics dashboard.
How it works: We can use the Flink Elasticsearch Connector to index our song count data. Each document would contain the songId, view_count, timestamp, and any other relevant metadata like genre or region. We can then use Elasticsearch's powerful aggregation and sorting capabilities to query the top K songs.
Trade-offs:
- Pros: Highly flexible querying, full-text search capabilities, great for building rich UIs and dashboards.
- Cons: Higher operational complexity than Redis. While fast, it may not match the raw speed of Redis for simple Top-K queries.
Sink to a Columnar Database
Why: If the business requirements extend beyond simple Top-K queries to more complex, ad-hoc analytical queries, a columnar database like Apache Pinot, ClickHouse, or Druid is the best choice.
How it works: These databases are optimized for OLAP (Online Analytical Processing) workloads. We would sink our Flink data into one of these systems, and they would handle the aggregation and querying. They are designed to be highly scalable and can handle massive datasets with ease.
Trade-offs:
- Pros: The best option for large-scale, complex analytical queries. Highly scalable and performant for the right use case.
- Cons: The most complex solution to set up and maintain. Overkill if the only requirement is to serve simple Top-K queries.
File or S3 Sink for Snapshots
In addition to a real-time sink, it's a good practice to also sink snapshots of the data to a durable, long-term storage solution like Amazon S3.
- Use Case: This is useful for offline analytics, ad-hoc querying with tools like Athena or Spark, and for backup/archival purposes.
- Implementation: Use Flink's FileSink or StreamingFileSink to export snapshots of the Top-K results every minute or hour. The data can be stored in a format like Parquet or Avro for efficient querying.
Bot/Fraudulent Traffic Filtering
A crucial aspect of a view counting system is to filter out inorganic traffic. This can be done at various stages:
- Upstream: A dedicated service can pre-filter events before they even reach Kafka.
- In Flink: We can add a filter() step in our Flink pipeline to discard events based on certain heuristics (e.g., too many views from a single IP address in a short time, unusual user agent strings), as sketched below.
- Post-processing: We can have a separate batch job that analyzes historical data to identify fraudulent patterns and retroactively adjust the counts.
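Here's what a simple in-Flink heuristic filter might look like; the threshold, bot list, and event fields are all illustrative assumptions:

```java
import java.util.Set;

// Sketch of crude bot heuristics; thresholds and field names are illustrative.
public class BotFilter {
    private static final Set<String> KNOWN_BOT_AGENTS = Set.of("curl", "python-requests");

    // Returns true for events we keep; plugged in as plays.filter(BotFilter::isOrganic).
    public static boolean isOrganic(PlayEvent e) {
        return e.playDurationMs >= 30_000               // ignore sub-30-second plays
                && !KNOWN_BOT_AGENTS.contains(e.userAgent);
    }

    // Minimal event shape assumed for this sketch.
    public static class PlayEvent {
        public long playDurationMs;
        public String userAgent;
    }
}
```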
Top-K by Different Dimensions
The problem asks for Top-K songs, but we might also want to get Top-K by genre, location, or user. We can achieve this by modifying our Flink job. Instead of just keying by songId, we can create a composite key like (genre, songId) or (location, songId). This will give us a separate Top-K list for each dimension.
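In the Flink job this is just a different key selector. A sketch, reusing the CountAgg and AttachKey classes from the pipeline sketch above and assuming the event carries a genre field:

```java
// Sketch: key by (genre, songId) instead of songId alone, so each genre
// accumulates its own per-song counts and therefore its own Top-K list.
plays.keyBy(e -> e.genre + "|" + e.songId)
     .window(TumblingEventTimeWindows.of(Time.hours(1)))
     .aggregate(new CountAgg(), new AttachKey());
```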
Complete Design
Here is the complete, end-to-end design of our system for tracking the top K songs on Spotify. This diagram illustrates how data flows from the client, through our ingestion and processing pipeline, and finally to the various data stores that serve our API. It brings together the concepts of real-time stream processing with Flink, the use of Redis for fast caching, and the option to sink data to other systems for further analysis.