Elasticsearch

Medium

Elasticsearch is a distributed, open-source search and analytics engine for all types of data, including textual, numerical, geospatial, structured, and unstructured.

Variants:

Apache Solr, Algolia, Amazon CloudSearch

What is Elasticsearch?

Elasticsearch is known for its simple REST APIs, distributed nature, speed, and scalability. It's the central component of the Elastic Stack, a set of open-source tools for data ingestion, enrichment, storage, analysis, and visualization. Your interviewer will expect you to understand the core concepts of Elasticsearch and when to use it in a system design.

Core Concepts of Elasticsearch

Documents and Indexes: Elasticsearch is a document-oriented database. Each document is a JSON object made up of fields, the key-value pairs that contain your data. Documents are stored in indexes. An index is a collection of documents with similar characteristics, such as all the products in a catalog.
Clusters, Nodes, and Shards: An Elasticsearch cluster is a collection of one or more nodes (servers) that together hold your entire dataset and provide federated indexing and search capabilities across all nodes. Each index is split into one or more primary shards, and each primary shard can have zero or more replicas.
Inverted Index: Elasticsearch uses a data structure called an inverted index, which is designed to allow for very fast full-text searches. An inverted index lists every unique word that appears in any document and identifies all of the documents each word appears in.
REST API: Elasticsearch provides a simple, coherent REST API for managing your cluster and indexing and searching your data.
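
To make the inverted index idea concrete, here is a minimal sketch in plain Python. It is not how Elasticsearch (or Lucene underneath it) is implemented; the tokenizer, the function names, and the sample documents are all assumptions chosen for illustration. It shows only the core mapping from each term to the set of documents that contain it.

```python
from collections import defaultdict

PUNCTUATION = ".,!?;:"

def tokenize(text):
    # Simplified analyzer (an assumption): lowercase, split on whitespace,
    # and strip basic punctuation from each token.
    tokens = (word.strip(PUNCTUATION).lower() for word in text.split())
    return [t for t in tokens if t]

def build_inverted_index(docs):
    # Map each unique term to the set of document IDs that contain it.
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return index

def search_all_terms(index, query):
    # Return the IDs of documents containing every term in the query.
    terms = tokenize(query)
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result

# Hypothetical sample documents.
docs = {
    1: "Quick brown fox",
    2: "Lazy brown dog",
    3: "Quick red fox jumps",
}
index = build_inverted_index(docs)
print(sorted(search_all_terms(index, "quick fox")))  # [1, 3]
```

A lookup touches only the posting sets for the query terms rather than scanning every document, which is why this structure suits full-text search.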

Indexing and Sharding

Your interviewer will expect you to understand how Elasticsearch stores and organizes data. The core concepts are indexes, shards, and replicas.

Indexes: An index groups documents that have similar characteristics. For example, you could have an index for customer data, another for a product catalog, and another for order data.
Shards: Because Elasticsearch is a distributed search engine, an index is usually split into a number of shards. Each shard is in itself a fully functional and independent "index" that can be hosted on any node in the cluster. By distributing the documents in an index across multiple shards, and distributing those shards across multiple nodes, Elasticsearch spreads storage and query load across the cluster, so capacity grows as nodes are added. Protection against hardware failure comes from keeping replica copies of those shards on other nodes.
Replicas: Elasticsearch allows you to make one or more copies of your index’s shards, which are called replica shards or replicas. Replication provides high availability in case a shard or node fails. It also allows you to scale out your search volume because searches can be executed on all replicas in parallel.
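
As a rough illustration of how shard and replica counts are expressed, the sketch below builds the JSON body you might send when creating an index. `number_of_shards` and `number_of_replicas` are the index settings Elasticsearch uses for this; the helper names and the example values (3 primaries, 1 replica each) are assumptions.

```python
import json

def index_settings(primary_shards, replicas_per_primary):
    # Request body for index creation. Values here are illustrative.
    return {
        "settings": {
            "number_of_shards": primary_shards,
            "number_of_replicas": replicas_per_primary,
        }
    }

def total_shard_copies(primary_shards, replicas_per_primary):
    # Every primary shard exists once, plus one copy per configured replica.
    return primary_shards * (1 + replicas_per_primary)

body = index_settings(3, 1)
print(json.dumps(body, indent=2))
print(total_shard_copies(3, 1))  # 6
```

With 3 primaries and 1 replica each, the cluster holds 6 shard copies in total, and a search can be served by either copy of each shard.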

Primary and Replica Shards

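Each document in your index belongs to exactly one primary shard. The number of primary shards is fixed when the index is created; changing it later means reindexing into a new index or using the split or shrink APIs. The number of replicas, by contrast, can be changed at any time. A replica shard is a copy of a primary shard, and it is never allocated to the same node as its primary.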
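
The reason the primary shard count is fixed becomes clearer with a sketch of hash-based routing: a document's shard is derived from a hash of its routing value (its ID by default) modulo the number of primary shards. The code below is a simplified model under stated assumptions. It uses `zlib.crc32` as a stand-in hash, which is not the hash function Elasticsearch uses, and the document IDs are made up.

```python
import zlib

def route_to_shard(routing_key, num_primary_shards):
    # Stand-in hash (assumption): crc32 is used only for illustration.
    return zlib.crc32(routing_key.encode("utf-8")) % num_primary_shards

# Hypothetical document IDs.
doc_ids = [f"doc-{i}" for i in range(1000)]

# Shard assignment with 5 primaries, then with 6.
before = {d: route_to_shard(d, 5) for d in doc_ids}
after = {d: route_to_shard(d, 6) for d in doc_ids}

moved = sum(1 for d in doc_ids if before[d] != after[d])
print(f"{moved} of {len(doc_ids)} documents would map to a different shard")
```

Because changing the modulus remaps most documents, lookups by ID would go to the wrong shard unless the data were redistributed, which is what a reindex does.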

Querying and Aggregations

Elasticsearch provides a powerful, JSON-based Domain Specific Language (DSL) for querying your data. Your interviewer will want to know that you have a basic understanding of how to query Elasticsearch.

Query DSL: The Query DSL is a flexible, expressive query language that allows you to build complex queries. It supports a wide variety of query types, including full-text queries, term-level queries, and geospatial queries.
Aggregations: The aggregations framework produces summary data over the documents that match a search query. It is built from simple building blocks, each called an aggregation, which can be composed to build complex summaries of the data. An aggregation can be seen as a unit of work that builds analytic information over a set of documents.
Relevance Scoring: By default, Elasticsearch sorts matching results by their relevance score, which measures how well each document matches a query. The relevance score is a positive floating-point number, returned in the _score metadata field of the search results. The higher the _score, the more relevant the document.
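
The sketch below shows roughly what a Query DSL request body looks like when it combines a full-text query, a term-level filter, and two aggregations. The query and aggregation types (`bool`, `match`, `term`, `terms`, `avg`) are standard ones; the field names (`description`, `category`, `brand`, `price`) and the function name are hypothetical and would depend on your own mapping.

```python
import json

def build_search_body(text, category, size=10):
    return {
        "size": size,
        "query": {
            "bool": {
                # Full-text query: analyzed and scored for relevance.
                "must": [{"match": {"description": text}}],
                # Term-level query in filter context: exact match, no scoring.
                "filter": [{"term": {"category": category}}],
            }
        },
        "aggs": {
            # Bucket aggregation: document counts per brand.
            "by_brand": {"terms": {"field": "brand"}},
            # Metric aggregation: average price across matching documents.
            "avg_price": {"avg": {"field": "price"}},
        },
    }

body = build_search_body("wireless headphones", "electronics")
print(json.dumps(body, indent=2))
```

Clauses under `must` contribute to `_score`, while clauses under `filter` only include or exclude documents, so exact-match conditions are usually placed in `filter`.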

How to Use Elasticsearch in a System Design Interview

When you're in a system design interview, you should be able to articulate why you would choose Elasticsearch over other search solutions and how you would use it in your architecture.

Here are some key points to mention:

Full-Text Search: If your application requires full-text search with relevance ranking, Elasticsearch is a strong choice because its inverted index and scoring are built for that workload.
Scalability: Elasticsearch is a distributed system that can be scaled out by adding more nodes to the cluster.
Analytics: Elasticsearch is not just for search. It's also a powerful analytics engine that can be used to aggregate and visualize your data.
Data Ingestion: You can use Logstash or Beats to ingest data into Elasticsearch from a variety of sources.
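Trade-offs: While Elasticsearch is a powerful tool, it's not a silver bullet. It's not a good choice for applications that require strong transactional guarantees: it does not offer multi-document ACID transactions, and newly indexed documents become searchable only after a short refresh delay. For that reason it is usually paired with a primary database that remains the source of truth. It also has a steep learning curve and can be complex to operate at scale.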

By discussing these points, you'll demonstrate to your interviewer that you have a solid understanding of Elasticsearch and how to use it to build powerful search and analytics applications.

Example System Design Problems

Here are a few examples of system design problems where you might use Elasticsearch:

Design Yelp: Elasticsearch is a great choice for the search functionality of a Yelp-like service. It can be used to index all the business data and provide fast, relevant search results with support for geospatial queries.
Design Twitter: The search functionality of Twitter could be powered by Elasticsearch. It would allow users to search for tweets, users, and hashtags.
Design an E-Commerce Platform: Elasticsearch could be used to power the product search for an e-commerce platform. It would allow users to search for products by name, category, and other attributes.
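
For the Yelp-style case, a geospatial search might be sketched as a `bool` query that combines a text match with a `geo_distance` filter. This assumes a hypothetical index whose `location` field is mapped as a `geo_point` and whose `name` field is text; the function name, coordinates, and radius are illustrative.

```python
import json

def nearby_business_query(text, lat, lon, radius="2km"):
    return {
        "query": {
            "bool": {
                # Rank businesses by how well their name matches the text.
                "must": [{"match": {"name": text}}],
                # Keep only businesses within the radius of the given point.
                "filter": [
                    {
                        "geo_distance": {
                            "distance": radius,
                            "location": {"lat": lat, "lon": lon},
                        }
                    }
                ],
            }
        }
    }

query = nearby_business_query("pizza", 40.7128, -74.0060)
print(json.dumps(query, indent=2))
```

The same pattern of a scored text clause plus unscored filters carries over to the other examples, for instance filtering products by category or tweets by time range.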