7 Key Steps to Understand How Elasticsearch Works: A Comprehensive Guide

Elasticsearch is a powerful distributed search and analytics engine widely used for its speed, scalability, and ability to handle large volumes of data in near real-time. At its core, it is based on the Lucene library, which is renowned for its text-searching capabilities.

Let’s explore how it works, from its basic components to how it performs searches efficiently.


Core Concepts of Elasticsearch

Before diving into how Elasticsearch works, it’s crucial to understand its core concepts:

  • Document: The smallest unit of data in Elasticsearch. A document is a JSON object made up of fields and their values.
  • Index: An index is a collection of documents that have similar characteristics. For example, if you’re storing data about books, all book records would be stored in the same index.
  • Cluster: A group of one or more nodes (servers) that work together to store data and perform indexing and searching.
  • Node: A single server that holds data and participates in the cluster. Each node stores shards that can belong to one or more indexes.
  • Shard: Indexes are split into smaller pieces called shards. Each shard is essentially a full Lucene index and can be stored on any node in the cluster.
  • Replica: To ensure high availability, each primary shard can have one or more replicas, which are copies of that shard stored on different nodes.
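
To make the shard and replica concepts concrete, here is a minimal Python sketch of the settings body you might send when creating an index. The setting names `number_of_shards` and `number_of_replicas` are real index settings; the helper functions and the choice of 5 primaries with 1 replica each are illustrative assumptions, not something the article prescribes.

```python
import json


def index_settings(primary_shards: int, replicas_per_primary: int) -> dict:
    """Build a create-index request body with explicit shard counts."""
    return {
        "settings": {
            "number_of_shards": primary_shards,
            "number_of_replicas": replicas_per_primary,
        }
    }


def total_shard_copies(primary_shards: int, replicas_per_primary: int) -> int:
    """Every primary shard plus each of its replicas is one shard copy."""
    return primary_shards * (1 + replicas_per_primary)


# Hypothetical example: 5 primary shards, each with 1 replica.
body = index_settings(primary_shards=5, replicas_per_primary=1)
print(json.dumps(body, indent=2))
print(total_shard_copies(5, 1))  # 5 primaries + 5 replicas = 10
```

A body like this would typically be sent with a request such as PUT /books. The number of replicas can be changed later, while the number of primary shards is fixed when the index is created.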

How Elasticsearch Indexes Data

When you ingest data into Elasticsearch, it goes through a series of transformations before it is stored. Here’s the step-by-step process:

  • Document Creation: Elasticsearch accepts data in JSON format. This JSON data represents a document, and it’s sent to a specific index.
  • Indexing Process: Once the document is submitted, Elasticsearch does the following:
    • Tokenization: The text is broken down into individual tokens or words.
    • Inverted Index: It creates an inverted index that maps terms (words) to the documents in which they appear. This structure allows Elasticsearch to locate documents based on search queries quickly.
  • Sharding and Replication: Once the document is tokenized and indexed, it’s stored in exactly one primary shard of the index and copied to that shard’s replicas. Spreading an index across shards lets large datasets be distributed over multiple nodes, which speeds up searches, while the replicas provide fault tolerance.
    For example, an index with 1,000,000 documents could be split into 5 primary shards, each holding roughly 200,000 documents. These shards can be placed on different nodes within the cluster.
  • Indexing: To summarize, when a document is added to Elasticsearch, it is first parsed and analyzed. The document’s content is broken down into tokens, which are then indexed along with their associated metadata. Elasticsearch uses an inverted index structure, where terms are mapped to the documents that contain them. This allows for efficient searching by querying on specific terms.
  • Search: When a search query is executed, Elasticsearch analyzes the query and converts it into a series of terms. It then searches the inverted index to find documents that match the query terms. The search results are ranked based on relevance, which is determined by factors such as term frequency, inverse document frequency, and field length.
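  • Aggregation: Elasticsearch provides powerful aggregation capabilities that allow users to group and summarize data. Aggregations fall into three broad families: bucket aggregations (such as the terms aggregation), which group documents; metric aggregations, which compute values such as sums and averages; and pipeline aggregations, which operate on the output of other aggregations. Aggregations can be used to analyze trends, patterns, and outliers in data.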
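
The tokenization and inverted-index steps above can be sketched in a few lines of Python. This is a toy model, not how Lucene is implemented: the regex tokenizer is a rough stand-in for a real analyzer, the three sample titles are made up for illustration, and the `search` helper simply returns documents that contain any of the query terms.

```python
import re
from collections import defaultdict
from typing import Dict, List, Set


def tokenize(text: str) -> List[str]:
    """Lowercase the text and split it into word tokens (toy analyzer)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_inverted_index(docs: Dict[int, str]) -> Dict[str, Set[int]]:
    """Map each term to the set of document IDs that contain it."""
    index: Dict[str, Set[int]] = defaultdict(set)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return dict(index)


def search(index: Dict[str, Set[int]], query: str) -> Set[int]:
    """Return IDs of documents that contain at least one query term."""
    postings = [index.get(term, set()) for term in tokenize(query)]
    return set().union(*postings)


# Hypothetical sample documents (ID -> title).
docs = {1: "Elasticsearch Basics", 2: "Lucene in Action", 3: "Search Basics"}
index = build_inverted_index(docs)

print(sorted(index["basics"]))                          # [1, 3]
print(sorted(search(index, "elasticsearch basics")))    # [1, 3]
```

Because the lookup goes from term to documents, finding every document that mentions "basics" is a single dictionary access rather than a scan over all documents.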

Querying Elasticsearch

When performing a search in Elasticsearch, the process begins by sending a query to the cluster. The cluster then performs the following steps:

  • Query Parsing: The query, written in Elasticsearch’s Query DSL (Domain-Specific Language), is parsed. For instance, if you want to search for the term “apple” in the content field of an index, your query may look like this:
    {
      "query": {
        "match": {
          "content": "apple"
        }
      }
    }
  • Search Coordination: The node that receives the query becomes the coordinating node. It forwards the query in parallel to one copy of every relevant shard, choosing either the primary or one of its replicas. This means that if your data is split into multiple shards, all of those shards are queried simultaneously.
  • Scoring and Ranking: Each shard processes the query and returns its results to the coordinating node. These results are scored based on relevance. Since version 5.0 the default scoring algorithm is BM25, a refinement of TF-IDF (Term Frequency-Inverse Document Frequency); other factors, such as proximity in phrase queries and field boosts, also influence the score.
  • Merging Results: Once all shards return their results, the coordinating node merges them and returns the final result to the user.
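
The scatter-gather pattern described in these steps can be illustrated with a small Python sketch. Several simplifications are assumed here: shards are plain dictionaries, the "score" is just a raw term count rather than a real relevance score, and the shards are queried in a loop instead of in parallel. The function names are hypothetical.

```python
import heapq
from typing import Dict, List, Tuple

Hit = Tuple[float, int]  # (score, document ID)


def search_shard(shard_docs: Dict[int, str], term: str, size: int) -> List[Hit]:
    """Score matching documents on one shard and return its local top hits."""
    hits = []
    for doc_id, text in shard_docs.items():
        term_count = text.lower().split().count(term)
        if term_count:
            hits.append((float(term_count), doc_id))
    return heapq.nlargest(size, hits)


def coordinate(shards: List[Dict[int, str]], term: str, size: int) -> List[Hit]:
    """Query every shard, then merge the partial results into a global top list."""
    partial_results = [search_shard(shard, term, size) for shard in shards]
    all_hits = (hit for hits in partial_results for hit in hits)
    return heapq.nlargest(size, all_hits)


# Hypothetical data spread over three shards.
shards = [
    {1: "apple pie", 2: "banana bread"},
    {3: "apple apple tart"},
    {4: "cherry cake"},
]
print(coordinate(shards, "apple", size=2))  # [(2.0, 3), (1.0, 1)]
```

Each shard only needs to return its own best hits, so the coordinating node merges a handful of small lists instead of every matching document.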

One of Elasticsearch’s standout features is its near real-time search. Newly indexed data becomes searchable within about a second. Incoming documents are held in an in-memory buffer, and each refresh writes that buffer out as a new Lucene segment that searches can read.

  • Refreshing: Elasticsearch automatically “refreshes” the index every second by default, meaning newly indexed documents become searchable very quickly, though not immediately. This refresh cycle allows it to balance between search performance and data availability.

Distributed Nature of Elasticsearch

One of the reasons it can handle massive datasets and still return results in milliseconds is its distributed architecture. Data is spread across multiple nodes and shards, and each shard is a fully functional and independent search engine.

This means that the workload is distributed, and even in the event of node failure, the data is still available because of shard replication. An Elasticsearch cluster is designed to handle:

  • Horizontal scaling by adding more nodes.
  • Automatic balancing of data across nodes.
  • Failover and recovery in case of node failure.

How Elasticsearch Handles Write and Read Operations

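  • Write Operation (Indexing) When a document is indexed:
    • It is added to the in-memory indexing buffer and appended to the translog (a write-ahead log that protects operations not yet committed to disk).
    • Elasticsearch acknowledges the write once the operation has been recorded in the translog and applied to the shard’s replicas.
    • During the refresh cycle, the contents of the in-memory buffer are written out as a new segment, which makes them searchable. A later flush commits the segments to disk and trims the translog.
  • Read Operation (Search) During a search operation:
    • It queries the segments that have been opened by a refresh; documents still sitting in the in-memory buffer are not yet visible to search.
    • It combines the per-segment results, so both recently refreshed and older data are searchable.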
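
The interplay between the translog, the in-memory buffer, and segments can be modeled with a toy Python class. This is only a sketch under simplifying assumptions: everything lives in Python lists, there is no real durability, and the class and method names are hypothetical. It illustrates one point in particular: a document in the buffer becomes searchable only after a refresh turns the buffer into a segment.

```python
from typing import List, Tuple


class ToyShard:
    """A toy model of a shard's write path; not real Elasticsearch internals."""

    def __init__(self) -> None:
        self.translog: List[str] = []              # log of operations not yet committed
        self.buffer: List[str] = []                # indexed, but not yet searchable
        self.segments: List[Tuple[str, ...]] = []  # immutable, searchable segments

    def index(self, doc: str) -> None:
        """Add the document to the buffer and record the operation in the translog."""
        self.buffer.append(doc)
        self.translog.append(doc)

    def refresh(self) -> None:
        """Turn the buffer into a new searchable segment."""
        if self.buffer:
            self.segments.append(tuple(self.buffer))
            self.buffer = []

    def flush(self) -> None:
        """Refresh, then treat segments as committed and empty the translog."""
        self.refresh()
        self.translog.clear()

    def search(self, term: str) -> List[str]:
        """Search segments only; buffered documents are invisible."""
        return [
            doc
            for segment in self.segments
            for doc in segment
            if term in doc.lower().split()
        ]


shard = ToyShard()
shard.index("Elasticsearch Basics")
print(shard.search("basics"))  # [] because the document is still in the buffer
shard.refresh()
print(shard.search("basics"))  # ['Elasticsearch Basics']
```

The translog is what allows buffered operations to be replayed after a crash, which is why the toy `flush` only empties it once the segments are considered committed.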

Key Features of Elasticsearch

  • Scalability: It can scale horizontally by adding more nodes to a cluster, making it suitable for handling large datasets.
  • Speed: It is known for its fast search performance, even on large indexes.
  • Relevance: It uses sophisticated ranking algorithms to return the most relevant search results.
  • Full-text search: It supports full-text search, allowing users to search for specific terms or phrases within documents.
  • Geolocation search: It can index and search on geographic data, making it useful for location-based applications.
  • Analytics capabilities: It provides powerful analytics capabilities through aggregations and other tools.
  • Plugins: It has a rich ecosystem of plugins that extend its functionality, such as plugins for security, monitoring, and integration with other systems.


Example of Indexing a Document

Imagine we are indexing a new document into an Elasticsearch index called books. Here’s a JSON representation of the document.

{
  "title": "Elasticsearch Basics",
  "author": "John Doe",
  "description": "A beginner's guide to Elasticsearch",
  "published_date": "2023-08-15",
  "genre": "Technology",
  "price": 29.99
}

Sending the Document to Index

To add this document to the books index with the ID 1, the following HTTP request is made:

POST /books/_doc/1
{
  "title": "Elasticsearch Basics",
  "author": "John Doe",
  "description": "A beginner's guide to Elasticsearch",
  "published_date": "2023-08-15",
  "genre": "Technology",
  "price": 29.99
}
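
For readers who prefer code to raw HTTP, the same request can be assembled with Python's standard library. This is a minimal sketch: the base URL `http://localhost:9200` assumes a local, unsecured cluster, and `build_index_request` is a hypothetical helper. The request is only built here, not sent, so the snippet runs without a cluster.

```python
import json
import urllib.request


def build_index_request(base_url: str, index: str, doc_id: str, document: dict) -> urllib.request.Request:
    """Build (but do not send) the HTTP request that indexes one document."""
    return urllib.request.Request(
        url=f"{base_url}/{index}/_doc/{doc_id}",
        data=json.dumps(document).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


book = {
    "title": "Elasticsearch Basics",
    "author": "John Doe",
    "description": "A beginner's guide to Elasticsearch",
    "published_date": "2023-08-15",
    "genre": "Technology",
    "price": 29.99,
}

# Assumed address of a local cluster; adjust for your environment.
request = build_index_request("http://localhost:9200", "books", "1", book)
print(request.get_method(), request.full_url)  # POST http://localhost:9200/books/_doc/1

# To send it against a running cluster, something like this would work:
# with urllib.request.urlopen(request) as response:
#     print(json.load(response))
```

In practice many projects use an official client library instead, but the raw request makes it clear that indexing is just a JSON body sent to an index-specific URL.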

Processing the Document

Tokenization: The text fields (like title and description) are tokenized into smaller components. For example, with the default standard analyzer, “Elasticsearch Basics” will be split into two lowercase tokens: elasticsearch and basics.
Inverted Index Creation: An inverted index maps tokens (words) to the IDs of the documents in which they occur. This allows Elasticsearch to quickly retrieve documents containing specific terms.

Term            Document IDs
elasticsearch   1
basics          1
guide           1

Each term is associated with the document (in this case, document ID 1).

Sharding and Storing

Let’s assume the books index is divided into 5 primary shards. When the document is indexed, Elasticsearch uses a routing mechanism to assign it to one of the shards: by default it hashes the document’s ID (or a custom routing value, if one is supplied) and takes the result modulo the number of primary shards. For example, it may route the document to Shard 2.

Shard   Number of Documents
1       200,000
2       150,001 (after adding this document)
3       200,000
4       200,000
5       200,000

Once the document is stored in the primary shard, Elasticsearch copies it to that shard’s replica on another node for fault tolerance.
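
The routing step can be sketched in Python as a hash of the routing value taken modulo the number of primary shards. Note the assumptions: Elasticsearch uses a Murmur3 hash internally, whereas this sketch substitutes `zlib.crc32` from the standard library, so the shard numbers it prints will not match a real cluster. The function name is hypothetical, and shards are numbered from 0 here rather than from 1 as in the table above.

```python
import zlib


def route_to_shard(routing_value: str, num_primary_shards: int) -> int:
    """Pick a shard by hashing the routing value (the document ID by default).

    crc32 is a stand-in for the Murmur3 hash that Elasticsearch actually uses.
    """
    return zlib.crc32(routing_value.encode("utf-8")) % num_primary_shards


# The same ID always lands on the same shard, which is how a later
# lookup by ID knows where to find the document.
for doc_id in ["1", "2", "3", "42"]:
    print(doc_id, "->", route_to_shard(doc_id, num_primary_shards=5))
```

Because the result depends on the number of primary shards, changing that number would send existing IDs to different shards. This is one reason the primary shard count is fixed when an index is created.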

Query Execution Flow In Elasticsearch

  1. Coordinating Node: The node receiving the query will act as the coordinating node. It will distribute the query to all relevant shards (Shard 1 to Shard 5).
  2. Search Across Shards: Each shard will execute the query locally. For instance:
    • Shard 1 doesn’t contain any documents with the term “guide.”
    • Shard 2 contains our previously indexed document (“Elasticsearch Basics”), where “guide” is part of the description.
  3. Merging Results: Once each shard returns its results, the coordinating node merges them into a final result set.
{
  "hits": {
    "total": 1,
    "hits": [
      {
        "_index": "books",
        "_id": "1",
        "_source": {
          "title": "Elasticsearch Basics",
          "author": "John Doe",
          "description": "A beginner's guide to Elasticsearch",
          "published_date": "2023-08-15",
          "genre": "Technology",
          "price": 29.99
        }
      }
    ]
  }
}

Here, the document we indexed earlier appears in the results because it contains the term “guide.”

Example of Relevance Scoring

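Elasticsearch scores how relevant a document is to a search query with BM25, its default similarity algorithm since version 5.0. BM25 builds on Term Frequency-Inverse Document Frequency (TF-IDF), so the TF-IDF intuition described here still applies.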

Let’s say we now query for books containing “Elasticsearch” in the title:

GET /books/_search
{
  "query": {
    "match": {
      "title": "Elasticsearch"
    }
  }
}
  1. TF-IDF Scoring: The score combines two quantities:
    • Term Frequency (TF): How often the term appears in the document. In our case, the term “Elasticsearch” appears once in the title.
    • Inverse Document Frequency (IDF): How rare the term is across all documents. If “Elasticsearch” is common across many documents, its IDF, and therefore its weight, decreases.
    A high TF combined with a high IDF (a term that is frequent in this document but rare elsewhere) means the term is very relevant to the document, resulting in a higher score.
  2. Result with Scores:
    The _score field represents the relevance score based on how well the document matches the search term.
{
  "hits": {
    "total": 1,
    "max_score": 1.2,
    "hits": [
      {
        "_index": "books",
        "_id": "1",
        "_score": 1.2,
        "_source": {
          "title": "Elasticsearch Basics",
          "author": "John Doe",
          "description": "A beginner's guide to Elasticsearch",
          "published_date": "2023-08-15",
          "genre": "Technology",
          "price": 29.99
        }
      }
    ]
  }
}
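
To see how term frequency, document frequency, and field length interact, here is a sketch of the textbook BM25 formula for a single term in Python. The defaults `k1=1.2` and `b=0.75` match Elasticsearch's default BM25 parameters, but the function names and all input numbers are made up for illustration, and the scores a real cluster reports may differ (for example because of boosts or per-shard statistics).

```python
import math


def bm25_idf(total_docs: int, docs_with_term: int) -> float:
    """Inverse document frequency: the rarer the term, the larger the value."""
    return math.log(1 + (total_docs - docs_with_term + 0.5) / (docs_with_term + 0.5))


def bm25_term_score(
    term_freq: int,
    field_len: int,
    avg_field_len: float,
    total_docs: int,
    docs_with_term: int,
    k1: float = 1.2,
    b: float = 0.75,
) -> float:
    """Textbook BM25 score of one term in one field of one document."""
    idf = bm25_idf(total_docs, docs_with_term)
    # Fields longer than average are penalized; shorter ones are rewarded.
    length_norm = 1 - b + b * field_len / avg_field_len
    return idf * (term_freq * (k1 + 1)) / (term_freq + k1 * length_norm)


# Hypothetical corpus of 1,000 books whose titles average 2 terms.
rare = bm25_term_score(term_freq=1, field_len=2, avg_field_len=2, total_docs=1000, docs_with_term=10)
common = bm25_term_score(term_freq=1, field_len=2, avg_field_len=2, total_docs=1000, docs_with_term=500)
print(f"rare term: {rare:.3f}, common term: {common:.3f}")
```

Two properties are worth noticing: a term found in few documents scores higher than one found in many, and repeating a term raises the score with diminishing returns rather than linearly.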

Conclusion

Elasticsearch is a powerful and flexible search engine designed for the demands of modern, data-heavy applications. Its distributed architecture lets it scale horizontally while keeping queries fast, which makes it well suited to both full-text search and near-real-time data indexing.

With foundational concepts such as inverted indexes, sharding, and replication, Elasticsearch can process even the most complex queries with impressive speed and accuracy. This makes it an essential tool for businesses and applications that need to manage large volumes of data, whether for search functionality, analytics, or logging.

By distributing data across multiple nodes and replicating shards, Elasticsearch keeps your data accessible and searchable when individual nodes fail. Whether you need simple keyword searches or sophisticated, multi-faceted queries, it returns fast, relevant results for a wide range of use cases, from e-commerce websites to real-time logging systems.
