
Elasticsearch Guide: Full-Text Search, Inverted Index and ELK Stack

March 29, 2026

What Is Elasticsearch?

Elasticsearch is a distributed, open-source search and analytics engine built on Apache Lucene. It supports real-time full-text search, structured data analysis, log management, and application performance monitoring. Created by Shay Banon in 2010, Elasticsearch has become one of the most widely adopted search engines in the world, powering search functionality for organizations of every size.

Companies like Wikipedia, GitHub, Netflix, and Uber rely on Elasticsearch for various use cases. Its RESTful API interface, JSON-based query language, and horizontal scalability make it a favorite among developers building search-intensive applications.

Inverted Index: The Foundation of Elasticsearch

The secret behind Elasticsearch's exceptional search performance lies in its inverted index data structure. While traditional databases scan documents sequentially, an inverted index pre-maps every term to the documents that contain it, enabling near-instant lookups.

How Does an Inverted Index Work?

Think of the index at the back of a textbook: instead of reading the entire book to find a word, you look it up in the index to see which pages mention it. An inverted index works on exactly this principle, but at massive scale.

Consider three sample documents:

  1. "Elasticsearch is a fast search engine"
  2. "Elasticsearch works in a distributed manner"
  3. "Search engines index data efficiently"

The inverted index maps each term to its document list:

Term            Document List
elasticsearch   [1, 2]
fast            [1]
search          [1, 3]
engine          [1]
distributed     [2]
works           [2]
engines         [3]
data            [3]
index           [3]

This structure enables searching across millions of documents in milliseconds.
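The mapping above can be sketched in a few lines of Python. This is a toy model, not how Lucene stores its index on disk, but it shows the core idea: tokenize each document once at index time, then answer lookups by term.

```python
import re
from collections import defaultdict

# The three sample documents from the text, keyed by document ID.
docs = {
    1: "Elasticsearch is a fast search engine",
    2: "Elasticsearch works in a distributed manner",
    3: "Search engines index data efficiently",
}

def build_inverted_index(documents):
    """Map each lowercased term to the sorted list of documents containing it."""
    index = defaultdict(list)
    for doc_id, text in documents.items():
        for token in re.findall(r"\w+", text.lower()):
            if doc_id not in index[token]:  # avoid duplicate postings
                index[token].append(doc_id)
    return index

index = build_inverted_index(docs)
print(index["elasticsearch"])  # [1, 2]
print(index["search"])         # [1, 3]
```

A query for a term is now a dictionary lookup rather than a scan over every document, which is why lookups stay fast as the corpus grows.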

Core Elasticsearch Concepts

Index

An index in Elasticsearch is a collection of documents with similar characteristics, analogous to a table in relational databases. Each index can be divided into one or more shards, and each shard is an independent Lucene instance capable of handling search requests.

Document

The fundamental data unit in Elasticsearch is a JSON document. Each document is stored within an index and has a unique identifier. Documents contain fields, and each field can have different data types defined through mapping.

Shards and Replicas

A shard is a horizontal partition of an index, allowing data to be distributed across multiple nodes. Replica shards are copies created for redundancy and improved read performance. If a node fails, replica shards ensure data availability and search continuity.

Mapping and Data Types

Mapping defines the schema of documents in an index, specifying the data type of each field, how it should be indexed, and how it should be searched. Proper mapping design is critical for search performance and result quality.

Common Data Types

  • text: Analyzed text fields for full-text search
  • keyword: Exact-match fields for filtering, sorting, and aggregations
  • integer/long/float/double: Numeric data types
  • date: Date and timestamp fields
  • boolean: True/false values
  • nested: Objects that need independent indexing
  • geo_point: Geographic coordinate data

Dynamic vs Explicit Mapping

Elasticsearch can automatically create mappings when it encounters new fields (dynamic mapping). However, explicit mapping is strongly recommended for production environments because it gives you complete control over data types and prevents unexpected type conversions that can cause indexing failures.
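As a sketch, here is what an explicit mapping (plus shard settings) might look like for a hypothetical "products" index. The dict below is the JSON body you would send when creating the index; all field names are illustrative, not from a real schema.

```python
# Hypothetical index definition: settings control sharding/replication,
# mappings pin down each field's type so dynamic mapping never guesses.
products_index = {
    "settings": {
        "number_of_shards": 3,    # horizontal partitions of the index
        "number_of_replicas": 1,  # one redundant copy of each shard
    },
    "mappings": {
        "properties": {
            "name":       {"type": "text"},      # analyzed, for full-text search
            "sku":        {"type": "keyword"},   # exact match, sorting, aggregations
            "price":      {"type": "float"},
            "in_stock":   {"type": "boolean"},
            "created_at": {"type": "date"},
            "location":   {"type": "geo_point"},
        }
    },
}
```

Declaring `sku` as `keyword` rather than letting dynamic mapping infer `text` is a typical example of why explicit mappings matter: filtering and aggregating on an analyzed field behaves very differently from exact-match lookups.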

Analyzers: How Text Gets Indexed

An analyzer defines how text is processed during indexing. It consists of three components that work sequentially to transform raw text into searchable tokens.

Character Filters

Character filters perform character-level transformations on raw text before tokenization. Common operations include stripping HTML tags, converting special characters, and normalizing Unicode characters.

Tokenizer

The tokenizer splits text into individual tokens. The standard tokenizer splits on whitespace and punctuation, while the ngram tokenizer creates character sequences of specified lengths, useful for autocomplete and partial matching.

Token Filters

Token filters transform individual tokens after tokenization. Operations include lowercasing, synonym expansion, stop word removal, and stemming (reducing words to their root form).

For languages with complex morphology, the standard analyzer may produce suboptimal results, so custom analyzers are essential. Using language-specific analyzers, ICU plugins, or custom stemming dictionaries significantly improves search relevance.
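The three-stage pipeline above can be modeled in plain Python. Real analyzers run inside Lucene; this toy version only illustrates the data flow from raw text to searchable tokens, with a deliberately tiny stop-word list.

```python
import re

STOP_WORDS = {"a", "an", "the", "is", "in"}  # illustrative stop list only

def analyze(text):
    """Toy analyzer: character filter -> tokenizer -> token filters."""
    # 1. Character filter: strip HTML tags from the raw text
    text = re.sub(r"<[^>]+>", " ", text)
    # 2. Tokenizer: split on whitespace and punctuation
    tokens = re.findall(r"\w+", text)
    # 3. Token filters: lowercase, then drop stop words
    return [t.lower() for t in tokens if t.lower() not in STOP_WORDS]

print(analyze("<b>Elasticsearch</b> is a FAST search engine"))
# ['elasticsearch', 'fast', 'search', 'engine']
```

The same pipeline runs at both index time and query time, which is why a document and a query phrase written with different casing or markup can still match.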

Search Queries

Match Query

The most fundamental full-text search query. The input text is processed through the field's analyzer, split into terms, and documents containing those terms are returned ranked by relevance score (BM25 by default; older versions used TF-IDF).

Term Query

Performs exact-match searches without analysis. Used for filtering on keyword fields. Should not be used on text fields because the indexed tokens may not match the unanalyzed query term.

Bool Query

Combines multiple queries with boolean logic: must (AND), should (OR), must_not (NOT), and filter (unscored filtering). Bool queries are the building blocks of complex search logic in Elasticsearch.

Range Query

Queries numeric or date fields for values within a specified range using operators: gte (greater than or equal), lte (less than or equal), gt (greater than), and lt (less than).
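The four query types above compose naturally inside a bool query. The dict below is a sketch of such a search body; the field names (`title`, `status`, `price`, `archived`) are illustrative, not from a real index.

```python
# Hypothetical bool query: one scored full-text clause plus
# unscored (and cacheable) filter clauses.
query = {
    "query": {
        "bool": {
            "must": [
                {"match": {"title": "distributed search"}}  # analyzed, affects score
            ],
            "filter": [
                {"term": {"status": "published"}},            # exact match on keyword
                {"range": {"price": {"gte": 10, "lt": 100}}}  # 10 <= price < 100
            ],
            "must_not": [
                {"term": {"archived": True}}                  # exclude archived docs
            ],
        }
    }
}
```

Putting the `term` and `range` clauses in `filter` rather than `must` keeps them out of scoring and lets Elasticsearch cache them, which matters for repeated queries.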

Aggregations

One of Elasticsearch's most powerful features is its aggregation framework, which enables grouping, statistical computation, and data analysis directly within the search engine.

Bucket Aggregations

Bucket aggregations group documents into buckets. The terms aggregation is the most commonly used, similar to SQL's GROUP BY. Histogram and date_histogram aggregations enable time-based analysis and visualization.

Metric Aggregations

Metric aggregations compute statistics over numeric values: avg (average), sum, min, max, and cardinality (approximate distinct count). These provide real-time analytics without requiring a separate analytics database.

Pipeline Aggregations

Pipeline aggregations operate on the output of other aggregations, enabling advanced analytics like moving averages, cumulative sums, and derivatives for trend analysis.
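As a sketch, a bucket and a metric aggregation are often combined: group documents by a keyword field, then compute a statistic inside each bucket, roughly SQL's `GROUP BY category` with `AVG(price)`. Field names below are illustrative.

```python
# Hypothetical aggregation request body: terms buckets with a
# nested avg metric; size 0 suppresses the regular search hits.
aggs_body = {
    "size": 0,  # only aggregation results are needed, no documents
    "aggs": {
        "by_category": {
            "terms": {"field": "category", "size": 10},  # top 10 categories
            "aggs": {
                "avg_price": {"avg": {"field": "price"}}  # per-bucket average
            },
        }
    },
}
```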

ELK Stack: Elasticsearch, Logstash, Kibana

The ELK Stack is the most popular open-source solution for log management and data analytics, combining three powerful tools into an integrated platform.

Logstash

Logstash is a data collection and transformation pipeline. It can ingest data from diverse sources (files, databases, message queues), apply filters and transformations, and ship the processed data to Elasticsearch. Grok patterns allow parsing unstructured log data into structured fields.

Kibana

Kibana is the visualization layer for Elasticsearch data. It enables creating dashboards, charts, maps, and tables that make data comprehensible. The Discover tab supports ad-hoc searching, Visualize enables chart creation, and Dashboard combines multiple visualizations into unified views.

Beats

Beats are lightweight data shipping agents. Filebeat (log files), Metricbeat (system metrics), Packetbeat (network data), and Heartbeat (uptime monitoring) are the most common variants. Beats can ship data directly to Elasticsearch or through Logstash for additional processing.

Performance Optimization

  • Shard Count: Since each shard is a Lucene instance, too many shards degrade performance. General rule: keep each shard between 10 and 50 GB.
  • Bulk API: Always use the Bulk API for batch indexing operations to minimize network overhead.
  • Doc Values: Enable doc values for fields used in aggregations and sorting.
  • Filter Caching: Queries in the filter context of bool queries are cached automatically for faster subsequent execution.
  • Index Lifecycle Management: Define ILM policies to automatically manage old indices through hot, warm, cold, and delete phases.
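The Bulk API mentioned above takes a newline-delimited JSON body in which each action line is followed by its document. A sketch of building that body (index name and documents are hypothetical):

```python
import json

docs = [
    {"_id": "1", "name": "laptop", "price": 999.0},
    {"_id": "2", "name": "mouse", "price": 25.0},
]

def bulk_body(index_name, documents):
    """Build an NDJSON bulk body: an action line, then the document itself."""
    lines = []
    for doc in documents:
        doc = dict(doc)  # copy so popping _id does not mutate the input
        action = {"index": {"_index": index_name, "_id": doc.pop("_id")}}
        lines.append(json.dumps(action))
        lines.append(json.dumps(doc))
    return "\n".join(lines) + "\n"  # the bulk body must end with a newline

print(bulk_body("products", docs))
```

Sending hundreds of documents in one such request avoids the per-request network and coordination overhead of indexing them individually.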

Conclusion

Elasticsearch provides a powerful and flexible solution for modern application search and analytics needs. Its inverted index structure enables millisecond-level search across massive datasets, its aggregation framework supports complex real-time analytics, and the ELK Stack integration delivers comprehensive log management capabilities. With proper mapping design, appropriate analyzer selection, and performance optimization, you can unlock the full potential of Elasticsearch for your applications.
