How Elasticsearch transforms data

October 17, 2018 | August 18th, 2022

The ability of Elasticsearch to not only search but also analyze and visualize data (with Kibana) has made it one of the most popular open source search products in the world. It is used by companies in a wide range of industries to analyze business information and turn it into actionable data that can be used to boost sales and profits, but how does Elasticsearch do it? Let’s dig a little deeper into how Elasticsearch stores and uses data to provide this functionality.

Indexing, not storing data

Being able to quickly search your data is the key to all of the functionality that Elasticsearch provides. To make sure that everything is quickly searchable, Elasticsearch doesn’t just store your data somewhere safe; it does a lot of work up front to create a searchable index of your data.

What’s an index?

Well, it’s a lot like what you’d find in the back of a cookbook: it tells you which page holds the information you’re looking for, based on keywords. This idea is at the very core of Elasticsearch. It is why the main grouping of data in Elasticsearch is called an index (not a database or table), and why the act of storing something in Elasticsearch is referred to as indexing it (not storing, saving, or inserting).

Indexes in Elasticsearch

To be more specific, Elasticsearch creates what’s called an inverted index. An inverted index is similar to the index that you’d find in the back of a book. In the case of Elasticsearch, it is basically a list of all of the words that exist in your Elasticsearch data, along with the documents in which each word appears. This structure is generated when the data is indexed, so that Elasticsearch can find what you’re looking for extremely quickly.
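To make the idea concrete, here is a minimal sketch of an inverted index in Python. It is a toy illustration, not how Elasticsearch (or Lucene underneath it) actually lays out data; the sample documents and the function names `build_inverted_index` and `search` are made up for this example.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each word to the set of document IDs that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.split():
            index[word].add(doc_id)
    return index

def search(index, word):
    """Look up a word directly instead of scanning every document."""
    return sorted(index.get(word, set()))

# Hypothetical sample documents, keyed by document ID.
docs = {
    1: "the rocket launched at dawn",
    2: "a rocket needs fuel",
    3: "fuel prices rose at dawn",
}

index = build_inverted_index(docs)

print(search(index, "rocket"))  # [1, 2]
print(search(index, "dawn"))    # [1, 3]
```

The work of splitting and grouping happens once, when the documents are added. A search is then a single lookup by word, no matter how many documents there are.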

Elasticsearch can further optimize search at the point where it splits your data into the words (terms) that show up in the index, a step called analysis. It can normalize uppercase and lowercase so that a search for “rocket” would match “Rocket”. It can reduce words to a root form (known as stemming) so that a search for “rockets” would match “rocket”. It can even handle things like synonyms, so that a search for “projectile” could match “rocket”. All of this is possible and delivered quickly because of the way Elasticsearch stores your data when you index it.
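A rough sketch of that normalization, assuming a deliberately naive pipeline: lowercase the token, strip a trailing "s", then apply a synonym table. The `SYNONYMS` table and the `naive_stem` and `analyze` helpers are hypothetical, and real stemmers are far more careful than this one.

```python
# Hypothetical synonym table: maps a term to the term it is indexed under.
SYNONYMS = {"projectile": "rocket"}

def naive_stem(word):
    """Crude plural stripping, for illustration only."""
    if word.endswith("s") and len(word) > 3:
        return word[:-1]
    return word

def analyze(text):
    """Turn raw text into normalized index terms."""
    terms = []
    for token in text.split():
        token = token.strip(".,!?\"'").lower()
        if not token:
            continue
        token = naive_stem(token)
        terms.append(SYNONYMS.get(token, token))
    return terms

print(analyze("Rocket"))      # ['rocket']
print(analyze("rockets"))     # ['rocket']
print(analyze("Projectile"))  # ['rocket']
```

In this sketch the same `analyze` function would be applied both to documents as they are indexed and to the search text, so both sides end up with the same terms and can be matched by a simple lookup.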

Optimizations for analytics

When it comes to analytics, search is important for finding the data you want to work with. However, it’s not very helpful for consolidating, slicing, or dicing that data. In Elasticsearch, those consolidating or bucketing functions are called aggregations. To make aggregations (and a few other things, such as sorting) fast, Elasticsearch uses another on-disk structure called doc_values. The inverted index maps words to documents; doc_values go the other way, mapping each document to its field values. Since the search part of the process (“Look at data between date X and Y”) has already quickly found the documents you’re interested in, functions that aggregate the data in those documents are much faster if they can look up values by document. This is what doc_values provide: a fast data structure that is written to disk at indexing time and can be loaded into memory to make aggregations super fast.
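The document-to-value lookup can be sketched as a set of per-field columns. This is a simplified, in-memory model and not the real on-disk format; the sample documents and the helper names `build_doc_values`, `terms_aggregation`, and `sum_aggregation` are invented for illustration.

```python
from collections import Counter

# Hypothetical documents, keyed by document ID.
docs = {
    0: {"status": "shipped", "amount": 20},
    1: {"status": "pending", "amount": 35},
    2: {"status": "shipped", "amount": 15},
    3: {"status": "returned", "amount": 50},
}

def build_doc_values(docs, fields):
    """Store each field as a column: field -> {doc_id: value}."""
    return {
        field: {doc_id: doc[field] for doc_id, doc in docs.items()}
        for field in fields
    }

def terms_aggregation(doc_values, field, matching_ids):
    """Count matching documents per distinct value (a bucketing aggregation)."""
    column = doc_values[field]
    return Counter(column[doc_id] for doc_id in matching_ids)

def sum_aggregation(doc_values, field, matching_ids):
    """Sum a numeric field across the matching documents."""
    column = doc_values[field]
    return sum(column[doc_id] for doc_id in matching_ids)

doc_values = build_doc_values(docs, ["status", "amount"])

# Pretend the search phase already matched these document IDs.
matching_ids = [0, 2, 3]

print(terms_aggregation(doc_values, "status", matching_ids))
print(sum_aggregation(doc_values, "amount", matching_ids))  # 85
```

The aggregation functions never touch the original documents. They read only the one column they need, and only for the IDs the search already found.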

Powerful queries and rich visualizations

All of those optimizations under the hood help Elasticsearch provide insanely quick querying and visualizations. On the query side, Elasticsearch can do simple text and keyword matching. It can also go much further, enabling spell correction, phrase searching, autocompletion, and “more like this” types of queries. Though some of these query options are available in other types of datastores, Elasticsearch may make them easier to learn, and it certainly makes them faster. All of that work up front when you index the data enables speed later.
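As a rough illustration, a few of these query types look like this when written as search request bodies, shown here as Python dictionaries that serialize to JSON. The field names (`title`, `body`) and the sample text are assumptions, and the sketch only builds the request bodies without sending them to a cluster.

```python
import json

# Field names ("title", "body") and the sample text are hypothetical.

# Phrase search: the words must appear together, in order.
phrase_query = {"query": {"match_phrase": {"body": "rocket launch"}}}

# Typo-tolerant matching: "roket" can still match "rocket".
fuzzy_query = {
    "query": {"match": {"title": {"query": "roket", "fuzziness": "AUTO"}}}
}

# "More like this": find documents similar to a piece of text.
more_like_this_query = {
    "query": {
        "more_like_this": {
            "fields": ["title", "body"],
            "like": "reusable rocket boosters",
            "min_term_freq": 1,
            "min_doc_freq": 1,
        }
    }
}

for body in (phrase_query, fuzzy_query, more_like_this_query):
    print(json.dumps(body, indent=2))
```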

One of the other benefits of the Elastic Stack is access to Kibana, which enables rich visualizations for your data in an open source toolset. The magic that makes Kibana so powerful is the search and aggregation functionalities of Elasticsearch behind it. Two key components for visualizing data are the ability to focus on a subset of the data and then group that data in some way to show a trend or decomposition of some aspect of the data. That’s basically search and aggregations respectively. Kibana is surely providing the sizzle, but Elasticsearch is the steak.
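That pairing of "focus on a subset" and "group it" maps onto a single search request with a `query` part and an `aggs` part. The sketch below builds such a request body as a Python dictionary. The `timestamp` and `status` field names and the `build_dashboard_request` helper are assumptions, and this is not meant to show what Kibana itself sends.

```python
import json

def build_dashboard_request(start, end, group_field):
    """Combine a search (the subset) with an aggregation (the grouping)."""
    return {
        # Return only aggregation results, not the matching documents.
        "size": 0,
        # Search: narrow down to documents in the date range.
        "query": {"range": {"timestamp": {"gte": start, "lt": end}}},
        # Aggregation: bucket the matching documents by a field's values.
        "aggs": {"by_group": {"terms": {"field": group_field}}},
    }

body = build_dashboard_request("2018-10-01", "2018-11-01", "status")
print(json.dumps(body, indent=2))
```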

The downside of all this goodness

All of this sounds amazing. There has to be a catch… and there is. In fact, there are two. The first is that Elasticsearch is not completely unstructured like some other NoSQL datastores. In order to perform its up-front optimizations, it gives up some flexibility down the road. Once you create an index, it’s difficult to change key aspects of that index without fully reindexing. For example, once you decide that a field is an integer, you can’t change it to a floating point number later. These are smaller limitations, but they can bite you.

The second is space. All of the text analysis, inverted indexes, replication, and doc_values take up space on disk. With the default settings, you can put 1GB of text into Elasticsearch and easily end up with 10GB+ of Elasticsearch indexes. Out of the box, Elasticsearch wants to make your data as searchable and as fast as possible. The cost is a whole lot more disk and memory.
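The field-type limitation above usually means creating a new index with the mapping you want and copying the data across. A minimal sketch of the request bodies involved follows, assuming hypothetical index names (`products_v1`, `products_v2`), a hypothetical `price` field, and the typeless mapping format used by Elasticsearch 7.x and later.

```python
import json

# Original index: "price" was mapped as an integer and cannot be changed in place.
products_v1 = {"mappings": {"properties": {"price": {"type": "integer"}}}}

# New index, created separately, with the mapping we want instead.
products_v2 = {"mappings": {"properties": {"price": {"type": "float"}}}}

# Body for the _reindex API: copy every document from the old index to the new one.
reindex_body = {
    "source": {"index": "products_v1"},
    "dest": {"index": "products_v2"},
}

print(json.dumps(products_v2, indent=2))
print(json.dumps(reindex_body, indent=2))
```

In this sketch the second body would be used to create `products_v2`, and the third would be posted to `_reindex`. Every document gets indexed again, which is why the cost grows with the size of your data.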

The good news is that there are numerous solutions available for these challenges. Notice that the paragraph above specifically says “With the default settings”. Elasticsearch gives you a number of options: you can skip indexing or analyzing certain fields (reducing the inverted index size), skip storing doc_values for certain fields, eliminate redundant catch-all fields (like the _all field, which is disabled by default in 6.x+), or make global changes like enabling higher compression on disk. These changes will affect what you can search and how fast, but if you know what you’re doing, they can help keep your data volume in check.
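A few of those options, sketched as an index-creation request body in Python. The field names are hypothetical, the typeless mapping format of Elasticsearch 7.x and later is assumed, and whether each trade-off is worth it depends entirely on how you query the data.

```python
import json

# Field names and the typeless (7.x+) mapping format are assumptions.
create_index_body = {
    "settings": {
        # Trade some speed for a smaller on-disk footprint.
        "index": {"codec": "best_compression"}
    },
    "mappings": {
        "properties": {
            # Returned with the document, but not searchable or aggregatable.
            "raw_payload": {"type": "keyword", "index": False, "doc_values": False},
            # Searchable, but cannot be used for sorting or aggregations.
            "session_id": {"type": "keyword", "doc_values": False},
            # Full-text field, analyzed and indexed as usual.
            "message": {"type": "text"},
        }
    },
}

print(json.dumps(create_index_body, indent=2))
```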

Get expert Elasticsearch support

ObjectRocket experts know a lot of tips and tricks for ensuring you get the most out of all of the goodness while minimizing the downsides of Elasticsearch. We’d love to learn the ins and outs of your use case so you can make the right decisions for your application. Learn more about our managed Elasticsearch offering and spin up a free instance today.