Despite all of the buzz focusing on newer NoSQL data stores, relational and SQL-based databases are still alive and well. In fact, almost every customer we work with has some MySQL, PostgreSQL, or MS SQL Server in their environment alongside their MongoDB, Redis, or Elasticsearch. It's not uncommon for us to get requests about the easiest way to replicate data from a relational database into another datastore, either as a migration or simply to add features like full-text search or visualization to their relational data. Luckily, Elasticsearch makes this super easy.
We’ve recently written a detailed white paper on Connecting Relational Databases to Elasticsearch that describes the process and the decisions that need to be made, but if you’d like a quick introduction and overview, read on below.
Selecting the Right Toolset
There are plenty of migration/replication tools available on the web, and even writing your own code for this scenario would not be a huge effort, but we've found Logstash with the JDBC input plugin to be by far the best solution.
Logstash is a Swiss Army knife style ingester, transformer, and shipper of data. And given that it's part of the Elastic Stack, its ability to ship data to Elasticsearch is second to none. That takes care of the shipping-to-Elasticsearch part, but what about grabbing the data from your relational database?
One of the many inputs Logstash offers is a JDBC input. JDBC is a standard that defines how Java applications access a database. As long as a JDBC-compliant driver is available for a database, the Logstash JDBC input can use it to extract data using standard SQL queries. Given the broad availability of free JDBC drivers for relational databases, and even for some non-relational ones, it's an easy path to connecting other databases with Elasticsearch.
Making the Connection
As long as you have a JDBC-compliant database and a compatible driver, the configuration is as easy as giving Logstash a few pieces of information:
- The JDBC driver library and driver class for your database
- Credentials and connection info for your relational database
- A SQL query or statement that extracts the data you want
- Credentials and connection info for your Elasticsearch cluster
With that information in hand, shuffling data into Elasticsearch is extremely easy.
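To make that concrete, here is a minimal sketch of a Logstash pipeline config covering those four pieces of information. The hostnames, credentials, driver path, and the "orders" table are all placeholders for illustration, not values from a real environment; this example assumes MySQL, but any database with a JDBC driver works the same way.

```conf
input {
  jdbc {
    # Path to the JDBC driver .jar and its driver class (MySQL shown here)
    jdbc_driver_library => "/opt/drivers/mysql-connector-java-8.0.30.jar"
    jdbc_driver_class => "com.mysql.cj.jdbc.Driver"
    # Credentials and connection info for the relational database
    jdbc_connection_string => "jdbc:mysql://db.example.com:3306/shop"
    jdbc_user => "replicator"
    jdbc_password => "secret"
    # The SQL statement that extracts the data you want
    statement => "SELECT * FROM orders"
  }
}

output {
  elasticsearch {
    # Credentials and connection info for the Elasticsearch cluster
    hosts => ["https://es.example.com:9200"]
    user => "elastic"
    password => "secret"
    index => "orders"
  }
}
```

Run Logstash with this config and every row returned by the statement becomes a document in the target index.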
Asking the Important Questions
When it comes to replicating or migrating your relational data to Elasticsearch, getting the data from point A to point B is only half the battle. A larger and more thought-provoking question is how you want to model that data once it's in Elasticsearch. The problem with relational data in Elasticsearch is, frankly, the relations.
Though Elasticsearch is first and foremost a search engine, the data model is essentially a document store with only minimal support for relations between documents. Because of that, you’ll need to make some decisions about how you want to model your data in Elasticsearch.
The three standard options that are generally presented are:
- Denormalize the data: remove all of the relations and create a unique document for every combination of related rows in your tables
- Arrays within documents: Elasticsearch supports storing an array of objects within each document, and even offers a special "nested" data type that allows queries to evaluate each member of that array as a separate document
- Parent-child relationships: the final option, and the closest to a true relational database, is Elasticsearch's parent-child relationship, which allows you to specify that certain documents are children of other documents
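As an illustration of the second option, here is a hypothetical index mapping that stores an order's line items as a "nested" array. The index name and field names are made up for the example:

```json
PUT /orders
{
  "mappings": {
    "properties": {
      "customer_name": { "type": "keyword" },
      "order_date":    { "type": "date" },
      "line_items": {
        "type": "nested",
        "properties": {
          "product":  { "type": "keyword" },
          "quantity": { "type": "integer" }
        }
      }
    }
  }
}
```

With the nested type, a query for `product: "widget" AND quantity: 5` only matches documents where a single line item satisfies both conditions, rather than matching across different members of the array as a plain object array would.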
Though Elasticsearch gives you some options for dealing with the data, it’s really up to you which method works best and meets your needs.
Another key decision is how much data to replicate into Elasticsearch, and how often. If you're migrating, the answer is easy, since that is a one-time dump of everything. However, if you want to use Elasticsearch as a secondary reporting datastore, you have additional options for how much data you copy over and on what cadence. You can copy full snapshots on a regular schedule, or use facilities in the Logstash JDBC input plugin to filter on certain columns and only send new data when it becomes available. Once again, this will all depend on your data and what you're using it for.
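The incremental approach can be sketched with the JDBC input's tracking-column facility. This example assumes the table has an `updated_at` timestamp column; connection details are placeholders as before. Logstash persists the last value it saw between runs and exposes it to the query as `:sql_last_value`:

```conf
input {
  jdbc {
    jdbc_driver_library => "/opt/drivers/mysql-connector-java-8.0.30.jar"
    jdbc_driver_class => "com.mysql.cj.jdbc.Driver"
    jdbc_connection_string => "jdbc:mysql://db.example.com:3306/shop"
    jdbc_user => "replicator"
    jdbc_password => "secret"
    # Run every five minutes (cron syntax)
    schedule => "*/5 * * * *"
    # Track the highest updated_at seen so each run only
    # selects rows changed since the previous run
    use_column_value => true
    tracking_column => "updated_at"
    tracking_column_type => "timestamp"
    statement => "SELECT * FROM orders WHERE updated_at > :sql_last_value"
  }
}
```

Note that this pattern only picks up inserts and updates that touch the tracked column; deletes in the source database need separate handling.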
Decisions and More Detail
The bottom line is that the tools and processes are available now to connect your datastores, and they are actually pretty easy to put into practice with your data. There will be some tradeoffs with respect to how your data is modeled in Elasticsearch, how you can query it, and how you can visualize it. For example, choosing to denormalize your relational data will have impacts on the size of the resulting data set, how you perform certain aggregations, and which visualizations are available to you.
To illustrate the process and how to make those decisions, we've written a white paper, Connecting Relational Databases to Elasticsearch, that outlines the standard process for setting up relational-to-Elasticsearch replication for a sample data set, the impacts the different modeling decisions have on the resulting data, and some guidelines on how to choose which approach is best for you.
We hope you find it useful! As always, you can get more information from us by reaching out to firstname.lastname@example.org.