This topic interests us a great deal, and we have spent a lot of time trying to get it right. We went through a number of architecture designs as our experience with Elasticsearch matured. Needless to say, the topic is complex, and just because something works for us doesn’t mean that it will work for you. So with that in mind, let’s delve into it.
A lot of people hear stories about Elasticsearch and how easy and straightforward it is to get going. And it really is that easy. Most will have an Elasticsearch installation running after an hour or so. It’s just a matter of following the excellent documentation at Elastic’s website.
Once you have installed the first node, you will want to add a few more, which is still a simple operation.
You end up with, for example, a simple 3-node cluster where all nodes hold the same roles: every node is master, data, Kibana and Logstash. That is perfectly fine if you are just trying things out.
However, as you deploy more of these small clusters for various usecases, you tend to get frustrated with this approach, because the small clusters can’t talk to each other. You can’t search all your data from a single location, and you need to maintain and monitor each cluster separately.
So the next logical step is to go bigger and follow the recommendations from Elastic. You add dedicated master nodes, dedicated Kibana nodes, and dedicated data nodes for hot and warm. This is cool: we can extend this cluster without limit and put all our data in one huge cluster. And you probably can, but you probably don’t want to.
Somewhere down the line, the masters will have trouble keeping up. One usecase will end up influencing the other usecases running in the same cluster. Upgrading the nodes becomes a pain. If something bad happens to the cluster, everything is gone. And maybe you will feel a need to rethink the architecture.
So we did that.
What we came up with is a simple modular approach, where you will get these benefits:
- Search all your data from one place
- Centralized Logstash pipeline management
- Centralized Cluster Monitoring
- Centralized Machine Learning
- Usecase/workload isolation
- Simple maintenance
- Unlimited scalability
You can build this architecture through the power of Cross Cluster Search (CCS). If you have not heard of CCS, I encourage you to read the docs. It is a fairly new feature that lets you connect to one cluster and search the remote clusters connected to it, completely transparently. Your users don’t need to know or care which Elasticsearch cluster their data is located in.
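To make the idea concrete, here is a minimal sketch of how a search against the frontend can address indices in remote clusters. It relies on the CCS convention of prefixing an index pattern with the remote cluster's alias and a colon. The aliases `backend_logs` and `backend_metrics`, the `app-*` pattern and both helper functions are made up for illustration; the code only builds the request path and does not talk to a cluster.

```python
def ccs_index_expression(cluster_aliases, index_pattern):
    """Build a cross-cluster index expression such as 'backend_logs:app-*'."""
    if not cluster_aliases:
        raise ValueError("at least one remote cluster alias is required")
    # Each remote index is addressed as <cluster alias>:<index pattern>.
    return ",".join(f"{alias}:{index_pattern}" for alias in cluster_aliases)


def search_path(index_expression):
    """Path of the search request that would be sent to the frontend cluster."""
    return f"/{index_expression}/_search"


# Hypothetical aliases for two backend clusters.
expr = ccs_index_expression(["backend_logs", "backend_metrics"], "app-*")
print(search_path(expr))
```

A wildcard in the alias position, for example `*:app-*`, should address every connected remote cluster at once, which is what makes the data location invisible to end users.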
The only caveat is that it is a licensed feature, so you will need a subscription from Elastic in order to deploy this topology.
So the topology is basically a frontend cluster with these resources:
- Master nodes
- Kibana nodes
- ML nodes
- Small data nodes, no end user data
Then you build a number of usecase clusters, or backend clusters, each with these resources:
- Master nodes
- Coordinator nodes – if required, depends on size
- Hot Data
- Warm Data
This architecture isolates the clusters from each other and thus the workloads. You can easily do rolling upgrades without affecting the whole setup. You could even run different versions of Elasticsearch in the clusters. However, that is not something we would recommend.
You will also need to install Kibana in the backend clusters, as the developer tools don’t work across clusters yet. This is purely for administrative purposes, with no end user access. We are hoping that remote connect in developer tools will be available in a future release of Kibana.
Whenever you have a new usecase that doesn’t play well with the existing backend clusters, you simply deploy a new backend cluster and plug it into the frontend. All your existing usecases will continue running, completely unaware of this added workload. The data will be instantly searchable from the same Kibana locations that your users are accustomed to.
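Plugging a backend into the frontend comes down to registering it as a remote cluster on the frontend. The sketch below builds the body for a `PUT _cluster/settings` request, assuming the `cluster.remote.<alias>.seeds` setting (older releases used `search.remote` instead, so check the docs for your version). The alias `backend_security` and the host name are hypothetical, and the helper is ours, not part of any client library.

```python
import json


def remote_cluster_settings(alias, seeds):
    """Body for PUT _cluster/settings on the frontend cluster.

    `seeds` are transport addresses (host:port, typically port 9300)
    of nodes in the backend cluster being registered.
    """
    return {"persistent": {"cluster": {"remote": {alias: {"seeds": list(seeds)}}}}}


# Hypothetical alias and seed node for a newly deployed backend cluster.
body = remote_cluster_settings("backend_security", ["es-sec-01.example.internal:9300"])
print(json.dumps(body, indent=2))
```

Once the setting is applied, indices in the new cluster can be addressed as `backend_security:<index pattern>` from the frontend.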
As for the Logstash nodes, you install a group of them and configure them to connect to the frontend cluster for their pipeline management. In the output section of the actual pipelines, you target the specific backend cluster, typically its coordinator nodes.
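The split between "managed by the frontend" and "writes to a backend" can be sketched as two small pieces of configuration, rendered here as text from Python. The `xpack.management.*` setting names are an assumption based on the Logstash centralized pipeline management settings as we know them (older releases used `xpack.management.elasticsearch.url` rather than `hosts`). All host names, the pipeline id and the index name are hypothetical, and authentication and TLS options are left out for brevity.

```python
def logstash_yml_management(frontend_hosts, pipeline_ids):
    """logstash.yml fragment: fetch pipeline definitions from the frontend cluster."""
    hosts = ", ".join(f'"{h}"' for h in frontend_hosts)
    ids = ", ".join(f'"{p}"' for p in pipeline_ids)
    return "\n".join([
        "xpack.management.enabled: true",
        f"xpack.management.elasticsearch.hosts: [{hosts}]",
        f"xpack.management.pipeline.id: [{ids}]",
    ])


def elasticsearch_output(backend_hosts, index):
    """Pipeline output section: write events to one specific backend cluster."""
    hosts = ", ".join(f'"{h}"' for h in backend_hosts)
    return (
        "output {\n"
        "  elasticsearch {\n"
        f"    hosts => [{hosts}]\n"
        f'    index => "{index}"\n'
        "  }\n"
        "}"
    )


# Hypothetical frontend node, pipeline id, backend coordinator and index name.
yml = logstash_yml_management(["https://frontend-01.example.internal:9200"], ["app_logs"])
out = elasticsearch_output(["https://coord-01.example.internal:9200"], "app-%{+YYYY.MM.dd}")
print(yml)
print(out)
```

The point of the design is that the same group of Logstash nodes can serve several usecases: the pipeline definitions live in one place, and only the `hosts` value in each output decides which backend receives the data.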
Your end users connect to Kibana in the frontend cluster and transparently search all the backend clusters that they are allowed to. The roles that you assign them in the frontend will be carried with their search requests to the backend clusters.
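As a rough mental model of what "carried with their search requests" means, the sketch below assumes that the frontend forwards the user's role names and that each backend checks those names against its own role definitions. This is a simplified toy model, not the actual Elasticsearch security implementation; the role names, index patterns and the helper function are hypothetical.

```python
from fnmatch import fnmatch


def allowed_on_backend(user_roles, backend_role_defs, index):
    """Return True if any forwarded role grants read access to `index`.

    `backend_role_defs` maps a role name to the index patterns that role
    may read on this backend. Roles unknown to the backend grant nothing.
    """
    for role in user_roles:
        for pattern in backend_role_defs.get(role, []):
            if fnmatch(index, pattern):
                return True
    return False


# Hypothetical role definitions on one backend cluster.
backend_logs_roles = {"logs_reader": ["app-*"]}

print(allowed_on_backend(["logs_reader"], backend_logs_roles, "app-2024.01.01"))
print(allowed_on_backend(["metrics_reader"], backend_logs_roles, "app-2024.01.01"))
```

Under this model, a role that exists only in the frontend gives no access to backend data, so role names need to be kept in sync between the frontend and the backends a user should reach.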
Hopefully you will have Kibana set up with Spaces, so that it gives your users a more structured view of all the data available to them. And you, as the admin, can monitor the state of your entire Elasticsearch setup in one place.
As we said at the start, Elasticsearch architecture is complex and there is no one size fits all. We hope this article gave you some ideas on how to build a highly scalable and modular Elasticsearch setup that lets you take advantage of the power of Elasticsearch.