Simple Kafka and Elasticsearch integration

Kafka is an awesome platform for moving data around. It is often used in front of an Elasticsearch cluster, to buffer data before it gets ingested into Elasticsearch.

Kafka uses topics to carry specific kinds of data around. Imagine having topics for dns, dhcp, firewall and so on. You can quickly end up with a high number of topics, right?

So in this blog post, I will present a way for you to utilize a single Kafka topic to carry many kinds of data, while still being able to ingest into different Elasticsearch indices. It also enables you to specify the rotation of the indices: rollover, weekly, daily or whatever your needs may be.

The approach works by creating a few simple properties alongside your data:

  • myapp
  • myrotation

Let's use this scenario. You have some kind of logfile that contains log data for your app “blog-hits”. Your app is low volume in terms of logging and you just need a weekly index.

You install Filebeat and add these entries to your Filebeat configuration:

  fields:
    myapp: blog-hits
    myrotation: weekly
  fields_under_root: true
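
For context, here is a minimal sketch of how those entries could sit in a complete filebeat.yml; the log path and the Logstash host are assumptions for illustration:

  filebeat.inputs:
    - type: log
      paths:
        - /var/log/blog-hits/*.log     # hypothetical log location
      fields:
        myapp: blog-hits
        myrotation: weekly
      fields_under_root: true

  output.logstash:
    hosts: ["logstash01:5044"]         # assumed Logstash endpoint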

You would then configure Filebeat to send this to Logstash for further parsing. After parsing, Logstash sends the events to Kafka on a topic called “application-logs”, which you have configured on your Kafka servers.
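
The Kafka output of that parsing pipeline could look something like this minimal sketch; note the json codec, which keeps the myapp and myrotation fields intact for the consumer pipeline below:

output {
  kafka {
    bootstrap_servers => "kafka01:9092,kafka02:9092,kafka03:9092"
    topic_id => "application-logs"
    codec => "json"
  }
}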

If you prefer, you can also add the myapp and myrotation fields in the Logstash pipeline that parses your data. It is just a matter of preference.
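
If you go that route, a simple mutate filter in the parsing pipeline will do:

filter {
  mutate {
    add_field => {
      "myapp"      => "blog-hits"
      "myrotation" => "weekly"
    }
  }
}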

You will have a Logstash consumer of the topic “application-logs” in a pipeline like this:

input
{
  kafka {
    bootstrap_servers => "kafka01:9092,kafka02:9092,kafka03:9092"
    topics => [ "application-logs" ]
    codec => "json"
    group_id => "tier1"
    decorate_events => true
  }
}

Please notice that I used decorate_events => true. This is important for the rest of the pipeline: it makes Kafka metadata, such as the topic name, available under the [@metadata][kafka] field.

Next we will define the filter section:

filter {
  mutate {
    copy => { "[@metadata][kafka][topic]" => "kafkatopic" }
  }

  if ![myapp] {
    mutate {
      add_field => { "myapp" => "default" }
    }
  }

  if ![myrotation] {
    mutate {
      add_field => { "myrotation" => "weekly" }
    }
  }
}

In the filter, we copy the Kafka topic name from the metadata into the kafkatopic field and make sure that we have default values for myapp and myrotation. Now we get to the interesting output section:


output
{
  if [myrotation] == "rollover"
  {
    elasticsearch {
      hosts => ["https://elastic01:9200", "https://elastic02:9200"]
      manage_template => false
      index => "%{[kafkatopic]}-%{[myapp]}-active"
    }
  }

  if [myrotation] == "daily"
  {
    elasticsearch {
      hosts => ["https://elastic01:9200", "https://elastic02:9200"]
      manage_template => false
      index => "%{[kafkatopic]}-%{[myapp]}-%{+YYYY.MM.dd}"
    }
  }

  if [myrotation] == "weekly"
  {
    elasticsearch {
      hosts => ["https://elastic01:9200", "https://elastic02:9200"]
      manage_template => false
      index => "%{[kafkatopic]}-%{[myapp]}-%{+xxxx.ww}"
    }
  }
}

In the output section, we use the information carried in myapp and myrotation to ingest our logs into an application-specific index with the rotation we want. This pipeline is only used to route the data to the correct index. (For the rollover case, the “-active” index would typically be a write alias managed by a rollover policy.)

In this case, data will get stored in “application-logs-blog-hits-2019.14”, since %{+xxxx.ww} expands to the week-based year and the week number.

You can use this simple approach to carry many different kinds of data in a single Kafka topic, while still ingesting each kind into a separate index in Elasticsearch.

Scalable syslog pipeline

If you are receiving syslog data from a variety of network devices, you need a design that will allow you to receive and process syslog messages before you ingest them into your Elasticsearch cluster.

Processing syslog messages can be quite heavy in terms of CPU usage if you are doing a lot of grok statements.

As always, this can be done in many different ways, but in this blog post I will show the basics of a Kafka-based architecture.

An initial approach that will work in many use cases: just put some kind of loadbalancer in front and use it to receive your syslog messages and ship them to some Logstash instances for processing.

This approach will be fine for a small to medium sized setup. But how will you scale it? Well, deploy one more Logstash server and edit your loadbalancer configuration to use the new Logstash server. But there is a smarter way.

I suggest that you have your loadbalancer forward to 2-3 Logstash servers running an extremely simple syslog pipeline. This syslog input pipeline does absolutely nothing but forward the data to your Kafka cluster:

input {
  tcp {
    port => 514    # ports below 1024 may require elevated privileges
    type => syslog
  }
  udp {
    port => 514
    type => syslog
  }
}

filter {
}

output {
  kafka {
    bootstrap_servers => "kafka-01:9092"
    topic_id => "raw-syslog"
    codec => "json"    # the consumer pipeline below expects json
  }
}

Of course, since syslog is send-and-forget, be sure that this input pipeline is backed by a persistent queue in Logstash.
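
A minimal sketch of how that could look in pipelines.yml; the pipeline id and config path are assumptions for illustration:

- pipeline.id: syslog-ingest
  path.config: "/etc/logstash/conf.d/syslog-ingest.conf"
  queue.type: persisted    # spool events to disk instead of memory
  queue.max_bytes: 2gb     # cap on-disk queue size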

The boxes to run this pipeline can be quite small as there will be no processing going on.

If you are running with rsyslog, you could even configure rsyslog to send directly to Kafka, and you won't need this Logstash input pipeline.
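
With rsyslog's omkafka module, that could look roughly like this sketch (broker and topic match the example above; treat the exact parameters as an assumption to verify against your rsyslog version):

module(load="omkafka")
action(
  type="omkafka"
  broker=["kafka-01:9092"]
  topic="raw-syslog"
)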

But right now, you just have raw syslog messages living in your Kafka cluster. You still need to process them. They could, for example, be Cisco ASA firewall messages that need parsing.

So you create an additional Logstash pipeline that pulls the raw syslog data from Kafka and parses it. This pipeline should run on different boxes than the ones that received the data, preferably some quite beefy Logstash nodes. Once parsed, you can send the data back to Kafka if you need to, or you can ingest it into your Elasticsearch cluster at this point:

input {
  kafka {
    group_id => "raw-syslog-group"
    topics => ["raw-syslog"]
    bootstrap_servers => "kafka-01:<port>"
    codec => json
  }
}

filter {
  if "%ASA-" in [message] {
    grok {
      match => [
        "message", "<%{POSINT:syslog_pri}>%{CISCOTIMESTAMP:timestamp} %{SYSLOGHOST:sysloghost} %%{CISCOTAG:cisco_tag}: %{GREEDYDATA:cisco_message}"
      ]
    }
    syslog_pri { }
    # ... further ASA parsing ...
  }
}

output {
  elasticsearch {
    hosts => ["elastic-01:9200"]
    index => "syslog-%{+YYYY.MM.dd}"
  }
}

The trick is the consumer group feature of Kafka. In the example, I specified group_id => "raw-syslog-group". No matter how many Logstash instances have this pipeline running, they will work as a unit with regard to Kafka: Kafka assigns each partition of the topic to exactly one consumer in the group, so the messages are automatically spread across your Logstash nodes.
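
Keep in mind that the partition count of the topic caps how many consumers in the group can work in parallel, so create the topic with some headroom. A sketch, assuming a recent Kafka version (the partition and replication numbers are illustrative):

bin/kafka-topics.sh --create \
  --topic raw-syslog \
  --partitions 6 \
  --replication-factor 2 \
  --bootstrap-server kafka-01:9092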

If you find you need more processing power, deploy an additional Logstash node with this pipeline. You don't have to change your loadbalancer configuration at all.

This setup also makes your life easier if you can centralize your Logstash processing on a few beefy Logstash nodes. That comes in handy if you are thinking of using Memcached for lookups of malicious IPs or domain names in all your syslog messages. Hey, that sounds like a topic for a complete blog post of its own ;)