
Grabbing Full Java Stack Traces from Syslog-ng with Logstash

For a lot of companies, logging is a big deal. It helps developers debug code, helps site administrators troubleshoot malfunctioning servers, and surfaces symptoms of bigger problems. Managing and interpreting those logs is a challenge, however, particularly for big sites like Tagged, where thousands of servers each generate hundreds of gigabytes of data.

Here at Tagged, we handle Apache logs with syslog-ng and Java logs with log4j. But with terabytes of data, keeping up with storage growth is a constant issue. Instead of repeatedly copying data to larger disks or wiping old data, we moved to an elastic storage model with ElasticSearch: an open-source, Lucene-based, RESTful search engine. With ElasticSearch we can expand storage easily and quickly, simply by building new hosts and adding them to the cluster. The beauty is that new hosts automatically discover the ElasticSearch cluster, load balance, and recover. The only problem: efficiently parsing syslog-formatted data and getting it into JSON for ElasticSearch so that it is easily and quickly retrievable.

Enter Logstash. Logstash is a tool for managing events and logs. It gives users the ability to filter and transform data from multiple input sources and output it in different formats.

In today’s post, we’ll share how to get started with Logstash and how we took input data from syslog-ng and parsed it into JSON for ElasticSearch.

Diving into Logstash
We used Logstash as a parsing engine, handling a TCP connection from the syslog-ng server that hosts our error logs. Using the TCP input, Logstash wraps each syslog line in an event with fields including @source, @type, @timestamp, @source_host, @source_path, and @message.

Typically, syslog error messages look something like this (the host, program, and message below are illustrative):
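
    <13>Jun  5 16:25:31 web01 httpd[2342]: [error] [client 10.0.0.1] File does not exist: /var/www/html/favicon.ico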

With an error message like this, Logstash can parse the message and create JSON output for ElasticSearch. A basic configuration file looks something like the following (the grok pattern shown is one plausible way to capture the fields described below):
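
    input {
      tcp {
        # syslog-ng forwards error logs to this port
        port => 1514
        type => "all"
      }
    }

    filter {
      grok {
        type => "all"
        # capture the bracketed priority, then parse the rest of the
        # line with the stock SYSLOGLINE pattern
        pattern => "<%{POSINT:priority}>%{SYSLOGLINE}"
      }
    }

    output {
      elasticsearch {
        type => "all"
        # join the "logsearch" cluster and write to the "stash" index;
        # no host is needed because Logstash discovers the cluster
        cluster => "logsearch"
        index => "stash"
      }
    }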

In this configuration file, things are broken down into inputs, filters, and outputs, with Logstash initially taking TCP input on port 1514. The “type” field is simply a label used to identify and operate on the same data; it becomes important when your configuration file contains multiple inputs, filters, and outputs.

In the filter section we use a “grok” filter. A grok filter is a grep-like function that parses arbitrary text and structures it. In this case we operate on the “all” type, assign the field name “priority” to the positive integer found in angle brackets, and match the rest of the line against a pattern named “SYSLOGLINE”. %{SYSLOGLINE} parses a typical syslog error message and extracts several fields, including the timestamp, logsource, program, pid, and message; together with the priority, these are grouped under @fields in the JSON document built from each message. For the sample message above, the JSON will look something like this (values are illustrative):
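
    {
      "@source"      : "tcp://0.0.0.0:1514/",
      "@type"        : "all",
      "@timestamp"   : "2012-06-05T16:25:31.000Z",
      "@source_host" : "web01",
      "@source_path" : "/",
      "@message"     : "<13>Jun  5 16:25:31 web01 httpd[2342]: [error] [client 10.0.0.1] File does not exist: /var/www/html/favicon.ico",
      "@fields"      : {
        "priority"  : [ "13" ],
        "timestamp" : [ "Jun  5 16:25:31" ],
        "logsource" : [ "web01" ],
        "program"   : [ "httpd" ],
        "pid"       : [ "2342" ],
        "message"   : [ "[error] [client 10.0.0.1] File does not exist: /var/www/html/favicon.ico" ]
      }
    }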

For outputs, we configure ElasticSearch to use the same type we have been operating on (all). The cluster name (logsearch) and the index (stash) must be specified so that Logstash can join the ElasticSearch cluster; a host name is not required, as Logstash discovers and joins the cluster automatically.

Multi-line Error Messages
While this sample configuration file works fine for single-line Apache errors, some errors, such as Java stack traces, span multiple lines. To parse these messages and enable multiline support for Java stack traces, an additional “multiline” filter must be added.

Typically, Java stack traces can be grouped by finding a tab at the start of the lines that follow an error message. In our case, however, syslog-ng prepends the priority, timestamp, logsource, program, and pid to each message, so Java stack traces cannot be recognized or processed simply by identifying tabs on new lines.

Instead, the multiline filter must be given a pattern that recognizes multiline stack traces. In our case we look for messages that begin with “at” or contain the word “Exception” somewhere within the message, and append them to the previous message.

Lastly, we set the stream identity to “%{logsource}.%{@type}”. Because all messages arrive over a single input, the stream identity lets the filter group messages by their specific logsource, preventing messages from different log sources from being joined into the same stack trace. The filter itself looks something like this (the pattern is an approximation of the matching rules just described):
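
    multiline {
      type => "all"
      # join lines that look like stack-trace frames ("at ...") or that
      # mention an Exception onto the previous message
      pattern => "(^\s*at )|(Exception)"
      what => "previous"
      # group lines per originating host so interleaved streams from
      # different log sources are never merged into one stack trace
      stream_identity => "%{logsource}.%{@type}"
    }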

Here’s the full Logstash configuration file, assembled from the sketches above:
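
    input {
      tcp {
        # syslog-ng forwards error logs to this port
        port => 1514
        type => "all"
      }
    }

    filter {
      grok {
        type => "all"
        # capture the bracketed priority, then parse the rest of the
        # line with the stock SYSLOGLINE pattern
        pattern => "<%{POSINT:priority}>%{SYSLOGLINE}"
      }
      multiline {
        type => "all"
        # join stack-trace continuations onto the previous message
        pattern => "(^\s*at )|(Exception)"
        what => "previous"
        # keep each log source's lines in its own stream
        stream_identity => "%{logsource}.%{@type}"
      }
    }

    output {
      elasticsearch {
        type => "all"
        cluster => "logsearch"
        index => "stash"
      }
    }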

Conclusion

Using Logstash and ElasticSearch, we are able to have separate, scalable tiers for our data logging systems. With Logstash, we can send traffic by app type to different Logstash instances and have them output to ElasticSearch. With ElasticSearch, we have a scalable, distributed back-end to store our logs that is also easily manageable.

With data stored in ElasticSearch, there are also several log management front-ends that can be used. After looking at Logstash’s internal log manager, Graylog2, and Kibana, we settled on Kibana because it had a cleaner interface and was written in PHP, meaning it can be extended by our own developers.

Got questions or tips to add? Drop us a comment below.


Justin Wong was a software engineering intern this spring as part of the SiteOps team.