Tuesday, March 6, 2012

Logfile Aggregation and Analytics

One of the more important concerns of any organization that runs a public website is being able to troubleshoot the applications and sites it operates. A typical web application runs on the familiar architecture of one or more web servers, a load balancer, a cluster of application servers and a database layer. Additionally, there may be other enterprise-level components such as a message broker, a web services layer, an ESB, etc.
A common feature of this architecture is that most of the components produce operational log files. Some are out-of-the-box logs, while others are configured and written by each application. Another common feature is that the operations/production support teams keep these applications running by parsing these log files to troubleshoot issues reported by end users, monitoring tools and internal test teams.
A typical strategy (which is really the lack of a strategy) is for a person or a team to download log files from different locations to a central place (or perhaps the logs already land on a dedicated log file server) and run grep/find/vi commands to search for incidents based on a given timestamp and/or an error message expected to appear in the log file. Many of you reading this will empathize (because you have done the same in the past) while simultaneously shaking your heads and pitying those poor souls (because you feel their pain). The next logical step is for someone on your team to take the initiative and write a utility that searches through the files and displays some sort of report on a web page, with limited search capabilities. At some point you get tired of maintaining this homegrown search app and start looking for alternatives in the world at large. A client of ours was at this point when they asked us to come in and help define their log management strategy. What follows is a high-level review of three options and a more detailed look at one of them.
We considered three options – not with any particular bias toward excluding the others out there; it just so happened that we had references for the ones we chose. We will first present the two commercial solutions, which operate on different principles and, being commercial, have all the relevant documentation and instructions on their websites.
  • Splunk
  • Loggly
  • Logstash + Graylog2

Splunk

Splunk (www.splunk.com) is a commercial product that is free to download and install. As of this writing, Splunk is free for up to 500 MB of indexed data per day; visit their website for the latest pricing and licensing details. Splunk's feature set has evolved over the years, and it now offers a rich RIA interface for managing your Splunk instance as well as performing analytics and queries. The UI is impressive, letting you administer the instance on the one hand and manage and query your indexed data on the other. Splunk stores incoming data as compressed files. This data is then indexed, and the index is stored alongside the raw compressed data for fast and easy querying. Splunk ages your data through four levels – hot (current), warm (recent), cold (older than recent) and frozen (typically very old and a candidate for removal from the index). You can customize the maximum storage sizes and the schedule by which data ages, and you can choose to move frozen data to backup rather than have Splunk delete it, which is what it does by default.
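Retention and aging are controlled per index in indexes.conf. A sketch along the following lines (the index name and limits are made up for illustration) caps an index's total size, freezes events after roughly 90 days and archives frozen buckets to a backup directory instead of deleting them:

[web_app_logs]
homePath   = $SPLUNK_DB/web_app_logs/db
coldPath   = $SPLUNK_DB/web_app_logs/colddb
thawedPath = $SPLUNK_DB/web_app_logs/thaweddb
# cap hot + warm + cold storage at roughly 100 GB
maxTotalDataSizeMB = 100000
# freeze (age out) events older than ~90 days
frozenTimePeriodInSecs = 7776000
# archive frozen buckets instead of deleting them (deletion is the default)
coldToFrozenDir = /backup/splunk_frozen/web_app_logs
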
Being a commercial product, you can expect support and an expert knowledge base behind the offering. There are also out-of-the-box recipes to quickly set up and manage your instance. Splunk can run in a clustered environment for high availability and performance. Additionally, there is a concept of Splunk Apps targeted at common products that are candidates for monitoring – Apache web servers, F5 devices, etc. The query language is custom but reasonably intuitive, and it will feel familiar if you have used an indexing engine before.
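To give a flavor of the query language, here is a hypothetical search (assuming Apache access logs indexed with the access_combined sourcetype) that charts server errors per host over time:

sourcetype=access_combined status>=500 | timechart span=1h count by host
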
In my test, I saw the following storage compression ratios:
  • 14 MB of log data – stored as 1 MB of compressed raw data plus a 5.7 MB index
  • 55 MB of log data – stored as 5.7 MB of compressed raw data plus a 15 MB index
Splunk offers a range of products in addition to Log Management – Network Management, Security and Compliance, IT Operations etc.

Loggly

Loggly (www.loggly.com) is another commercial product, offering a cloud-based solution for log management. There is no software to install and manage, since everything is hosted by Loggly. You configure your applications/infrastructure to forward logs to Loggly's servers using one of their prescribed methods – TCP, HTTP, file upload, etc. At the time of this writing, Loggly is free for up to 200 MB of indexed data per day with a 7-day retention policy. Retention deserves attention because you upload all of your data to Loggly, and that data becomes the basis for your queries and searches. Loggly offers a maximum retention period of 90 days, after which you can use Loggly's archive feature to store logs in your own S3 bucket. Loggly supports secure logging via TLS/HTTPS for all your data uploads. A rich web interface allows you to manage/administer your account and run queries. The query language is custom but reasonably intuitive, and it will feel familiar if you have used an indexing engine before.
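To give a sense of how little plumbing is involved, pushing an event over the HTTP input is essentially a POST to your input's URL. The input key below is a placeholder and the exact endpoint may change, so check Loggly's documentation for your account:

curl -d "2012-03-06 10:15:42 ERROR OrderService - failed to process order 1234" \
  https://logs.loggly.com/inputs/your-input-key-here
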
When you sign up, you set up a sub-domain with loggly.com that becomes your log management site. You can log in and administer "your site" by creating users, input methods, etc. Because there is no software to install, there is no out-of-the-box collection agent; you need to proactively send data to Loggly using one of their prescribed input methods. Loggly also lets you embed a JavaScript snippet in your web application pages to send data to Loggly directly from the application layer.

Logstash + Graylog2

The third option is perhaps the most interesting. Here we look at two open source/free applications, each of which is a complete solution in itself, but which are even better when combined to take advantage of their particular strengths.
Logstash (www.logstash.net) and Graylog2 (www.graylog2.org) are two products that play well together. Logstash can be the "backend" that monitors and collects data from different sources, and Graylog2 can be the presentation layer, with a richer UI than what Logstash provides.
Logstash is easy to install and run. It works on a simple Input-Process-Output model that is configured in a *.conf file. The input section specifies which files/sources to read. The process section – formally called "filter" – defines how the incoming data should be massaged by applying regular expressions, rules, etc. Finally, the output section defines where the data should end up. By default, the current version of Logstash uses the ElasticSearch engine to index data. You can either run an embedded instance of ElasticSearch or point the output to an external instance, which could be a robust clustered installation. One of the output options is to direct data to Graylog2 running as a separate process; you do this to take advantage of Graylog2's richer UI features, which are clearly better than Logstash's.
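Before wiring up real log files, it is worth sanity-checking the pipeline with a minimal configuration. The sketch below (names are arbitrary) reads lines typed on stdin and echoes the resulting events to stdout:

input {
  stdin {
    type => "smoke-test"
  }
}

output {
  stdout {
    debug => true
  }
}

Run it with the monolithic jar (java -jar logstash-<version>-monolithic.jar agent -f test.conf), type a line or two, and you should see each line come back as an event with Logstash's metadata fields added.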
Logstash has a variety of ways to read data – TCP, files, syslog, etc. If you choose to simply point Logstash at your log files, be mindful that this translates into a "tail -0" on those files, so data starts feeding into Logstash from the point in time at which monitoring begins; prior data is not read. There are easy ways around this, but you have to handle them manually. You can also run Logstash as a network of file readers (Logstash instances that only read and forward data) on different physical machines, all forwarding to a single indexing Logstash instance that is responsible for processing the data and either indexing it for search or forwarding it on to Graylog2.
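A minimal sketch of that topology – hostnames and port invented, and with the caveat that in practice both ends also need to agree on a wire format (e.g. json) – puts a reader configuration like this on each application server:

input {
  file {
    type => "app-log"
    path => "/var/log/myapp/*.log"
  }
}

output {
  # ship events to the central indexing instance
  tcp {
    host => "logindexer.example.com"
    port => 5140
    mode => "client"
  }
}

and a central indexing instance that listens on the same port:

input {
  tcp {
    type => "app-log"
    port => 5140
    mode => "server"
  }
}

output {
  elasticsearch { embedded => true }
}
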
Graylog2 is another log management tool, and Logstash interfaces with it seamlessly. Logstash can output data as GELF (Graylog2 Extended Log Format) to be consumed by Graylog2. For those who wish to aggregate and search application logs written by Java apps using log4j, there is an easy way to write GELF data using the gelf4j jar. You define a gelf4j appender (https://github.com/pstehlik/gelf4j) in your log4j properties as shown below and, voila, you have GELF data. GELF is a reader-friendly format that captures log data as JSON, which makes it wonderfully easy to index and search. Note that your Logstash forwarders will then be operating on GELF log data, and your processing layer should account for that. If you choose to write only via traditional log4j appenders, you can instead use Logstash's ability to convert data to GELF before forwarding it to Graylog2.

log4j.appender.GELF=com.pstehlik.groovy.gelf4j.appender.Gelf4JAppender
log4j.appender.GELF.graylogServerHost=my.gelf.host
log4j.appender.GELF.host=www13
log4j.appender.GELF.facility=local1
# use the GELF appender. set the level to INFO.
log4j.category.com.my.class.where.i.DoLogging=INFO, GELF
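
With that appender in place, each log event travels to Graylog2 as a GELF message – essentially a small JSON document. A hand-written illustration (not actual appender output) of what such a message looks like:

{
  "version": "1.0",
  "host": "www13",
  "facility": "local1",
  "level": 6,
  "timestamp": 1331050542,
  "short_message": "Order 1234 processed successfully",
  "full_message": "Order 1234 processed successfully",
  "_logger_name": "com.my.class.where.i.DoLogging"
}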

Putting it all together:
  • I chose to use a combination of Logstash and Graylog2. With this choice, you will also end up using MongoDB and ElasticSearch to support your log aggregation. I did not use a gelf4j appender.
  • I chose to monitor files in a fixed location – in this case, the directory where the log files were being written.
  • I chose to monitor logs from a Java app running on Tomcat v6 and from Apache Web Server.
Install MongoDB
Start MongoDB (mongod), then open the mongo shell and run the following:

mongo
use admin
db.addUser('admin', 'grayloguser-password')
db.auth('admin', 'grayloguser-password')
use graylog2
db.addUser('grayloguser', 'grayloguser-password')
db.auth('grayloguser', 'grayloguser-password')

Install ElasticSearch
For my initial research I had the option of using the embedded ElasticSearch that ships inside the Logstash monolithic jar. I chose the embedded server with its default settings, so I did not need a separate install; ElasticSearch is started automatically during Logstash startup on its default port.
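If you later outgrow the embedded instance, switching is largely a matter of changing Logstash's output section to point at the external cluster – something along these lines (hostname and cluster name are invented for illustration):

elasticsearch {
  host => "es01.example.com"
  cluster => "logsearch"
}
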
Install Graylog2
After following the installation instructions and copying the sample configuration file, I had to edit /etc/graylog2.conf. (Note: you may need sudo access.)
I updated the MongoDB section with the Graylog2 database name, user id, password and host. I set the host to 127.0.0.1 instead of localhost on my MacBook Pro. The MongoDB-related settings should match the parameters you used in the "Install MongoDB" step above. Also, once you install the web interface, you will need to update the *.yml files in its "config" directory (specifically mongoid.yml for the MongoDB properties).
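For reference, the MongoDB portion of graylog2.conf ends up looking roughly like the following (key names can differ slightly between Graylog2 releases, so cross-check against the sample file you copied):

mongodb_useauth = true
mongodb_user = grayloguser
mongodb_password = grayloguser-password
mongodb_host = 127.0.0.1
mongodb_database = graylog2
mongodb_port = 27017
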
Download Logstash Monolithic Jar
Since I know the input is files, I need a way to read that input and transform it a bit so that it is genuinely useful for search and display. Enter Logstash. I downloaded the monolithic jar file from www.logstash.net. (Note: while there, do make some time to watch the Logstash presentation by the author.) Once the jar was downloaded, I created a mylogstash.conf file in my text editor. You can read more about the configuration format on the Logstash website. My conf file looks like this:

input {
  file {
    # the log file to watch; "type" tags these events for the filters below
    type => "log4j-ml-55"
    path => "/workarea/app40_CPU10_Logs/app40_CPU10_server8.log"
  }
}

filter {
  # collapse stack traces: any line that does not start with "[20" belongs to the previous event
  multiline {
    type => "log4j-ml-55"
    pattern => "^\[20"
    negate => true
    what => "previous"
  }
  # rewrite @source_host so events carry the originating source label rather than the central log server's name
  mutate {
    type => "log4j-ml-55"
    replace => ["@source_host", "log4j-ml-55"]
  }
}

output {
  stdout {}
  elasticsearch { embedded => true }
  gelf {
    facility => "logstash-gelf"
    host => "127.0.0.1"
    sender => "app40_CPU10_server8"
  }
}

The filter section is specific to the situation in my log file. Since I am collecting logs from a Java app, I need to make sure stack traces are treated as a single incident/entry; the pattern in the multiline filter tells Logstash that if a line does not start with the characters "[20", it should be treated as a continuation of the previous line and appended to it. The mutate filter is interesting. Because my log files come from a centralized logging server, the remote host would point to that server, but what I really need as the source host is the original host that generated the file, and that is precisely what the mutate filter accomplishes.
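To make the multiline behavior concrete, here is a made-up log4j entry with a stack trace:

[2012-03-06 10:15:42,123] ERROR com.my.app.OrderService - Unable to process order
java.lang.NullPointerException
    at com.my.app.OrderService.process(OrderService.java:42)
    at com.my.app.web.OrderController.submit(OrderController.java:87)

Only the first line matches "^\[20", so the exception and stack trace lines get folded into the previous event instead of being indexed as three separate entries.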
Start it all up
Make sure MongoDB is up. If you are using a separate ElasticSearch instance, start it up. If you are using the embedded ElasticSearch, start Logstash with the following command:
java -jar logstash-1.1.0beta8-monolithic.jar agent -f mylogstash.conf -- web --backend elasticsearch:///?local

Once Logstash is up and running, start the Graylog2 server and then the Graylog2 web client, in that order. Visit www.graylog2.org for details on how to start the server and the web client.
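For orientation, starting the two pieces typically boils down to something like the following (script names and commands vary by release, so follow the instructions on graylog2.org for the versions you installed):

# from the graylog2-server directory
java -jar graylog2-server.jar &

# from the graylog2-web-interface directory (a Rails app)
bundle install
rails server
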
You are all set. As your application and web logs are written, the Logstash-to-Graylog2 pipeline routes them through to the Graylog2 web interface, where you can run searches and analytics.
The Graylog2 web interface is available at http://127.0.0.1:3000.

Conclusion

Log monitoring is a critical tool in your troubleshooting arsenal. Depending on the information you capture in your log files, it can also be a strategic tool for performing analytics and searches, which in turn enables you to be a more customer-focused organization. Because the quality of your analytics depends on the input data, it may be a good idea to set up your monitoring using free/open-source software like Logstash + Graylog2 and then refine your application code to fine-tune your logging input. This is not a feature that will be fully functional on day one of going live; you will have to spend time tuning your entire setup to arrive at optimal input data and output analytics.
Some things to consider:
  • Treat this as a project with a definite charter and expected outcomes
  • Involve business and IT ops to define the analytics output desired
  • Involve your development teams to ensure that the data captured in the log files will support the analytics requirements
  • Ensure buy-in from teams for continuous improvements
  • Prototype with free/open-source tools initially.
This will certainly involve your business and IT stakeholders defining the analytics requirements upfront. Your development teams must also dedicate time to designing and implementing an effective logging approach that captures the required data in the required format. With a little effort upfront and a commitment to continuous improvement, you will get to the Promised Land of log aggregation and analytics.