Sensu – A Monitoring Framework
November 3, 2011
At Sonian, we monitor an ever-changing number of Amazon EC2 instances. As of this post, that number is 476, and it will rise and fall before the day is done. With the “elastic” nature of our infrastructure, monitoring EC2 instances is not a trivial task.
We have found the available tools from the community toolbox to be inadequate when operating in “the cloud.” Until recently, Sonian utilized several tools to monitor systems and collect metrics: Nagios, Collectd, Graphite, and Ganglia.
The Evolution of Nagios at Sonian
Our servers are grouped into “stacks”, providing isolated environments that are globally distributed. In the past, a Nagios server would reside in each one. These servers were responsible for monitoring the components of their stack, triggering notifications when something was amiss. Check coverage gradually increased over time, as applications began to require more moving parts. As the number of stacks increased, a centralized view of the organization was desired. To appease the engineering teams, a distributed Nagios solution was created. The monitoring server in each stack would forward its check results to a central Nagios server running the Nagios Service Check Acceptor (NSCA). The central server ran the Nagios web interface, displaying the status of every client and service under our control. Notifications could only be triggered by the central server, making it easier to silence notifications for a client or one of its services.
Nagios is NOT purpose-built for the cloud. It expects your environment to be fairly static, with every aspect of it under your control. The initial release was May 14th, 1999. The concept of elastic Infrastructure as a Service (IaaS) didn’t exist at the time of its creation, and it has yet to adapt to the new paradigm.
Nagios’ inability to discover clients is an excellent indication of its antiquity. Nagios must know of every client, group, and service at startup. When a new server is spun up, the Nagios configuration must be updated and the service reloaded in order to begin monitoring it. Configuration Management is commonly used as a partial solution: it discovers servers by some method, rewrites the configuration, and then triggers a service reload. It’s not a complete solution, as the process usually only happens on a set interval, or is too involved for frequent changes. Distributing Nagios in a tiered fashion only complicates this further, making it far more difficult to begin monitoring a new server or deploy new checks. The following diagram depicts a sample of events that would require Nagios configuration changes.
Our problems with Nagios
- Configuration is unpleasant and restrictive
- Cannot discover new servers on its own
- Easily overwhelmed with a high number of clients and/or checks
- Difficult to extend
A Brief Introduction to Sensu
Enter Sensu, a monitoring framework that aims to be simple, malleable, and scalable.
The Building Blocks
In this modern world of computing, we’re blessed with ever-improving Configuration Management (CM) tools, such as Opscode Chef and Puppet. These tools already gather the information needed to effectively and efficiently monitor your systems. Not only are they a rich source of data, but they can also handle the distribution of supporting libraries and plugins. Sensu was built with the intention of being paired with a CM tool.
Message-oriented middleware is commonly used by developers to decouple and distribute components of their applications. Sonian currently uses RabbitMQ for all sorts of job queues. For example, RabbitMQ allows a Rails application to communicate with a backend written in Clojure, without any knowledge of its status or implementation.
Sensu uses RabbitMQ to securely route check requests and results, making it possible to scale out and back in on demand.
Open source key-value data stores have been around for a long time, and have recently gained a lot of attention with NoSQL being all the rage. Redis is a very fast in-memory “data structure server” with keys that can contain strings, hashes, lists, sets, and sorted sets. Its support for atomic operations and its ability to persist to disk have made it a common choice for new projects. Sensu uses Redis as a non-persistent database, to store client and event data.
The idea behind Sensu is simple: schedule the remote execution of checks and collect their results. As mentioned above, Sensu uses RabbitMQ to route check requests and results, and this is the secret sauce. Checks always have an intended target: servers with certain responsibilities, such as serving web pages (web server) or storing data (Elasticsearch). A Sensu client has a set of subscriptions based on its server’s responsibilities, and the client executes the checks that are published to these subscriptions. A Sensu server has a result subscription, which is where clients publish check results. Since each component only connects to RabbitMQ, there is no need for an external discovery mechanism; new servers are monitored immediately.
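The routing described above can be sketched with a toy, in-memory model. This is not Sensu’s code and it does not talk to RabbitMQ; the Broker and Client classes, the subscription names, and the check definition are all assumptions made for illustration.

```python
from collections import defaultdict, deque


class Broker:
    """Toy in-memory stand-in for the message broker (hypothetical, not RabbitMQ)."""

    def __init__(self):
        self.subscribers = defaultdict(list)  # subscription name -> client callbacks
        self.results = deque()                # stands in for the server's result queue

    def subscribe(self, subscription, callback):
        self.subscribers[subscription].append(callback)

    def publish_check(self, subscription, check):
        # Fan the check request out to every client bound to this subscription.
        for callback in self.subscribers[subscription]:
            callback(check)

    def publish_result(self, result):
        self.results.append(result)


class Client:
    """Hypothetical client: subscribes by responsibility and reports results."""

    def __init__(self, name, subscriptions, broker):
        self.name = name
        self.broker = broker
        for subscription in subscriptions:
            broker.subscribe(subscription, self.execute)

    def execute(self, check):
        # A real client would run check["command"]; this sketch pretends it passed.
        self.broker.publish_result(
            {"client": self.name, "check": check["name"], "status": 0}
        )


broker = Broker()
Client("web-01", ["webserver"], broker)
Client("es-01", ["elasticsearch"], broker)
# A server started later is never registered centrally; it only subscribes.
Client("web-02", ["webserver"], broker)

broker.publish_check(
    "webserver",
    {"name": "check_http", "command": "check_http -H localhost"},
)
reporters = sorted(result["client"] for result in broker.results)
print(reporters)
```

In this model only the two clients subscribed to "webserver" report a result. The late-started web-02 is picked up without any change to the publisher, which is the property the paragraph above attributes to routing everything through the message broker.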
Sensu is written entirely in Ruby, using the EventMachine library for single-process concurrency. This has produced a fully functional, clean, and small code base.
Sonian has chosen to make it publicly available on GitHub.
All configuration is done with JSON files, making it easy for Configuration Management and other automation tools to create and read them. The following are configuration snippets.
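As a rough illustration of how an automation tool could generate such files, here is a minimal Python sketch. The key names (client, subscriptions, checks, command, subscribers, interval), the helper functions, and all values are assumptions made for illustration and are not confirmed by this post; consult the GitHub repository for the actual schema.

```python
import json


def render_client_config(name, address, subscriptions):
    """Build a client configuration document (hypothetical key names)."""
    return {
        "client": {
            "name": name,
            "address": address,
            "subscriptions": sorted(subscriptions),
        }
    }


def render_check_config(name, command, subscribers, interval):
    """Build a check definition (hypothetical key names)."""
    return {
        "checks": {
            name: {
                "command": command,
                "subscribers": sorted(subscribers),
                "interval": interval,  # seconds between check requests
            }
        }
    }


# Values a Configuration Management tool might already know about a node.
client_json = json.dumps(
    render_client_config("web-01", "10.0.0.12", ["webserver"]),
    indent=2,
    sort_keys=True,
)
check_json = json.dumps(
    render_check_config("check_http", "check_http -H localhost", ["webserver"], 60),
    indent=2,
    sort_keys=True,
)

print(client_json)
print(check_json)
```

Because the output is plain JSON, the same documents can be read back by any other tool with a JSON parser, which is the convenience the paragraph above describes.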
We hope this brief introduction to Sensu has piqued your interest. For the nitty-gritty, please check out the GitHub repository, and jump on IRC (irc.freenode.net #sensu).