Lately, I've looked at the considerations that go into acquiring a network monitoring solution. Enterprises with increasingly complex networks need automation to help out the administrators who are tasked with their security and maintenance. We've looked at examples of buying or renting one of these systems. Now it's time to see why anyone would want to build their own.
Cloudant is a company that provides DBaaS (database as a service). They run clusters of cloud machines for big data work. Cloudant has special monitoring needs. Buying a general-purpose monitoring system or renting a monitoring service would have left them with so much customization to do that they decided to build a home-grown monitoring system from open source parts.
Cloudant's monitoring system produces a terabyte of data each day. This is shipped from the many IaaS providers used by Cloudant to a central location. Benjamin Anderson, engineer at Cloudant, built the first iteration of their monitoring system using many components:
- the open source components Sensu, collectd, Riemann and Graphite,
- the SaaS alert service PagerDuty, and
- a lot of home-grown plug-ins.
Why did Benjamin go through the pain of building a monitoring system? Why not buy a system off the shelf, or rent a cloud monitoring service?
Benjamin described the monitoring challenges they faced. "We don't have entirely unique challenges but we have some. We run hundreds of machines that basically all look the same - isomorphic clusters…Managing hundreds of machines that are all designed to look and operate the same is not something that's expressed well in legacy monitoring tools. That's why we had to write our own."
Benjamin expanded on why building new worked better for Cloudant than buying legacy. "The tools you buy are quite poor. OpenNMS, tools like that, are not aging well."
So why are their chosen tools better? "Now, there's a lot of activity in these open source projects - Riemann, Sensu, collectd - and there's a lot of experimentation going on in the communities. Also, they're designed for modern systems. We're provisioning/decommissioning machines quite rapidly, and these old tools don't deal with that well."
Cloud scale is also a challenge. Cloudant manage thousands of nodes across many IaaS providers. "Given our current implementation, we push the limits of what a lot of these open source tools do. If we used legacy tools, we'd probably be in a lot more pain."
Benjamin came up with a two-pronged system for Cloudant.
- Sensu keeps an eye on Cloudant's infrastructure (predominantly physical servers tuned for database performance). Sensu is an open source monitoring framework, very much like Nagios.
- A string of other open source tools keep an eye on the database services.
Data is collected from each host, sent to a monitoring cluster and put into a data store. Graphs are then created from the stored data for dashboards.
A Cloudant host is part of their DBaaS, like an AWS instance running DynamoDB. Benjamin said they collect a lot of time-series data from each host. "We collect around 2,000 unique metrics on each host. We take all of those - and these are just numbers, floating point numbers - we record these every 10 seconds and ship them off to a monitoring cluster." Many of these metrics come directly from the database: some are produced by the Erlang VM and others by other parts of the database. System metrics are collected from lower down the technology stack, but it is the application metrics that get the custom attention.
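To get a feel for what those numbers mean, here is a quick back-of-the-envelope calculation in Python. It only uses figures quoted in this article (2,000 metrics per host, a 10 second interval, and the 45,000 data points per second that Cloudant stores in Graphite). The host count it derives is my own rough estimate, and it assumes every collected metric ends up in Graphite, which the article does not confirm.

```python
# Figures quoted in the article.
METRICS_PER_HOST = 2_000             # "around 2,000 unique metrics on each host"
INTERVAL_SECONDS = 10                # "we record these every 10 seconds"
GRAPHITE_POINTS_PER_SECOND = 45_000  # rate Cloudant stores in Graphite

SECONDS_PER_DAY = 86_400


def points_per_second(metrics: int, interval_s: int) -> float:
    """Average data points one host emits per second."""
    return metrics / interval_s


def points_per_day(metrics: int, interval_s: int) -> int:
    """Data points one host emits in a day."""
    return metrics * (SECONDS_PER_DAY // interval_s)


per_host_rate = points_per_second(METRICS_PER_HOST, INTERVAL_SECONDS)
per_host_daily = points_per_day(METRICS_PER_HOST, INTERVAL_SECONDS)

# Assumption (not from the article): every metric from every host is stored.
implied_hosts = GRAPHITE_POINTS_PER_SECOND / per_host_rate

print(f"per host: {per_host_rate:.0f} points/s, {per_host_daily:,} points/day")
print(f"implied hosts feeding Graphite: {implied_hosts:.0f}")
```

Under that assumption the Graphite figure works out to a couple of hundred hosts, which squares with Benjamin's "hundreds of machines".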
A system daemon called collectd does the collecting, basic reporting and transport. Most of Cloudant's data is handled by their custom collectd extensions.
Benjamin said collectd's advantages are that "it's easy, and has encryption. Data's all encrypted when we send it on the wire."
The Cloudant engineers use the Riemann stream processing tool to act on events. Cloudant have written tools on top of Riemann to analyze data, notify operators and forward the data to their historical data store.
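If stream processing is new to you, the idea is to inspect each event as it arrives rather than querying a database afterwards. The Python sketch below is only an illustration of that idea. It is not Riemann's API (Riemann is configured in Clojure) and it is not Cloudant's code. The `Event` class, the `flag_sustained_high` function, the host name and the threshold are all hypothetical.

```python
from collections import defaultdict, deque
from dataclasses import dataclass
from typing import Iterable, Iterator


@dataclass(frozen=True)
class Event:
    """A simplified event: who reported what, when, and how it looks."""
    host: str
    service: str
    metric: float
    time: int
    state: str = "ok"


def flag_sustained_high(
    events: Iterable[Event], threshold: float, window: int
) -> Iterator[Event]:
    """Yield a 'critical' event whenever the mean of the last `window`
    samples for a (host, service) pair exceeds `threshold`."""
    recent = defaultdict(lambda: deque(maxlen=window))
    for event in events:
        samples = recent[(event.host, event.service)]
        samples.append(event.metric)
        if len(samples) == window:
            mean = sum(samples) / window
            if mean > threshold:
                yield Event(event.host, event.service, mean, event.time, "critical")


# Hypothetical readings, one every 10 seconds, from a made-up host.
readings = [0.2, 0.9, 0.95, 0.97, 0.3]
stream = [Event("db1", "cpu", m, t * 10) for t, m in enumerate(readings)]

alerts = list(flag_sustained_high(stream, threshold=0.9, window=3))
for alert in alerts:
    print(alert)
```

Averaging over a window rather than reacting to a single sample is one common way to avoid waking someone up over a momentary spike. In a real pipeline the alerts would be handed to a notifier instead of printed.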
Cloudant stream data to PagerDuty, the alert SaaS company, who take care of contacting the 24/7 on-call staff.
Cloudant stores 45,000 data points per second in Graphite. Graphite is a historical data store. The data store takes up half a terabyte on disk, in a tight time series data format. Cloudant administrators can query Graphite with questions like "Six months ago, what was the average utilization on x cluster?" Benjamin said Graphite is easy to use: "We have people who are not developers going in and accessing the data, which is a big win for us - getting less technical people accessing the data on these systems."
Graphite also graphs the data (spot the clue in the name). Cloudant use the graph renderer to feed the dashboards administrators use.
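To make the storing and querying a little more concrete, here is a small Python sketch of the two ends of Graphite: the plaintext line format its Carbon listener accepts (metric path, value, Unix timestamp) and a render API URL that asks for an averaged series over a past time range. The metric paths, the host name `graphite.example.com` and the helper function names are my own inventions for illustration, not Cloudant's. Nothing is sent over the network.

```python
from urllib.parse import urlencode


def carbon_line(path: str, value: float, timestamp: int) -> str:
    """Format one data point in Graphite's plaintext protocol:
    '<metric path> <value> <unix timestamp>' followed by a newline."""
    return f"{path} {value} {timestamp}\n"


def render_url(base_url: str, target: str, start: str, end: str) -> str:
    """Build a Graphite render API URL that returns JSON for `target`
    between the relative times `start` and `end`."""
    query = urlencode(
        {"target": target, "from": start, "until": end, "format": "json"}
    )
    return f"{base_url}/render?{query}"


# Hypothetical metric naming scheme: clusters.<cluster>.<host>.cpu.utilization
line = carbon_line("clusters.x.db1.cpu.utilization", 0.42, 1400000000)

# "Six months ago, what was the average utilization on x cluster?"
url = render_url(
    "http://graphite.example.com",                  # hypothetical server
    "averageSeries(clusters.x.*.cpu.utilization)",  # average across hosts
    "-7mon",
    "-6mon",
)

print(line, end="")
print(url)
```

The same render endpoint can return an image instead of JSON, which is how a dashboard can be fed straight from the data store.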
Cloudant engineers went to a lot of effort to build a monitoring solution rather than buy or rent one. Building your own service means investing significant time. Cloudant have been developing their solution for the past two years, and there's still active development going on.
But for Cloudant, a rock-solid customer service is essential. And that means making no compromises with the monitoring.
Nick Hardiman builds and maintains the infrastructure required to run Internet services. Nick deals with the lower layers of the Internet - the machines, networks, operating systems, and applications. Nick's job stops there, and he hands over to the designers and developers who build the top layer that customers use.