Most professional sysadmins can probably relate to the excitement, and simultaneous strain, of keeping track of all the different technologies available for use. Not only do we have to decide how and what to add to our toolbox, but we also have to be on the lookout for what’s coming next.
Sometimes we’re lucky enough to bump into an old (technology) friend that’s grown and matured over the years, such as sFlow. I’ve used a number of Foundry (now Brocade) switches at different companies over the years and they all implemented sFlow. At each of these jobs I would send all my sFlow data to various collectors and I was constantly amazed at the power and versatility of this technology.
sFlow excels at doing things like showing you the “top talkers” on a network segment. It does this by sampling the packet stream and letting you see what it sees, which is much more efficient than trying to capture every packet. Because you can adjust the sampling rate based on the packet rates you experience, you can handle much larger volumes of traffic while keeping a high degree of confidence in your data. I always thought it would be great if I could get this level of visibility on my application tier, and now I can.
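The scaling math is simple to sketch. In this minimal example, the sampling rate and packet sizes are hypothetical, not taken from any real switch:

```python
# With 1-in-N packet sampling, each sampled packet stands in for roughly N
# packets on the wire, so estimated totals are samples * N. The rate and
# packet lengths below are illustrative only.

SAMPLING_RATE = 400  # hypothetical: 1 in 400 packets is sampled

def estimate_totals(sampled_lengths, rate=SAMPLING_RATE):
    """Scale sampled packet lengths up to estimated wire totals."""
    est_packets = len(sampled_lengths) * rate
    est_bytes = sum(sampled_lengths) * rate
    return est_packets, est_bytes

# 1,000 samples averaging 800 bytes each:
print(estimate_totals([800] * 1000))  # (400000, 320000000)
```

Because the estimate is statistical, confidence grows with the number of samples, which is why adjusting the rate to the traffic level matters.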
The sFlow community has been making great strides with Host sFlow, which takes some of the same great characteristics from the network sFlow standard and applies them on the host and application side. This means that you can find out which URLs are receiving the most hits, which memcached keys are the hottest and how these numbers correlate with what you are seeing on the network.
Setting up Host sFlow couldn’t be easier. First, download packages for FreeBSD, Linux, or Windows from SourceForge. Once installed on Linux, the daemon checks /etc/hsflowd.conf at startup to find out where the sFlow collector(s) are located; this is where it will send all its data. You can also set things like polling and sampling rates in this file, or, if you wish, define these settings centrally via DNS Service Discovery (DNS-SD).
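As a sketch, a minimal /etc/hsflowd.conf might look like the following. The addresses and rates are placeholders; check the Host sFlow documentation for the exact options your version supports:

```
sflow {
  polling = 30            # seconds between counter polls
  sampling = 400          # 1-in-400 sampling
  collector {
    ip = 10.0.0.50        # placeholder collector address
    udpport = 6343        # the standard sFlow port
  }
}
```

You can list multiple collector blocks if you want the same data sent to more than one destination.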
You will also need a collector. The simplest collector is sflowtool, which captures the packets and presents them to you in various formats, all of which are consumable by your favorite scripting language. There are many collectors to choose from; at Tagged, one of our favorites is Ganglia!
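For example, a tiny “top talkers” tally over sflowtool’s line-oriented output might look like this sketch. The FLOW record layout can vary between sflowtool versions, so the source-IP field index here is an assumption you would need to verify against your build:

```python
import collections

# Sketch: tally "top talkers" from sflowtool's comma-separated FLOW
# records. SRC_IP_FIELD is an assumed index -- check it against the
# actual output of your sflowtool before trusting the results.
SRC_IP_FIELD = 9

def top_talkers(lines, n=3):
    counts = collections.Counter()
    for line in lines:
        fields = line.strip().split(",")
        if fields[0] == "FLOW" and len(fields) > SRC_IP_FIELD:
            counts[fields[SRC_IP_FIELD]] += 1
    return counts.most_common(n)

# Synthetic records, shaped only loosely like real sflowtool output:
demo = [
    "FLOW,agent,0,1,mac1,mac2,0x0800,0,0,10.0.0.1,10.0.0.9",
    "FLOW,agent,0,1,mac1,mac2,0x0800,0,0,10.0.0.1,10.0.0.9",
    "FLOW,agent,0,1,mac3,mac2,0x0800,0,0,10.0.0.2,10.0.0.9",
    "CNTR,agent,2",
]
print(top_talkers(demo))  # [('10.0.0.1', 2), ('10.0.0.2', 1)]
```

In practice you would pipe sflowtool’s output straight into a script like this and scale the counts by the sampling rate.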
As of version 3.2, Ganglia can understand and process sFlow packets. At Tagged, we have replaced all our gmond processes with hsflowd.
One of the great things about replacing our gmond processes is that our monitoring infrastructure is now much more efficient. With gmond, every metric you measure sends its own packet across the wire: if you sample every 15 seconds, that’s one packet every 15 seconds for each metric you monitor. With hsflowd, you can still sample every 15 seconds, but hsflowd batches all those metrics into a single packet before sending them across the wire. We are actually able to collect more metrics, more often, with fewer packets.
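The arithmetic is easy to sketch. The fleet size and metric counts below are hypothetical, and the model ignores MTU limits (a very large metric set may still need more than one datagram):

```python
# Rough packets-per-second comparison: a gmond-style sender emits one
# packet per metric per interval, while an hsflowd-style sender batches a
# host's metrics into (roughly) one sFlow datagram per interval.

def fleet_pps(hosts, metrics_per_host, interval_s, batched):
    packets_per_host = 1 if batched else metrics_per_host
    return hosts * packets_per_host / interval_s

# hypothetical fleet: 1,000 hosts, 40 metrics each, 15-second interval
print(round(fleet_pps(1000, 40, 15, batched=False)))  # 2667 pps
print(round(fleet_pps(1000, 40, 15, batched=True)))   # 67 pps
```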
On a big network like Tagged’s, anything we can do to lower our packets per second is a big win. The difficult part was converting from multicast to unicast; we took it as an opportunity to templatize all our Puppet configs based on our CMDB. Now we have a system that we really love.
A Standard, Really
Perhaps one of the most difficult things to wrap our heads around was that sFlow is not a replacement for our Ganglia or Graphite tools. sFlow is a standard on switches, and it is a standard on the host side too. That doesn’t mean you can’t instrument your own applications with sFlow, but custom metrics are not the default configuration. If you are going to look at your HTTP metrics – whether they come from Apache, Nginx, or Tomcat – they are going to be the same standard metrics.
If you want to monitor things like the number of active users on your site, you can still do that with gmetric or Graphite. However, if you want to find out how many of your HTTP requests return 200, 300, or 500 response codes – in real time, across a huge Web farm where log analyzers and packet sniffers are completely impractical – then you want mod-sflow (for Apache).
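As a sketch of the idea, here is how sampled status codes scale up to per-class totals. The statuses and sampling rate below are synthetic, not real mod-sflow output:

```python
import collections

# Sketch: with 1-in-N request sampling, each sampled request represents
# roughly N real requests, so class totals are sample counts * N.

def status_mix(sampled_statuses, sampling_rate):
    classes = collections.Counter()
    for status in sampled_statuses:
        classes["%dxx" % (status // 100)] += sampling_rate
    return dict(classes)

print(status_mix([200, 200, 304, 500, 200], sampling_rate=10))
# {'2xx': 30, '3xx': 10, '5xx': 10}
```

Because only samples cross the wire, this stays cheap even when the farm serves enormous request volumes.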
Solves The Java JMX Problem
There are a few other things that have me excited about sFlow, like the fact that it solves the JVM monitoring problem. Ops folks always want to know how their Tomcat or JBoss servers are running. You can buy fancy tools from Oracle to do this, or you can use the jmx-sflow-agent. Typically, we solve this problem either by firing up a tool like check_jmx, which basically spawns a whole JVM each and every time it needs to check a metric *shudder*, or by running a long-running Java process that we must constantly update with a list of servers to poll in order to get graphs of our heap sizes.
Alternatively, you can run jmx-sflow-agent, which is loaded via a -javaagent argument on the JVM command line, and have all your JVMs automatically send their metrics to a central location the moment they start.
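Launching it looks something like the following; the jar path and application name are placeholders, so check the jmx-sflow-agent project for the actual artifact name:

```
java -javaagent:/opt/sflow/jmx-sflow-agent.jar -jar myapp.jar
```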
When applications start up, they immediately begin sending their data via sFlow to a central location; there is no polling. This is the same push model used by next-generation monitoring tools like Ganglia and Graphite, and it is cloud friendly.
Imagine you were Netflix, running thousands of instances on EC2. Would you rather update your config files every few seconds to keep your monitoring systems aware of all the hosts being provisioned or destroyed, or would you like new hosts to simply appear in your monitoring systems as they come up? At Tagged, we would be constantly updating config files every time a disk failed, a tier was expanded, or a new one was provisioned, and we would have to specify in those files which hosts were running Java, Memcached, Apache, or some combination of the three.
Instead, in our world, if an application is running on a host, we instantly see that application in our monitoring tools. Deploying mod-sflow to your Apache servers is as simple as creating an RPM and putting a few lines in Puppet. Awesome.
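In Puppet, that deployment is only a few resources. This is a hypothetical sketch: the package name, config path, and module line are placeholders rather than our actual manifests:

```
package { 'mod_sflow':
  ensure => installed,
}

file { '/etc/httpd/conf.d/sflow.conf':
  ensure  => file,
  content => "LoadModule sflow_module modules/mod_sflow.so\n",
  notify  => Service['httpd'],
}
```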
sFlow’s relationship with the host side of the equation is just picking up steam now. We’ve been lucky enough to be at the leading edge of this, mostly through my giving my LSPE Meetup talk on the right day, at the right time. In the coming weeks, we hope to share more with the world about what we’re getting from using sFlow on our network – why we are loving it and what problems it’s helped us solve.
Dave Mangot is a Senior Systems Administrator at Tagged and you can follow him on his blog.