Citrix NetScalers and the netscaler-tool

The netscaler-tool is a Python script that leverages the Citrix NetScaler Nitro API. If you need to easily discover when something goes wrong with either the NetScaler or its services, you can use netscaler-tool to integrate NetScaler statistics into your existing open source monitoring and alerting projects.

In most large-scale web companies, load balancers (such as the Citrix NetScaler) allow for server failure and traffic increases without any harm to services or the need to page anyone. Keep in mind that this is only true when capacity planning was done correctly, so that there is enough headroom to absorb these events. Eventually, dead servers get fixed and/or capacity is added to tiers that are running hot, but that can all be addressed at leisure.

Before the netscaler-tool, monitoring and managing NetScalers might have consisted of creating a Linux system role user, using SSH with a private key that has an empty passphrase, and parsing command output. Some may even have used SNMP. But these are old methods that have been superseded by APIs. Why live in the past and make things harder than they should be?

The NetScaler offers a few management methods, including several APIs (SOAP and Nitro). The SOAP API seemed like an improvement over SSH, but it was not easy to use. Even if you figured out how to code around the API, you might need to generate new NSConfig.wsdl or NSStat.wsdl files from time to time, in order to add a new feature and/or decrease compilation time and program size. Luckily, the Nitro API is much more user-friendly: messages are in JSON format and there is no need to tweak anything on the NetScaler.
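To give a feel for what a JSON-based Nitro call looks like, here is a minimal sketch using only the Python standard library. It is not taken from the netscaler-tool source. The `/nitro/v1/stat/<resource>` path, the `X-NITRO-USER`/`X-NITRO-PASS` headers, and the `errorcode`/`message` response fields reflect our understanding of the Nitro v1 REST interface and should be checked against the documentation for your firmware; the function names, host name, and sample response are hypothetical.

```python
import json
import urllib.request


def build_stat_request(host, resource, user, password):
    """Build (but do not send) a GET request for one Nitro stat resource.

    Assumption: Nitro v1 exposes statistics under /nitro/v1/stat/<resource>
    and accepts credentials in the X-NITRO-USER / X-NITRO-PASS headers.
    """
    url = f"https://{host}/nitro/v1/stat/{resource}"
    headers = {
        "X-NITRO-USER": user,
        "X-NITRO-PASS": password,
        "Accept": "application/json",
    }
    return urllib.request.Request(url, headers=headers, method="GET")


def parse_stat_response(body, resource):
    """Decode a Nitro JSON reply and return the entries for `resource`.

    Assumption: replies carry an `errorcode` (0 on success), a `message`,
    and the requested data under a key named after the resource.
    """
    payload = json.loads(body)
    if payload.get("errorcode", 0) != 0:
        raise RuntimeError(payload.get("message", "unknown Nitro error"))
    return payload.get(resource, [])


# Hypothetical host and credentials, for illustration only.
request = build_stat_request("ns1.example.com", "lbvserver", "monitor", "secret")

# A made-up reply in the assumed shape; a real one has many more fields.
sample_body = '{"errorcode": 0, "message": "Done", "lbvserver": [{"name": "web_vip"}]}'
entries = parse_stat_response(sample_body, "lbvserver")
```

Sending the request would be a matter of passing `request` to `urllib.request.urlopen` and feeding the response body to `parse_stat_response`. Compared with the SOAP workflow, there are no WSDL files to regenerate.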

Using the netscaler-tool, anyone can currently monitor/alert on:

  • Node failovers
  • SSL certificate expirations
  • Differences between running and saved configurations
  • Surge queuing on backend services
  • Critical events (e.g. Packet CPU increase, hardware failure)

Future releases will also allow fetching any of the NetScaler node statistics, which include:

  • HTTP requests/s
  • Inbound/outbound IP packets/s
  • Inbound/outbound throughput
  • Current client connections

Here are some statistics from one of our NetScaler pairs that would benefit from this future release. Note: these statistics are based on both internal and external traffic:

  1. HTTP requests/s: ~76k
  2. Inbound IP packets/s: ~990k
  3. Outbound IP packets/s: ~920k
  4. Outbound throughput: ~4.8Gb/s
  5. Inbound throughput: ~4.8Gb/s

Although the tool is currently geared toward fetching data from NetScalers, it can also change a few NetScaler objects; for example, services and load-balancing virtual servers. These abilities enable operators to gracefully stop and start tiered services without affecting clients. One use case could be new web releases. Assuming orchestration code has already been written for code deployment and bouncing of web servers, one could leverage the tool to:

→ disable a web service with a delay to give clients enough time to complete their current request,

→ stop the web server,

→ start the web server,

→ make another call to the tool to enable the service,

→ and then move on to the next web server.
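The steps above can be sketched as a small orchestration loop. This is only an illustration under stated assumptions: the four callables (`disable_service`, `stop_server`, `start_server`, `enable_service`) are hypothetical placeholders for whatever your deployment code and your netscaler-tool invocation actually do, and the choice to wait out the drain delay before stopping the server is ours, not something the tool mandates.

```python
import time


def rolling_restart(servers, disable_service, stop_server, start_server,
                    enable_service, drain_seconds=30, sleep=time.sleep):
    """Restart web servers one at a time behind a load balancer.

    All four callables are hypothetical hooks supplied by the caller:
      disable_service(server, delay) -> ask the NetScaler to drain the service
      stop_server(server)            -> stop the web server process
      start_server(server)           -> start the web server process
      enable_service(server)         -> put the service back in rotation
    """
    for server in servers:
        # Disable with a delay so clients can finish in-flight requests.
        disable_service(server, drain_seconds)
        # Assumption: wait out the delay before touching the web server.
        sleep(drain_seconds)
        stop_server(server)
        start_server(server)
        enable_service(server)
        # The loop then moves on to the next web server.


# Dry run that records the order of operations instead of doing real work.
calls = []
rolling_restart(
    ["web01", "web02"],
    disable_service=lambda s, d: calls.append(("disable", s, d)),
    stop_server=lambda s: calls.append(("stop", s)),
    start_server=lambda s: calls.append(("start", s)),
    enable_service=lambda s: calls.append(("enable", s)),
    drain_seconds=10,
    sleep=lambda seconds: None,  # skip real waiting in the dry run
)
```

Passing the actions in as callables keeps the loop independent of any particular command-line syntax, so the same sequence can be dry-run before it is pointed at production.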

If, for whatever reason, you decide to write your own tool instead of using the netscaler-tool, and don’t want to deal with the Nitro API, you can use the included netscalerapi module.

For more information and source code, see the GitHub repository: https://github.com/tagged/netscaler-tool

Using netscaler-tool, we’re able to submit data to Graphite and manipulate NetScaler performance statistics to build better monitoring and alerting.
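As a rough sketch of the Graphite side, the snippet below formats statistics for Graphite's plaintext protocol (one `path value timestamp` line per metric, by default on TCP port 2003). It is not taken from our production setup: the metric prefix, metric names, and function names are hypothetical, and the sample value is simply the approximate HTTP request rate quoted earlier.

```python
import socket
import time


def format_graphite_lines(prefix, stats, timestamp=None):
    """Turn a {name: value} mapping into Graphite plaintext lines."""
    ts = int(time.time() if timestamp is None else timestamp)
    return [f"{prefix}.{name} {value} {ts}"
            for name, value in sorted(stats.items())]


def send_to_graphite(lines, host, port=2003, timeout=5.0):
    """Send pre-formatted lines to a carbon plaintext listener over TCP."""
    payload = ("\n".join(lines) + "\n").encode("ascii")
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(payload)


# Hypothetical metric names; the timestamp is fixed so the output is stable.
lines = format_graphite_lines(
    "netscaler.pair1",
    {"http_requests_per_sec": 76000},
    timestamp=1700000000,
)
# lines == ["netscaler.pair1.http_requests_per_sec 76000 1700000000"]
```

Calling `send_to_graphite(lines, "graphite.example.com")` would deliver the data, assuming a carbon listener at that (hypothetical) address.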

The netscaler-tool allows us to keep tabs on the largest traffic generators that impact our backend and frontend services.

Alex Kaplan contributed to this post.

Brian Glogower

Brian Glogower is a senior systems administrator on the site operations team at Tagged.

Spartan: How a Team of 6 Manages 1,000 Servers

At Site Operations at Tagged, our goal isn’t 99.999% uptime; it’s enabling developers to make changes fast. Although we want to help our developers try out new things as quickly as possible, as many seasoned IT professionals know, changes make systems break.

To cope with this, we’ve developed an internal IT culture that lets just six people manage a network of over one thousand servers with minimal burnout. This post is the first in a miniseries on how we run one of the world’s largest websites with a minimalist approach.

At the core of our approach is how we play it fast and loose with the SiteOps rotation, literally. We have three people who rotate among three roles: managing user requests, handling server events, and downtime to work on projects. A typical request might be to deploy a new version of an app to the servers and to support the developers during development. This is a DevOps kind of role, where the developer and the Request Manager coordinate development from the inception of a project. Events covers managing the pager, handling broken servers and hardware, troubleshooting, and problem solving. Projects covers long-term projects, many of which are automation oriented.

To prevent burnout, the rotation is very fast: one week for each role. It’s very rare for two major incidents to happen in one week, and that prevents us from getting that “Oh no, not again!” feeling that is so well known to SysAdmins. The schedule is also very loose: we often horse-trade roles, tit for tat. When a SiteOps engineer is tired and in danger of burnout, we’re very proactive about stepping in and taking over the role for a few hours or a day when needed. Working in a startup gives us this flexibility without complicated policies.

Another technique we use is constant automation. Aside from our long-term quarterly goals, our SiteOps engineers are encouraged to automate everything that takes up a lot of time. The Projects week is distraction-free time for analyzing the previous two weeks and for concentrated development. This could mean automating routine user requests (for example, creating a user provisioning tool that HR and hiring managers can use) or automating taking a thread dump and restarting a Tomcat server when it runs out of memory. We also integrate automation into requests: all our developer requests for changes on servers are managed with Puppet where possible. Guaranteeing this time, even though it’s in fast rotation, helps us focus on quick solutions rather than complicated ones that can’t keep up with the fast pace at Tagged.

While some aspects of loose communication don’t always scale to larger organisations, this setup enables us to grow our site without the need for a large team. Fortunately, the growth of our site has been steady, which gives us plenty of time for planning. As the number of users and apps goes up, focusing on these two techniques lets us improve the site while reducing the amount of human effort per server. This enables the entire SiteOps team to work with the systems at a very high level, and it’s a great intellectual challenge.

Make everything as simple as possible but not simpler. -Albert Einstein

Yaakov Nemoy is a Systems Administrator at Tagged.