At Site Operations at Tagged, our goal isn’t 99.999% uptime; it’s enabling developers to make changes fast. We want to help our developers try out new things as quickly as possible, but as many seasoned IT professionals know, changes make systems break.
To cope with this, we’ve developed an internal IT culture that helps us manage a network of over one thousand servers with just six people and minimal burnout. This post is the first in a miniseries on how we run one of the world’s largest websites with a minimalist approach.
At the core of our approach is how we play it fast and loose with the SiteOps rotation, literally. We have three people who rotate among three roles: managing user requests, handling server events, and downtime to work on projects. A typical request might be to deploy a new version of an app to the servers and to support the developers as they build it. This is a DevOps kind of role, where the developer and the Request Manager coordinate development from the inception of a project. Events covers managing the pager, handling broken servers and hardware, troubleshooting, and problem solving. Projects covers long-term projects, many of which are automation-oriented.
To prevent burnout, the rotation is very fast: one week for each role. It’s very rare for two major incidents to happen in one week, and that prevents us from getting that “Oh no, not again!” feeling that is so well known to SysAdmins. The schedule is also very loose: we often horse-trade roles, tit for tat. When a SiteOps engineer is tired and in danger of burnout, we’re very proactive about stepping in and taking over the role for a few hours or a day when needed. Working in a startup gives us this flexibility without complicated policies.
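For the curious, a weekly rotation like this is simple enough to sketch in a few lines of Python. This is only an illustration, not our actual tooling: the role order (Requests, then Events, then Projects), the function name `rotation_schedule`, and the engineer names are all assumptions made for the example.

```python
from collections import deque

# Role names follow the post; the order in which one person moves
# through them (Requests -> Events -> Projects) is an assumption.
ROLES = ["Requests", "Events", "Projects"]


def rotation_schedule(engineers, weeks):
    """Return one {role: engineer} mapping per week.

    Each engineer holds a role for one week, then moves on to the
    next role in ROLES.
    """
    if len(engineers) != len(ROLES):
        raise ValueError("need exactly one engineer per role")
    order = deque(engineers)
    schedule = []
    for _ in range(weeks):
        schedule.append(dict(zip(ROLES, order)))
        # Shift everyone one slot to the right, so last week's
        # Requests person takes Events this week, and so on.
        order.rotate(1)
    return schedule


if __name__ == "__main__":
    # Hypothetical names, purely for illustration.
    for week, roles in enumerate(rotation_schedule(["ana", "ben", "cy"], 3), 1):
        print(week, roles)
```

Horse trading then amounts to swapping two values in a given week’s mapping, which is why the loose part needs no policy at all.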
Another technique we use is constant automation. Aside from our long-term quarterly goals, our SiteOps engineers are encouraged to automate everything that takes up a lot of time. The Projects week is a distraction-free time for analyzing the previous two weeks and doing concentrated development. This could be automating routine user requests, for example creating a user provisioning tool that HR and hiring managers can use, or automating taking a thread dump and restarting a Tomcat server when it runs out of memory. We also integrate automation into requests: all our developer requests for changes on servers are managed with Puppet where possible. Guaranteeing this time, even in a fast rotation, helps us focus on quick solutions rather than complicated ones that can’t keep up with the fast pace at Tagged.
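To give a flavor of the thread dump example, here is a minimal Python sketch. It is not our production script. It assumes the out-of-memory condition shows up as `java.lang.OutOfMemoryError` in the Tomcat log, that sending SIGQUIT to the JVM makes it print a thread dump to its standard output, and that the restart command is supplied by the caller. The function names and the `service tomcat restart` command in the usage comment are hypothetical.

```python
import subprocess

# Text the JVM logs when it runs out of memory.
OOM_MARKER = "java.lang.OutOfMemoryError"


def saw_out_of_memory(log_lines):
    """Return True if any log line reports an OutOfMemoryError."""
    return any(OOM_MARKER in line for line in log_lines)


def recovery_commands(pid, restart_cmd):
    """Build the commands to run: thread dump first, then restart.

    SIGQUIT asks the JVM to write a thread dump to its stdout.
    restart_cmd is site specific and passed in by the caller.
    """
    return [
        ["kill", "-s", "QUIT", str(pid)],
        list(restart_cmd),
    ]


def recover_if_needed(log_lines, pid, restart_cmd, run=subprocess.run):
    """Run the recovery commands only if the log shows an OOM.

    Returns the commands that were run (empty list if none).
    The run parameter exists so the logic can be tried without
    touching a real server.
    """
    if not saw_out_of_memory(log_lines):
        return []
    commands = recovery_commands(pid, restart_cmd)
    for cmd in commands:
        run(cmd, check=True)
    return commands


# Example call (hypothetical pid and restart command):
# recover_if_needed(open("catalina.out"), 4242, ["service", "tomcat", "restart"])
```

A real version would probably wait for the dump to finish writing before restarting, and would be triggered by monitoring rather than by hand.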
While some aspects of loose communication don’t always scale to larger organisations, this setup enables us to grow our site without the need for a large team. Fortunately, the growth of our site has been steady, which gives us plenty of time for planning. As the number of users and apps goes up, by focusing on these two techniques, we’re able to improve the site while reducing the amount of human effort per server. This enables the entire SiteOps team to work with the systems at a very high level, and it’s a great intellectual challenge.
Make everything as simple as possible but not simpler. -Albert Einstein
Yaakov Nemoy is a Systems Administrator at Tagged.