Anomaly Detection at Monitorama

Figuring out when something has gone wrong with your app or site is extremely difficult. Anomaly detection was a major theme among speakers this year at Monitorama, an open source monitoring conference. You can build trends from historical data, and those trends can be extrapolated into predictions of traffic patterns. When live traffic deviates from the prediction, you can then try to determine whether the deviation is a true anomaly.
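As a rough illustration of predicting from historical data, the sketch below builds a per-hour baseline from past days and flags a live value that strays too far from it. The function names, the three-standard-deviation tolerance, and the sample data are all assumptions made for this example, not something presented at the conference.

```python
from statistics import mean, stdev

def build_baseline(history):
    """Summarize past traffic as a (mean, standard deviation) pair per hour.

    `history` is a list of days, each a list of 24 hourly request counts.
    """
    baseline = []
    for hour in range(24):
        samples = [day[hour] for day in history]
        baseline.append((mean(samples), stdev(samples)))
    return baseline

def is_anomalous(baseline, hour, observed, tolerance=3.0):
    """True when `observed` is more than `tolerance` deviations from the prediction."""
    expected, spread = baseline[hour]
    # Floor the spread so a perfectly flat history does not flag every value.
    return abs(observed - expected) > tolerance * max(spread, 1.0)

# Hypothetical history: three days of traffic that ramps up through the day.
history = [
    [100 + 10 * hour + offset for hour in range(24)]
    for offset in (-5, 0, 5)
]
baseline = build_baseline(history)

print(is_anomalous(baseline, 12, 222))  # close to the predicted noon value
print(is_anomalous(baseline, 12, 90))   # far below the predicted noon value
```

With this data the first check stays quiet and the second one fires. Whether a flagged point is a true anomaly still needs a human or a second signal to decide.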

One of the hardest problems in anomaly detection systems is avoiding false positives: you don’t want to be woken up at 2 a.m. to fix a problem when nothing is actually broken in the product. Too many false positives lead to a phenomenon called “alert fatigue,” where the on-call developer starts ignoring noisy notifications and real events slip through undetected.

It is important to create actionable alerts around whether or not work is getting done. A spike in CPU is not directly actionable, whereas a 95 percent full hard drive can be fixed immediately.

The simplest approach to detecting problems in production is setting a threshold and sending a notification when traffic dips below it. Seasonality and daily traffic patterns make a fixed threshold annoying to maintain and prone to false positives. Methods used in the financial industry for detecting stock market trends can be repurposed for detecting anomalies in production traffic.
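To see why a fixed threshold struggles with daily patterns, consider this minimal sketch. The traffic shape, the threshold of 250, and the function name are hypothetical values chosen for illustration.

```python
import math

def below_threshold(series, threshold):
    """Return the indices at which traffic dips below a fixed threshold."""
    return [i for i, value in enumerate(series) if value < threshold]

# Hypothetical healthy day: hourly request rates that peak at noon
# and bottom out at midnight. Nothing is broken here.
healthy_day = [
    500 + 400 * math.sin(2 * math.pi * (hour - 6) / 24)
    for hour in range(24)
]

alerts = below_threshold(healthy_day, threshold=250)
print(alerts)  # the hours that would page someone
```

Every alert in this run lands in the overnight trough of an otherwise healthy day. Lowering the threshold silences those pages but also hides a real daytime outage, which is the maintenance burden described above.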

The Fast Fourier Transform decomposes a graph into the series of frequencies that it is made of. A low-pass filter can then remove the high-frequency noise, providing a stable baseline to compare against historical data. The Kolmogorov-Smirnov test takes two cumulative distribution functions (CDFs) and calculates the maximum distance between them. Transforming a section of a time series into a CDF is relatively straightforward. A threshold set on this distance is more stable than one set on raw traffic, and again limits the impact of noise.
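Both ideas can be sketched with the Python standard library alone. To stay dependency-free, the low-pass filter below uses a naive discrete Fourier transform rather than a true FFT; it produces the same spectrum, only more slowly, and libraries such as NumPy and SciPy offer faster equivalents. The function names, the cutoff, and the sample windows are assumptions made for this example.

```python
import cmath
import math
from bisect import bisect_right

def low_pass(series, keep):
    """Zero every frequency above `keep` cycles per window, then invert.

    Naive O(n^2) discrete Fourier transform, used here in place of an FFT.
    """
    n = len(series)
    spectrum = [
        sum(series[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]
    # Bins k and n - k describe the same frequency, so keep both sides.
    kept = [c if min(k, n - k) <= keep else 0 for k, c in enumerate(spectrum)]
    return [
        (sum(kept[k] * cmath.exp(2j * math.pi * k * t / n) for k in range(n)) / n).real
        for t in range(n)
    ]

def ks_distance(sample_a, sample_b):
    """Maximum vertical distance between the empirical CDFs of two samples."""
    a, b = sorted(sample_a), sorted(sample_b)
    distance = 0.0
    for x in a + b:
        cdf_a = bisect_right(a, x) / len(a)  # fraction of a that is <= x
        cdf_b = bisect_right(b, x) / len(b)
        distance = max(distance, abs(cdf_a - cdf_b))
    return distance

# Hypothetical signal: one slow cycle plus fast alternating noise.
n = 16
trend = [10 + 5 * math.sin(2 * math.pi * t / n) for t in range(n)]
noisy = [value + (1 if t % 2 == 0 else -1) for t, value in enumerate(trend)]
smoothed = low_pass(noisy, keep=2)

# Hypothetical traffic windows to compare against last week.
last_week = [100, 102, 98, 101, 99, 103, 97, 100]
today_ok = [101, 99, 100, 102, 98, 100, 103, 97]
today_bad = [60, 62, 58, 61, 59, 63, 57, 60]

print(ks_distance(last_week, today_ok))   # same values, different order
print(ks_distance(last_week, today_bad))  # no overlap at all
```

In this toy data the filter recovers the slow cycle, the reordered window scores a distance of 0.0, and the collapsed window scores 1.0. An alert would fire when the distance crosses a chosen cutoff somewhere between those extremes.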

When it comes to monitoring production systems, collecting the data is a first step, but analyzing it is the important part.

bcook@tagged.com

Barrett Cook

Barrett Cook is an engineering manager on the web team at Tagged.
