Even though it seems that data is driving everything nowadays, humans are still behind the wheel. Raw data neither interprets itself nor provides its own context. In all our talk about “data-driven decisions”, in business or anywhere else, we must remember that it is people who ask the questions that are teased out of the numbers, and human eyes that ultimately dig into the graphs and plots.
This is especially true of many methods of outlier detection: the process of finding the unusual deviations in a data set which sometimes can tell a very important story. For an ecommerce site, for example, a single spike in online purchases can indicate hundreds of thousands of dollars in lost revenue due to a temporary price glitch. On the other hand, a similar spike in social media mentions of a product could be an opportunity to generate more revenue if an aggressive marketing campaign can capitalize on the free publicity and convert likes into buys.
In either case, these blips must first be found before either company can act on them.
Like a surgeon: a brief overview of outlier detection methods
For many data scientists, finding outliers in a data set isn’t as easy as clicking a “Find Outliers” button. Instead, determining which values in a data set are outliers is an iterative process, because outlier detection algorithms are much like surgery implements: specialized tools meant to be used by talented and trained people in a setting that prioritizes precision and accuracy over rapid results.
Why are algorithms for determining outliers like scalpels and bone saws? Because in both cases, what you’re working with determines which tools you can use. Identifying outliers by some threshold distance from a cluster centroid, for example, is only appropriate for data where natural clustering is to be expected (like a scatter plot of resident age vs. income for a university town). You wouldn’t apply k-means clustering to a time series.
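As a minimal sketch of the centroid-distance idea, here is one way it could look in Python. All names, the sample data, and the cutoff of twice the mean distance are illustrative assumptions, not a standard recipe:

```python
import math

def centroid(points):
    """Mean of each coordinate across all points."""
    n = len(points)
    return tuple(sum(p[i] for p in points) / n for i in range(len(points[0])))

def centroid_outliers(points, k=2.0):
    """Flag points whose distance from the cluster centroid exceeds
    k times the mean distance (a hypothetical cutoff for illustration)."""
    c = centroid(points)
    dists = [math.dist(p, c) for p in points]
    cutoff = k * (sum(dists) / len(dists))
    return [p for p, d in zip(points, dists) if d > cutoff]

# Hypothetical (age, income in $1k) points for a university town --
# mostly students, plus one faculty member who stands apart:
data = [(19, 12), (20, 11), (21, 13), (22, 14), (45, 90)]
print(centroid_outliers(data))  # -> [(45, 90)]
```

Note that this only works because the data naturally forms a tight cluster; on a trending time series, the “centroid” would be meaningless.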
And that’s just the tip of the iceberg: selecting the right outlier detection method is a multi-dimensional decision:
- Robust vs. limited: Some algorithms can handle many different types of data sets; others work well for only one or two.
- Supervised machine learning vs. unsupervised machine learning: Are you able to gather and label a large training set of data to cover every known category, or can you live with an algorithm that learns on its own, but needs some time (and data) to do so?
- Online vs. offline: Do you need real-time results, or do you want the ability to re-examine past data and modify the labeling of past data points (perhaps that blip two weeks ago really wasn’t an outlier)?
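To make the online/offline distinction concrete, here is a minimal sketch of an online detector: it judges each point once, in arrival order, and can never go back and relabel the past. The class name, window size, and z-score threshold are all illustrative assumptions:

```python
from collections import deque
import statistics

class OnlineZScoreDetector:
    """Online outlier flagging: each new point is judged once against
    a sliding window of recent values. An offline method could instead
    revisit the full history and revise earlier labels."""

    def __init__(self, window=20, z=3.0):
        self.window = deque(maxlen=window)  # recent history only
        self.z = z                          # illustrative cutoff

    def observe(self, x):
        """Return True if x looks like an outlier vs. the window."""
        flagged = False
        if len(self.window) >= 2:
            mean = statistics.fmean(self.window)
            sd = statistics.stdev(self.window)
            flagged = sd > 0 and abs(x - mean) / sd > self.z
        self.window.append(x)  # the label for x is now final
        return flagged

detector = OnlineZScoreDetector(window=20, z=3.0)
stream = [10, 11, 10, 12, 11, 10, 50]
print([detector.observe(x) for x in stream])
# -> [False, False, False, False, False, False, True]
```

The trade-off is visible in the last line of `observe`: once a point enters the window, its label is frozen, which is exactly what “real-time results” costs you.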
Software implementations add a further wrinkle: the best implementations of many methods are specific to a particular programming language library.
Specialization and automation are the way forward
Summary statistics about a data set can be misleading, requiring visualization and human analysis to get the complete picture. Often that analysis is required for selecting the best outlier detection algorithm for a given data set. In certain niches, however, like calculating outliers in time series data of business metrics, fast and accurate outlier finding can be a reality, if machine learning is employed.
As more and more companies ingest and attempt to make use of more and more data, automated outlier detection will become increasingly necessary. One reason is the limited time allotted for finding and acting on detected outliers as ecommerce, public health, fintech, adtech, industrial automation, political campaigns and even nonprofits try to keep up in a rapidly changing world. A very limited supply of analyst attention meets ever-increasing demand.
Quickly and accurately determining outliers is a hard problem, even for a restricted use case like spotting anomalies in KPIs, but the payoff is worth it, because manually set static thresholds (like those used in traditional BI tools) are insufficient.
One key reason is that metrics often signal a change long before they cross a threshold – if you spotted that change right when it happened, you could stop a problem in its tracks. Another reason is that an alert will be triggered either by a single errant data point or by a sustained uptick (or downtick). It’s up to your analysts to determine whether any of those are significant enough to analyze further, but in the meantime, each is another raindrop in the alert storm.
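A toy contrast makes the first point concrete. The fixed threshold, the 5% drop rule, and the data below are all made up for illustration; real change detection is far more sophisticated, but even this crude check reacts four points earlier:

```python
def static_threshold_alerts(series, threshold):
    """Traditional BI-style alerting: fire only when a value
    finally crosses a fixed, manually set threshold."""
    return [i for i, x in enumerate(series) if x < threshold]

def change_alerts(series, pct=0.05):
    """Toy stand-in for change detection: flag any point that drops
    more than pct below the previous value."""
    return [i for i in range(1, len(series))
            if series[i] < series[i - 1] * (1 - pct)]

# A metric that starts degrading at index 3 but only crosses the
# hypothetical static threshold of 50 at index 7:
metric = [100, 101, 99, 90, 80, 70, 60, 45]
print(static_threshold_alerts(metric, 50))  # -> [7]
print(change_alerts(metric))                # -> [3, 4, 5, 6, 7]
```

The static rule stays silent through four steps of steady decline; the change check complains at the first one.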
A third big problem is false negatives: legitimate outliers slipping through. In a way, these are worse than false positives: with false positives, you know right away that you need to adjust your threshold, because you are being notified of “outliers” that actually aren’t. With false negatives, you never receive a notification that one slipped by, leading you to erroneously conclude that a particular time series is behaving normally. Meanwhile, you could be losing thousands of dollars per minute to that price glitch.
Remember, those eyes can only squint for so long.