Anomaly detection explained for beginners
Anomaly detection is a powerful technique for identifying critical outliers in a set of data. It is particularly useful when you are trying to spot a rare but mission-critical event: for instance, unusual transactions on someone’s bank account, or a problem with an important piece of machinery before it fails. In this blog, we look at the history and basics of anomaly detection and show how anyone can leverage it.
The origins of anomaly detection
Data science powers some of the most impressive applications of technology in the modern world. It underpins all those things branded ‘big data’, ‘machine learning’, or ‘artificial intelligence’. Over the past decades, it has given us some incredibly useful techniques for analyzing and interpreting data. In turn, these techniques allow us to construct powerful machine learning models. You can then use such models to power AI systems to solve important real-world problems.
Some key data science techniques
Data science is primarily about making usable observations about large sets of data. By ‘usable observations’ I mean identifying features, patterns, and anomalies in the data. In turn, you can use these observations to understand and leverage the data.
In data science, a feature is a specific aspect of a data set that you can quantify in some form. For instance, in a set of accounts, it could be the amount spent. Often, these features may not be so clear-cut. Or you may find that they are incomplete. Feature engineering allows you to clean up the dataset, discarding some features, combining others.
Datasets often exhibit clear patterns. Sometimes, a human can easily spot these. After all, our brain is remarkably good at pattern recognition, so good in fact that we are prone to spotting patterns where none exist. But how do you spot patterns in enormous datasets? Data science offers several techniques, depending on how much you know about the data. For instance, if you know very little, you can turn to an unsupervised technique such as K-means clustering to find natural groupings in the data. Or, if you know more about the data and have labeled examples, you can use one of the myriad forms of supervised learning to find the patterns.
An anomaly is a data point that fulfills two key criteria. Firstly, its values differ markedly from normal data values. And secondly, it only occurs very rarely in the dataset. You generally classify anomalies as either univariate or multivariate. Univariate anomalies relate to a single data feature. Multivariate anomalies exist across multiple features. While you can sometimes do it, detecting anomalies by hand is extremely hard, especially if you are working with time-series data.
The basics of anomaly detection
Anomaly detection is a whole field of data science by itself, so all I can show you here are the basics. First, I need to explain the different types of anomaly you can find.
Point anomalies happen when a single data point is anomalous. This is the classic outlier on a graph. In its simplest form, this is a univariate anomaly.
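To make the idea of a point anomaly concrete, here is a minimal sketch that flags values sitting far from the median. It assumes a "modified z-score" based on the median absolute deviation, with the commonly used cutoff of 3.5. The function name and the transaction amounts are invented for illustration.

```python
from statistics import median

def point_anomalies(values, threshold=3.5):
    """Return the values whose modified z-score exceeds the threshold."""
    med = median(values)
    # Median absolute deviation: a measure of spread that outliers barely affect.
    mad = median(abs(v - med) for v in values)
    if mad == 0:
        return []  # no spread to measure against (over half the values are identical)
    # 0.6745 rescales the MAD so the score is comparable to an ordinary z-score.
    return [v for v in values if 0.6745 * abs(v - med) / mad > threshold]

# Hypothetical card transactions in dollars; 950.0 is the planted outlier.
amounts = [12.5, 8.0, 15.2, 9.9, 11.3, 14.8, 10.1, 950.0, 13.4, 9.5]
print(point_anomalies(amounts))
```

I use the median rather than the mean here because a single huge value would drag the mean and standard deviation toward itself and partly hide the outlier.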
Contextual anomalies require knowledge of the surrounding context. That is, they may only be anomalous under some circumstances. These can be either univariate or multivariate.
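One simple way to sketch a contextual check is to keep a separate baseline for each context and judge every new value only against its own baseline. The example below assumes each reading arrives as a (context, value) pair. The "day" and "night" electricity readings are made up, and the cutoff of 3 standard deviations is just a rule of thumb.

```python
from collections import defaultdict
from statistics import mean, stdev

def contextual_anomalies(history, new_readings, threshold=3.0):
    """Flag new (context, value) pairs that are unusual for their own context."""
    by_context = defaultdict(list)
    for context, value in history:
        by_context[context].append(value)

    flagged = []
    for context, value in new_readings:
        baseline = by_context.get(context, [])
        if len(baseline) < 2:
            continue  # not enough history to judge this context
        mu, sigma = mean(baseline), stdev(baseline)
        if sigma > 0 and abs(value - mu) / sigma > threshold:
            flagged.append((context, value))
    return flagged

# Hypothetical hourly electricity use in kWh, labeled by time of day.
history = [("day", 4.8), ("day", 5.1), ("day", 5.3), ("day", 4.9), ("day", 5.0),
           ("night", 0.9), ("night", 1.1), ("night", 1.0), ("night", 1.2), ("night", 0.8)]
new_readings = [("day", 5.0), ("night", 5.0)]
print(contextual_anomalies(history, new_readings))
```

The same value of 5.0 appears twice in the new readings, but only the night-time one is flagged, because it is only unusual in that context.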
Collective anomalies are more subtle. Each data point on its own may look normal, but taken together, the points show that something is odd.
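A rough sketch of one way to catch a collective anomaly is to look at a sliding window of readings rather than at single points. It assumes you have a baseline of healthy readings and that readings are roughly independent, so the mean of a window varies less than a single reading does. The function name, the window size, and the numbers are illustrative only.

```python
from math import sqrt
from statistics import mean, stdev

def collective_anomalies(baseline, series, window=5, threshold=3.0):
    """Return the start indexes of windows whose mean is unusual versus the baseline."""
    mu, sigma = mean(baseline), stdev(baseline)
    # The mean of `window` independent readings has a smaller spread than one reading.
    std_err = sigma / sqrt(window)
    starts = []
    for i in range(len(series) - window + 1):
        window_mean = mean(series[i:i + window])
        if abs(window_mean - mu) / std_err > threshold:
            starts.append(i)
    return starts

# Hypothetical sensor readings: no single value in `series` is extreme,
# but five slightly high readings in a row are suspicious.
baseline = [50, 52, 48, 51, 49, 50, 53, 47, 50, 50]
series = [50, 48, 49, 53, 53, 53, 53, 53, 48, 50]
print(collective_anomalies(baseline, series))
```

Every reading of 53 is less than two standard deviations from the baseline mean, so a point-by-point check would pass them all. Only the run of five in a row stands out.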
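To identify anomalies, you can use several techniques. One of the most common is called isolation forest (a form of unsupervised learning). Unsurprisingly (given the name), this is a tree-based method for anomaly detection. You start by choosing a feature at random and then a random split value between that feature’s minimum and maximum. This divides the data into two partitions. Next, you recursively subdivide each partition in the same way. You repeat this process until every partition holds just one data point or all the data points in it have the same value. Anomalies are the points that end up isolated after only a few splits, because they sit far away from everything else. Building many such random trees (the forest) and averaging the number of splits gives each point an anomaly score. This approach works for data with a single feature as well as for data with many features.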
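Here is a from-scratch sketch of the isolation idea, so you can see why anomalies get isolated quickly. It is a simplified illustration and not a production implementation. The scoring formula and the defaults (100 trees, samples of up to 256 points) follow my reading of the original isolation forest paper by Liu, Ting, and Zhou, and all function names and the test data are my own. For real work you would normally reach for a library such as scikit-learn, which ships an IsolationForest class.

```python
import math
import random

EULER_GAMMA = 0.5772156649

def average_path_length(n):
    """Expected path length of an unsuccessful search in a binary tree of n points."""
    if n <= 1:
        return 0.0
    if n == 2:
        return 1.0  # two distinct points take exactly one split to separate
    return 2.0 * (math.log(n - 1) + EULER_GAMMA) - 2.0 * (n - 1) / n

def build_tree(points, depth, max_depth, rng):
    """Split the points at random until they are isolated or the depth limit is hit."""
    if len(points) <= 1 or depth >= max_depth:
        return {"size": len(points)}
    feature = rng.randrange(len(points[0]))
    low = min(p[feature] for p in points)
    high = max(p[feature] for p in points)
    if low == high:
        return {"size": len(points)}  # simplification: stop if this feature cannot split
    split = rng.uniform(low, high)
    return {
        "feature": feature,
        "split": split,
        "left": build_tree([p for p in points if p[feature] < split], depth + 1, max_depth, rng),
        "right": build_tree([p for p in points if p[feature] >= split], depth + 1, max_depth, rng),
    }

def path_length(tree, point, depth=0):
    """Count the splits needed to reach the point's leaf, adjusted for unsplit leaves."""
    if "feature" not in tree:
        return depth + average_path_length(tree["size"])
    side = "left" if point[tree["feature"]] < tree["split"] else "right"
    return path_length(tree[side], point, depth + 1)

def anomaly_scores(points, n_trees=100, sample_size=256, seed=0):
    """Return one score per point; scores close to 1 suggest an anomaly."""
    if len(points) < 2:
        raise ValueError("need at least two points")
    rng = random.Random(seed)
    sample_size = min(sample_size, len(points))
    max_depth = math.ceil(math.log2(sample_size))
    trees = [build_tree(rng.sample(points, sample_size), 0, max_depth, rng)
             for _ in range(n_trees)]
    normalizer = average_path_length(sample_size)
    return [2 ** (-sum(path_length(t, p) for t in trees) / n_trees / normalizer)
            for p in points]

# Hypothetical data: 200 points scattered around the origin plus one planted outlier.
data_rng = random.Random(42)
points = [(data_rng.gauss(0, 1), data_rng.gauss(0, 1)) for _ in range(200)]
points.append((8.0, 8.0))
scores = anomaly_scores(points)
most_anomalous = max(range(len(points)), key=lambda i: scores[i])
print(points[most_anomalous], round(scores[most_anomalous], 2))
```

The planted point at (8.0, 8.0) should come out with the highest score, because a random split is very likely to cut it off from the main cloud within the first one or two steps.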
Many other approaches to anomaly detection rely on trying to identify all the clusters within the data. If you do it right, any data points that lie outside the clusters are anomalies. There are many clustering algorithms you can use for this, and it is an active research field.
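As one example of the cluster-based idea, the sketch below borrows the noise rule from the DBSCAN clustering algorithm. A point counts as a core point if it has at least min_pts points (itself included) within a radius of eps, and any point that is neither a core point nor within eps of one is treated as noise. I picked DBSCAN for this example myself. The function name, the eps and min_pts values, and the points are illustrative, and the sketch only finds the noise points without labeling the clusters.

```python
from math import dist

def dbscan_noise(points, eps, min_pts):
    """Return the points that DBSCAN would label as noise (outside every cluster)."""
    # For each point, the indexes of all points within eps (including itself).
    neighbors = [[j for j, q in enumerate(points) if dist(p, q) <= eps]
                 for p in points]
    core = {i for i, nearby in enumerate(neighbors) if len(nearby) >= min_pts}
    # Noise: not a core point and not within reach of any core point.
    return [p for i, p in enumerate(points)
            if i not in core and not any(j in core for j in neighbors[i])]

# Hypothetical 2-D data: two tight clusters and one stray point.
points = [(1.0, 1.0), (1.2, 0.9), (0.9, 1.1), (1.1, 1.2),
          (5.0, 5.0), (5.1, 4.9), (4.9, 5.2), (5.2, 5.1),
          (3.0, 9.0)]
print(dbscan_noise(points, eps=0.5, min_pts=3))
```

This brute-force version compares every pair of points, so it is only suitable for small datasets.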
Density-based approaches try to identify how dense the data is within a given neighborhood. If you assume anomalies lie outside of dense areas, you can use this approach to spot them. This requires you to score each potential outlier based on some measure, such as Euclidean distance. You can use several well-known techniques, including k-nearest neighbors and local outlier factor.
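The simplest density-style score I can sketch is the Euclidean distance from each point to its k-th nearest neighbor. Points in dense areas get small scores and isolated points get large ones. The function name, the choice of k, and the data are assumptions for illustration. Local outlier factor goes a step further by comparing each point’s density with that of its neighbors, which this sketch does not attempt.

```python
from math import dist

def knn_scores(points, k=3):
    """Score each point by the Euclidean distance to its k-th nearest neighbor."""
    if len(points) <= k:
        raise ValueError("need more than k points")
    scores = []
    for i, p in enumerate(points):
        # Distances from p to every other point, nearest first.
        distances = sorted(dist(p, q) for j, q in enumerate(points) if j != i)
        scores.append(distances[k - 1])
    return scores

# Hypothetical 2-D data: five points close together and one far away.
points = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1), (0.3, 0.3), (0.1, 0.1), (4.0, 4.0)]
scores = knn_scores(points)
outlier = max(range(len(points)), key=lambda i: scores[i])
print(points[outlier])
```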
Applications of anomaly detection
You can use anomaly detection to solve a whole range of business use cases. Let’s look at three examples.
Spotting card fraud
Credit and bank card fraud costs the economy billions each year. Spotting fraud is therefore big business. You have several ways to do this. For instance, you can look for point anomalies in the amount spent in a single transaction. A sudden high expenditure may indicate the card is being used fraudulently. Or you might use contextual anomaly detection. You might spot that a card is suddenly used to make a large number of transactions in a foreign country. Either the card owner is traveling, or the card has been stolen.
Identifying machinery that’s going to fail
In heavy industry and manufacturing, you need to identify potential machine failures before they happen. For instance, in many mining operations, you are reliant on pumps working 24/7. You can identify collective anomalies to spot potential failures. For instance, if the oil pressure, temperature, and engine vibration all increase together, it might mean the oil pump is failing.
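To illustrate the pump example, here is a rough sketch that combines several sensors into one score. It assumes you have healthy baseline readings for each sensor and that the sensors are roughly independent when the machine is healthy. The sensor names, the numbers, and the idea of summing the z-scores are my own choices for illustration, not a description of any particular monitoring product.

```python
from math import sqrt
from statistics import mean, stdev

def combined_drift(baseline, reading):
    """Return per-sensor z-scores and one combined score for a single reading.

    baseline: dict of sensor name -> list of healthy readings
    reading:  dict of sensor name -> latest value
    """
    z_scores = {}
    for sensor, healthy in baseline.items():
        z_scores[sensor] = (reading[sensor] - mean(healthy)) / stdev(healthy)
    # Summing signed z-scores rewards sensors that drift in the same direction.
    # Dividing by sqrt(n) keeps the result on the scale of a single z-score.
    combined = sum(z_scores.values()) / sqrt(len(z_scores))
    return z_scores, combined

# Hypothetical healthy readings and one new reading where every sensor is a bit high.
baseline = {
    "oil_pressure": [40, 41, 39, 40, 42, 38, 40],
    "temperature": [80, 82, 78, 81, 79, 80, 80],
    "vibration": [2.0, 2.2, 1.8, 2.1, 1.9, 2.0, 2.0],
}
reading = {"oil_pressure": 42.6, "temperature": 82.6, "vibration": 2.26}
z_scores, combined = combined_drift(baseline, reading)
print({name: round(z, 2) for name, z in z_scores.items()}, round(combined, 2))
```

In this made-up reading, each sensor is only about two standard deviations above its baseline, so none would trip a per-sensor alarm at three. The combined score is above three because all of them moved the same way at once.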
Detecting network intrusions
Another use case for collective anomalies is intrusion detection. Often, once a hacker is inside your network, they will try to copy as much data as possible. You might look for unusual patterns of data copies to identify this. Alternatively, you could use contextual anomalies to detect it. Typically, this might involve spotting that a given user is suddenly accessing data or systems they have never accessed before.
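The last idea, a user touching something they have never touched before, can be sketched with nothing more than a set per user. The sketch assumes your access log can be reduced to (user, resource) pairs. The user names and resources below are invented.

```python
def unfamiliar_accesses(history, new_events):
    """Return the (user, resource) events where the user has no prior access on record."""
    seen = {}
    for user, resource in history:
        seen.setdefault(user, set()).add(resource)
    # A user with no history at all has every access flagged.
    return [(user, resource) for user, resource in new_events
            if resource not in seen.get(user, set())]

# Hypothetical access log entries.
history = [("alice", "crm"), ("alice", "email"), ("bob", "email"), ("bob", "payroll")]
new_events = [("alice", "crm"), ("alice", "payroll"), ("bob", "email")]
print(unfamiliar_accesses(history, new_events))
```

A real system would also need to account for legitimate changes, such as a user moving to a new team, so treat each flag as a prompt for review rather than proof of an intrusion.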
How Sonasoft NuGene helps
If you want to create usable anomaly detection systems, you will find it time-consuming, hard, and expensive. Fortunately, Sonasoft NuGene is designed to solve exactly this sort of problem for you. NuGene is our industry-leading AI bot factory. At its heart lies a unified AI platform that can autonomously create machine learning models for you. If you ask a data scientist for an ML model, they can only create one at a time. But NuGene will try out dozens of different models to find the best one for the job. It then integrates this model into a proper autonomous bot that you can install within your system. You can create almost any type of bot with NuGene. Speak to us if you would like a demonstration of how NuGene can transform your mission-critical processes.