\chapter{Approaches to Streaming Anomaly Detection}
\label{c:approaches}
\citet{ahmad_unsupervised_2017} describe an ideal real-life anomaly detection algorithm as being able to: operate unsupervised and automated, predict in an online fashion, continuously learn and evolve, adapt to concept drift, detect anomalies as early as possible, and avoid misclassifications. Additionally, an ideal algorithm would have a low computational cost in space and time, be supported with strong performance guarantees, and have a minimal number of parameters \citep{bifet_new_2009}.
Based on the features introduced in \cref{c:features}, this chapter introduces and summarizes several anomaly detection approaches from existing research. \cref{tab:approach_overview} provides an overview of how the approaches align with each of the previously defined features, while the descriptive text later in this chapter offers a concise summary of each approach rather than covering every single feature.
\section{Probabilistic Approaches}
Probabilistic approaches learn a model of normality based on the distribution of the training data and compare all new test data against that distribution. If it is sufficiently improbable that the distribution has generated the new data (i.e., if it lies in a low-density area of the distribution), that test data is classified as anomalous. Probabilistic approaches have a strong mathematical foundation and are very effective if they manage to estimate the generative distribution. Furthermore, probabilistic methods work in a very space-efficient way in that only the learned distribution needs to be persisted \citep{pimentel_review_2014}.
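To make the density-thresholding idea concrete, the following minimal Python sketch (a simple parametric example; all constants are illustrative) fits a univariate Gaussian to training data and flags observations that fall into a low-density region of the learned distribution:

\begin{verbatim}
import numpy as np

def log_density(x, mu, sigma):
    # Log-probability density of x under a normal distribution N(mu, sigma^2).
    return -0.5 * ((x - mu) / sigma) ** 2 - np.log(sigma * np.sqrt(2 * np.pi))

train = np.random.normal(0.0, 1.0, size=10_000)   # "normal" training data
mu, sigma = train.mean(), train.std(ddof=1)       # learned model of normality

# Threshold: the 0.1% quantile of training log-densities (low-density cutoff).
threshold = np.quantile(log_density(train, mu, sigma), 0.001)

def is_anomaly(x):
    return log_density(x, mu, sigma) < threshold

print(is_anomaly(0.1), is_anomaly(5.0))  # False, True (with high probability)
\end{verbatim}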
A critical factor in the complexity of probabilistic approaches is the distinction between \emph{parametric} and \emph{nonparametric models}. The former are based on the assumption that a finite set of parameters derived from the training data can capture every detail of the data. While this results in simpler models, there is also a higher chance of the model not being able to fit the dataset due to high bias (e.g., by trying to fit a Gaussian normal distribution to a much more complex phenomenon). Contrarily, the latter approaches learn a distribution without constraints, with fewer assumptions, and with the capability of growing the model with the amount of data, resulting in the ability to learn very intricate patterns \citep{pimentel_review_2014}. Gaussian mixture models combine multiple Gaussian distributions to approximate the true distribution but require a large number of parameters that grows even faster in higher dimensions \citep{schneider_expected_2016}.
\input{03_overview_table}
\clearpage
\restoregeometry
Due to their potential for complexity, nonparametric approaches are also more prone to overfitting, meaning that they too tightly fit a complex pattern and, therefore, do not generalize to unseen data. Additionally, probabilistic approaches, especially nonparametric ones, tend to require more training data to get to a reasonable level of accuracy. When probabilistic approaches are applied to datasets with high dimensionality, the density estimation of the distribution can often become inaccurate due to the inherent sparsity of the dataset \citep{pimentel_review_2014, kontaki_continuous_2011}.
\subsection{Anomalous Payload-based Network Intrusion Detection \citep{hutchison_anomalous_2004}}
Traditional Intrusion Detection Systems (IDS) often work with signatures, which are attack patterns that have been observed before and that can be generalized and shared to detect further occurrences of similar attacks. However, zero-day attacks, which are the first occurrences of new threats, cannot be prevented by such signature-based IDS. The PAYL approach presented by \citet{hutchison_anomalous_2004} seeks to close the gap in early detection by applying anomaly detection to recognize new threats as soon as they appear.
PAYL builds a model of the typical network payloads based on their byte frequency (i.e., an abstraction of the content of requests) and compares the distance of new observations to that typical distribution using the Mahalanobis distance. If the Mahalanobis distance of an observation crosses a preset threshold (e.g., a number of standard deviations), that observation is marked as anomalous. In an online processing context, the model incrementally evolves based on new observations and a decay parameter that gradually phases out older observations.
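The scoring step can be sketched as follows; this is an illustrative Python reading of the method (the toy payloads and the smoothing constant are assumptions, not the authors' exact implementation), using a diagonal simplification of the Mahalanobis distance:

\begin{verbatim}
import numpy as np

def byte_histogram(payload):
    # Relative frequency of each of the 256 possible byte values.
    counts = np.bincount(np.frombuffer(payload, dtype=np.uint8), minlength=256)
    return counts / max(len(payload), 1)

# Model of typical traffic: per-byte mean and standard deviation,
# estimated from a (toy) set of training payloads.
train = [b"GET /index.html HTTP/1.1", b"GET /about.html HTTP/1.1"]
hists = np.stack([byte_histogram(p) for p in train])
mean, std = hists.mean(axis=0), hists.std(axis=0)

def score(payload, alpha=1e-3):
    # Simplified (diagonal) Mahalanobis distance to the normal profile;
    # alpha smooths away divisions by zero for never-seen bytes.
    return float(np.sum(np.abs(byte_histogram(payload) - mean) / (std + alpha)))

print(score(b"GET /index.html HTTP/1.1"))   # close to the profile
print(score(b"\x90\x90\x90\x90\x90\x90"))   # far from the profile
\end{verbatim}

An observation whose score exceeds a preset number of standard deviations would then be marked as anomalous.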
\citet{hutchison_anomalous_2004} show that their approach accurately signals anomalies on labeled data while keeping a false positive rate of less than 1\%. The approach managed to pick up anomalies from an unlabeled dataset but has not been tested on real-life data streams. As PAYL is a simple approach, \citet{hutchison_anomalous_2004} recommend that it be used in an ensemble with other IDS for increased coverage and a reduced number of false alarms.
\subsection{New Ensemble Methods For Evolving Data Streams \citep{bifet_new_2009}}
\citet{bifet_new_2009} approach the anomaly detection problem using an ensemble of Hoeffding decision trees that approximates the generative distribution of a multivariate data stream and evolves that distribution incrementally alongside the stream. The individual predictions of each tree in the ensemble are aggregated with bagging and/or boosting, which means that individual decisions are averaged (either weighted or unweighted) to find an ensemble decision. Using a weighted average (i.e., boosting) based on the prediction error of individual trees causes well-performing trees to be weighted higher and partially accounts for concept drift, as outdated predictors that model a past distribution tend to perform worse.
In their best-performing configuration, \citet{bifet_new_2009} also include the ADWIN change detector in their system. ADWIN is an algorithm that buffers a window of past events, which allows computing the statistics necessary to detect a change in concepts. The ADWIN bagging that \citet{bifet_new_2009} apply makes use of ADWIN to estimate the weights of the predictors in the ensemble. When ADWIN detects a change in concepts, the algorithm removes the predictor with the highest error and adds a new one to the ensemble.
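The core idea of ADWIN can be sketched in a few lines of Python. The following is a deliberately simplified, non-optimized illustration (the real algorithm uses an efficient bucket structure; the bound below is a Hoeffding-style stand-in with illustrative constants):

\begin{verbatim}
import math
from collections import deque

class SimpleAdwin:
    # Cut the window whenever two sub-windows differ significantly in mean.
    def __init__(self, delta=0.01, max_len=500):
        self.delta = delta
        self.window = deque(maxlen=max_len)

    def update(self, x):
        self.window.append(x)
        w = list(self.window)
        for split in range(1, len(w)):
            left, right = w[:split], w[split:]
            gap = abs(sum(left) / len(left) - sum(right) / len(right))
            m = 1.0 / (1.0 / len(left) + 1.0 / len(right))  # harmonic size
            eps = math.sqrt(math.log(4.0 / self.delta) / (2.0 * m))
            if gap > eps:                  # distributions diverged:
                for _ in range(split):     # forget the outdated prefix
                    self.window.popleft()
                return True                # change detected
        return False
\end{verbatim}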
The ensembles presented in \citet{bifet_new_2009} outperform the baseline Naive Bayes model and the other decision tree algorithms they were benchmarked against. However, the ensemble approach has a relatively high overall complexity in the time and memory dimensions, as it combines several systems to make a decision. While ADWIN bagging tends to be the most accurate method in a drifting environment, it is also one of the slowest in terms of processing time.
\subsection{Expected Similarity Estimation for Large-Scale Batch and Streaming Anomaly Detection \citep{schneider_expected_2016}}
The EXPoSE approach as proposed by \citet{schneider_expected_2016} is a kernel-based algorithm that constructs an embedding from the underlying data and, by working in that embedding space, can efficiently calculate the similarity of new observations and normal data. EXPoSE does not make assumptions regarding the size or shape of the underlying generative distribution (except that there is a distribution), which is advantageous when compared to parametric statistical models. Furthermore, the algorithm is able to process complex multivariate data like images and videos.
EXPoSE can learn incrementally in constant time per observation and, when trained as such, produces the same model as when trained offline. Additionally, it can also make predictions online and in constant time, using constant memory. The design of the algorithm allows for parallelization, which makes it much more applicable to big data problems. EXPoSE can be extended to account for concept drift by working on sliding windows over the data stream, or by applying a gradual forgetting mechanism to outdated observations. The main advantage of EXPoSE lies in its higher efficiency and lower complexity when compared to state-of-the-art algorithms that, overall, perform similarly well.
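The central mechanism, scoring by expected similarity to a kernel mean embedding, can be approximated in a short sketch. The version below uses random Fourier features to approximate an RBF kernel; all dimensions and constants are illustrative, not taken from the paper:

\begin{verbatim}
import numpy as np

rng = np.random.default_rng(0)
d, D, gamma = 8, 256, 1.0        # input dim, feature dim, kernel width

# Random Fourier features approximating the RBF kernel k(x, y).
W = rng.normal(scale=np.sqrt(2 * gamma), size=(D, d))
b = rng.uniform(0, 2 * np.pi, size=D)
phi = lambda X: np.sqrt(2.0 / D) * np.cos(X @ W.T + b)

# The model is just the mean of the feature maps (kernel mean embedding);
# it can be maintained incrementally as a running average.
train = rng.normal(size=(5000, d))
mu = phi(train).mean(axis=0)

def expected_similarity(x):
    # Inner product with the mean embedding: the expected kernel
    # similarity of x to the training data; low values mean anomalous.
    return float(phi(x[None, :]) @ mu)

print(expected_similarity(np.zeros(d)))      # high: near the bulk of data
print(expected_similarity(np.full(d, 8.0)))  # low: far from normal data
\end{verbatim}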
\subsection{Anomaly Detection in Streaming Nonstationary Temporal Data \citep{talagala_anomaly_2019}}
\citet{talagala_anomaly_2019} propose an approach based on a bivariate two-sample nonparametric test. The approach projects several univariate time-series into a two-dimensional problem space and applies the statistical test to find anomalous sequences. Before the approach can process new observations from multiple data streams in an online fashion, it first needs to be warmed up using a batch of typical data. This warm-up period serves to initialize the model parameters to non-arbitrary values.
To account for drift, \citet{talagala_anomaly_2019} propose an informed approach that evaluates the statistical distance between the current typical distribution and the distribution of the current test window. The model only needs to be updated when that distance grows significantly large, which allows for quicker decisions and reduced computational complexity. In their evaluation, \citet{talagala_anomaly_2019} find that their framework performs well when applied to nonstationary streams with noise and multi-modal distributions (i.e., more than one kind of typical behavior).
\section{Distance-based Approaches}
Distance-based approaches are based on the concepts of neighbors and cluster analysis, which are generally both easy to implement and interpret \citep{schneider_expected_2016}. While normal data is in close proximity to other data (i.e., tightly clustered or close to its neighbors), anomalies are scattered far off from the majority of other data. The first subcategory of distance-based approaches consists of nearest-neighbor approaches like k-nearest-neighbor (kNN), which are among the most popular for anomaly detection in general. However, these methods are not easily applicable in an incremental learning streaming context \citep{pimentel_review_2014}. The second subcategory comprises clustering approaches like k-means clustering, which have been applied to the streaming context in a considerable body of research.
Distance-based approaches are highly dependent on a suitable distance measure. Approaches often work under the assumption that the data space is metric, as this allows for efficient indexing \citep{kontaki_continuous_2011}. As distance-based methods rely on the computation of distances between data points, they run into scalability and efficiency issues when applied to high-dimensional datasets. The complexity of the individual distance calculations can be high, and, as there is a high number of operations, the overall computation can grow to be impractical. Additionally, distance measures tend to fail in distinguishing anomalies and normal data in high-dimensional contexts due to the ``curse of high dimensionality'' \citep{pimentel_review_2014}.
In a streaming context, distance-based approaches face several challenges: maintaining a correct clustering over time, operating within small amounts of time and memory, clustering incrementally, and coping with the appearance and disappearance of clusters due to concept drift \citep{gama_survey_2012}. As the distance computations cause a relatively high computational complexity in the test phase, distance-based approaches need to be carefully tuned for usage in an online context, as they could otherwise cause bottlenecks \citep{pimentel_review_2014}. A foundational element that is often used to improve the efficiency in data stream clustering is the idea of cluster features. As \citet{gama_survey_2012} state: ``A cluster feature, or microcluster, is a compact representation of a set of points.'' Algorithms that work with cluster features first reduce the overall data points to local microclusters. Based on these microclusters, it takes much less work to generate a correct overall clustering, and the algorithm requires less of the very constrained memory \citep{gama_survey_2012, cao_density-based_2006}.
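As a rough illustration of the concept (a BIRCH-style cluster feature, not the exact structure of any specific streaming algorithm), the following Python sketch shows how a triple of count, linear sum, and squared sum suffices to maintain a micro-cluster incrementally:

\begin{verbatim}
import numpy as np

class ClusterFeature:
    # Compact summary of a set of points: the count n, linear sum ls, and
    # squared sum ss allow recovering center and radius incrementally.
    def __init__(self, dim):
        self.n, self.ls, self.ss = 0, np.zeros(dim), 0.0

    def insert(self, x):
        self.n += 1
        self.ls += x
        self.ss += float(x @ x)

    def center(self):
        return self.ls / self.n

    def radius(self):
        # Root-mean-square distance of the members from the center.
        c = self.center()
        return np.sqrt(max(self.ss / self.n - float(c @ c), 0.0))

cf = ClusterFeature(2)
for p in np.random.normal(loc=[1.0, 2.0], scale=0.1, size=(100, 2)):
    cf.insert(p)
print(cf.center(), cf.radius())   # summary of 100 points in O(1) memory
\end{verbatim}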
Additional challenges include the fact that clustering algorithms for stream environments cannot assume an initial or fixed number of clusters or other parameters, as the stream continuously evolves with new clusters appearing and old ones disappearing. The clusters can have an arbitrary shape, and new observations can initialize a new cluster if they arrive far from any existing cluster. Algorithms also need to be able to handle noise, as data streams are susceptible to sensor failures and other random events. An ideal algorithm would even be able to distinguish noise and anomalies from newly developing normal clusters \citep{cao_density-based_2006}.
\subsection{Density-Based Clustering over an Evolving Data Stream with Noise \citep{cao_density-based_2006}}
The DenStream approach introduced by \citet{cao_density-based_2006} builds on the idea of cluster features and constructs its internal model of normality using core-micro-clusters, which are dense approximations with attributes for weight, center, and radius \citep{miller_twitter_2014}. Core-micro-clusters can approximate clusters of arbitrary shape and size and are used to represent clusters of both normal observations and anomalies. New observations can then be evaluated against the existing clusters and, if warranted, marked as anomalous.
DenStream applies a damped window to the underlying data stream, allowing it to weigh observations with a decay function (i.e., old observations are weighted exponentially less than new ones) so that only a part of the infinite stream needs to be kept in memory. The observations in the window are summarized into compact micro-clusters (finer-grained than natural clusters, yet far more compact than the raw observations). To account for the appearance of new clusters, clusters of outliers (o-micro-clusters) that reach a certain weight are promoted to potential core-micro-clusters (p-micro-clusters), and vice versa for p-micro-clusters that shrink comparably. After outlier and potential micro-clusters have been formed, DenStream applies a variant of the DBSCAN clustering algorithm to all p-micro-clusters, which yields a final clustering. Each p-micro-cluster is represented as a point with a weight derived from its size.
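The damped-window mechanism can be illustrated with a minimal sketch of a fading micro-cluster weight (the decay function follows the paper's general form; the threshold value is purely illustrative):

\begin{verbatim}
class FadingMicroCluster:
    # A micro-cluster whose weight decays exponentially over time; lam
    # (lambda) controls how quickly old observations are forgotten.
    def __init__(self, lam=0.25):
        self.lam, self.weight, self.last_t = lam, 0.0, 0.0

    def insert(self, t, w=1.0):
        # Fade the existing weight by 2^(-lambda * dt), then add the point.
        self.weight *= 2.0 ** (-self.lam * (t - self.last_t))
        self.weight += w
        self.last_t = t

promotion_threshold = 3.0   # illustrative stand-in for beta * mu
mc = FadingMicroCluster()
for t in range(10):
    mc.insert(float(t))
print(mc.weight, mc.weight >= promotion_threshold)  # promote o- to p-cluster?
\end{verbatim}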
In general, DenStream achieves a very good clustering quality with a purity that is consistently above 95\% when applied to the evaluation datasets. The evaluation shows that DenStream can reliably separate noise from relevant observations and that it works much more memory-efficiently than comparable algorithms. The computational complexity of DenStream is linear in the number of observations processed, which means that it is well capable of handling high-velocity data streams.
\subsection{Continuous Monitoring of Distance-Based Outliers over Data Streams \citep{kontaki_continuous_2011}}
\citet{kontaki_continuous_2011} introduce a distance-based approach that reduces the number of computations with a novel event-based framework that only schedules distance computations if potentially relevant events occur. For example, an observation that is considered to be normal based on the current window can only become anomalous if its neighbors leave the current window, which is an event that can be planned for.
By tracking the expiration times of objects in the current window, related computations only need to be executed when such an object leaves the window. As the same kind of scheduling cannot be done for the inverse event (i.e., new arrivals cannot be planned for), the approach only schedules the expiration of existing normal observations. An additional optimization applied in COD is based on the notion that certain normal observations can never become anomalous and thus do not need to be watched in the event queue. This notion of ``safe inliers'' captures the idea that if there are enough neighbors following an observation in the current window, even the expiration of all its preceding neighbors cannot make it anomalous.
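The safe-inlier criterion reduces to a simple count, sketched below (names and data structures are illustrative):

\begin{verbatim}
def is_safe_inlier(obs, arrival_time, neighbors, k):
    # An inlier is "safe" if at least k of its neighbors arrived after it:
    # even if every preceding neighbor expires, k neighbors remain in the
    # window at least as long as the observation itself.
    succeeding = [n for n in neighbors if arrival_time[n] > arrival_time[obs]]
    return len(succeeding) >= k

arrival_time = {"a": 1, "b": 2, "c": 3, "d": 4}
print(is_safe_inlier("a", arrival_time, ["b", "c", "d"], k=3))  # True
print(is_safe_inlier("c", arrival_time, ["a", "b", "d"], k=3))  # False
\end{verbatim}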
As an extension of their COD algorithm, \citet{kontaki_continuous_2011} also propose the MCOD algorithm that extends COD with the notion of micro-clusters (i.e., dense regions of normal observations). MCOD applies micro-clusters to outlier detection in that it classifies any observation within a micro-cluster as certainly normal, and any observation outside as an anomaly. Using micro-clusters as such significantly reduces the number of distance computations that need to be performed. The focus of the evaluation has been on overall efficiency, which greatly improved in comparison to an existing benchmark algorithm (i.e., Abstract-C).
\subsection{Twitter spammer detection using data stream clustering \citep{miller_twitter_2014}}
The amount of spam on the Twitter platform has dramatically increased in recent years, which can be attributed to the growing pervasiveness of social media platforms in general. To address that spam problem, \citet{miller_twitter_2014} apply a density-based clustering approach that can learn clusters of normal tweets and subsequently judges unknown tweets based on their distance to these clusters. Each tweet is represented by a vector containing 107 numerical features that summarize both content-related and contextual properties.
The first layer in the system, the StreamKM++ algorithm, is based on an extended k-means++ algorithm \citep{ackermann_streamkm_2012}. The extensions that make it applicable to data streams are the windowing of the data stream in conjunction with the data reduction performed using coresets, which are approximate aggregations of observations. To improve the performance of the system, a second layer based on the DenStream algorithm makes another pass on the observations processed by the first layer. While the first layer is optimized such that it produces a minimal amount of false-negatives and, conversely, produces more false-positives, the second layer processes the stream more efficiently by only reconsidering flagged anomalies (normal observations are assumed to be flagged correctly).
The overall system processes observations online and only requires a single pass over each observation. The application of a two-layer approach significantly reduces both false-positive and false-negative rates and improves all other metrics when compared with each layer in separation. After tuning parameters on a training set of tweets, the two-layer system manages to achieve very high accuracy (>95\%), 100\% recall, and a low rate of false positives on the test set.
\subsection{NETS: Extremely Fast Outlier Detection from a Data Stream via Set-Based Processing \citep{yoon_nets:_2019}}
\citet{yoon_nets:_2019} propose a distance-based approach that exploits a newly discovered characteristic when windowing a data stream: namely, that ``the change in the locations of data points in the data space is typically very insignificant.'' As new observations enter the window and old ones expire, it is very probable that new observations replace expired ones. \citet{yoon_nets:_2019} suggest that this leads to many redundant distance-computations that can be avoided if observations at similar locations are handled in groups, and the so-called ``net effect'' is exploited (i.e., the effect that new observations can cancel out the effects of expired ones if they arrive in similar places).
To exploit the ``net effect'', \citet{yoon_nets:_2019} propose a set-based approach that processes both the expired window and the new window concurrently and groups spatially close observations, allowing the approach to skip calculations that would be redundant (e.g., one arriving and one expiring observation in the same place cancel out). In comparison, a traditional approach would first remove expired observations from the window, then add new observations, and finally compute anomalies on that window. Due to the ``curse of high dimensionality'', the efficiency of NETS drops as the dimensionality of the data grows to a large scale. To ameliorate the issue, NETS applies two-level dimensional filtering, which means that it detects anomalies and normal data in a subspace with fewer dimensions, after which it processes the remainder of the data in the full-dimensional space. This approach allows NETS to process even a sparse high-dimensional data stream efficiently.
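The set-based update can be sketched as bucketing observations into grid cells and applying only the net change per cell (cell width and data are illustrative):

\begin{verbatim}
from collections import Counter

def cell(point, width=1.0):
    # Map a point to its grid cell; spatially close points share a cell.
    return tuple(int(c // width) for c in point)

def slide(window_counts, expired, arrived):
    # Net effect: a +1 and a -1 in the same cell cancel out, so cells
    # whose count is unchanged need no further distance computations.
    delta = Counter(cell(p) for p in arrived)
    delta.subtract(cell(p) for p in expired)
    changed = {c for c, d in delta.items() if d != 0}
    window_counts.update(delta)
    return changed   # only these cells must be re-examined

counts = Counter()
changed = slide(counts, expired=[(0.2, 0.4)], arrived=[(0.3, 0.1), (5.0, 5.0)])
print(changed)   # {(5, 5)}: the +1/-1 in cell (0, 0) cancelled out
\end{verbatim}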
As \citet{yoon_nets:_2019} show in their evaluation, these optimizations allow NETS to outperform other state-of-the-art algorithms like MCOD and LEAP by a wide margin (10x and 24x faster, respectively). The time-complexity of NETS is lower than comparable algorithms, but the space-complexity is about the same.
\section{Reconstruction-based Approaches}
Reconstruction-based approaches perform a regression on the training set to predict a new event and compute the distance between the predicted value and the actual observation. The anomaly score of an event can be computed based on that distance (i.e., the reconstruction error), allowing for the classification of an incoming event as anomalous or normal. Reconstruction-based approaches make no assumptions about the data and perform well even in high-dimensionality contexts. They tend to have high complexity in the training phase but are fast when it comes to evaluating a new observation \citep{pimentel_review_2014}.
A special set of reconstruction-based approaches, called spectral methods, transforms the high-dimensional data space into a lower-dimensional one before attempting to distinguish anomalies from normal observations. These methods work under the assumption that anomalies are easier to distinguish in fewer dimensions. The most common technique for performing such a reduction in dimensionality is Principal Component Analysis (PCA) \citep{pimentel_review_2014}.
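A minimal spectral-method sketch in Python: project observations onto the principal subspace and use the reconstruction error as the anomaly score (the number of components and the data are illustrative):

\begin{verbatim}
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(1000, 10)) @ rng.normal(size=(10, 10))  # training data

# Principal subspace from the top-k right singular vectors.
mean = X.mean(axis=0)
_, _, Vt = np.linalg.svd(X - mean, full_matrices=False)
V = Vt[:3].T                                  # keep k = 3 components

def reconstruction_error(x):
    # Project into the subspace and back; anomalies lose more
    # information in the round trip than normal observations do.
    z = (x - mean) @ V
    x_hat = mean + z @ V.T
    return float(np.linalg.norm(x - x_hat))

print(reconstruction_error(X[0]))                      # small: normal
print(reconstruction_error(rng.normal(size=10) * 25))  # large: outlier
\end{verbatim}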
Another subgroup of reconstruction-based approaches, the neural-network methods, is based on a hill-climbing optimization of a fixed set of parameters. Neural-network models can be highly sensitive to the parameters they are trained with, which can make it difficult to use them with high-dimensional data. Additionally, as many reconstruction-based approaches, including neural networks, are based on iterative optimization, the choice of optimization method is often critical for the success of the model \citep{kanarachos_detecting_2017}.
\subsection{Contextual anomaly detection framework for big sensor data \citep{hayes_contextual_2015}}
\citet{hayes_contextual_2015} apply a two-layer approach consisting of a content anomaly detector that detects spatial anomalies with a fast probabilistic predictor, as well as a contextual anomaly detector that subsequently processes anomalies with a clustering-based approach. The contextual detection layer includes contextual information by building clusters for known sensor profiles (i.e., the average over a set of sensors in the same context) and subsequently training a classifier for each cluster. Once a point anomaly is detected based purely on its content, it can be further evaluated to determine whether the anomaly persists as a contextual anomaly in the context of its sensor profile (i.e., by checking whether it belongs to the specific cluster). This two-layer approach has the potential to significantly reduce the rate of false-positives. However, as both layers of the system are trained offline, periodic retraining would be necessary to account for concept drift.
When compared against the outliers package as included in the R distribution, the contextual detection framework in \citet{hayes_contextual_2015} performed equally well but introduced the capability of working in an online environment. The described performance benefit, however, only holds under the assumption that the algorithm processes many low-dimensional records instead of fewer high-dimensional ones, as it focuses on reducing the number of records running through the entire process. Furthermore, the approach includes several optimizations specific to the big data context. The algorithm has been designed to be parallelizable such that only the contextual clustering layer requires centralized execution. The content-based detection layer can be executed in parallel on a cluster of machines. Additionally, the computational requirements are reduced by the layering approach, as the computationally expensive contextual detector only needs to process events that have been marked as anomalies by the fast content-based detection layer.
\subsection{Time Series Anomaly Detection \citep{shipmon_time_2017}}
\citet{shipmon_time_2017} perform a reconstruction of observations in univariate time-series using neural-network-based regression. The reconstruction error of their model is analyzed with an ensemble of a probabilistic Gaussian tail-probability threshold and an accumulator that increases for collective anomalies (i.e., on sequences of point anomalies), thereby filtering out anomalies caused by pure noise. As their model profits from a higher number of features, \citet{shipmon_time_2017} extend the dimensionality of their time-series by engineering additional features from the timestamp and past labels (e.g., by using derivatives).
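The accumulator component can be sketched in a few lines; the step, decay, and threshold values below are illustrative stand-ins, not the paper's parameters:

\begin{verbatim}
def accumulator(signals, step_up=1.0, decay=0.7, threshold=5.0):
    # Rises on each anomalous point and decays otherwise: isolated noise
    # never reaches the threshold, while sustained deviations do.
    level, alarms = 0.0, []
    for anomalous in signals:
        level = level + step_up if anomalous else level * decay
        alarms.append(level > threshold)
    return alarms

noise = [True, False, False, False] * 5    # isolated spikes
burst = [False] * 10 + [True] * 8          # a collective anomaly
print(any(accumulator(noise)))   # False: pure noise is filtered out
print(any(accumulator(burst)))   # True: the sustained sequence alarms
\end{verbatim}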
When evaluating their approach on various static datasets, both artificial and real ones, and comparing the results with simple baseline models, \citet{shipmon_time_2017} show that the approach yields robust results on periodic time-series but does not perform well on non-periodic ones. While the underlying neural network can, in principle, be trained incrementally, the need for labeling would make it hard to apply the approach to online processing scenarios.
\subsection{Unsupervised real-time anomaly detection for streaming data \citep{ahmad_unsupervised_2017}}
\citet{ahmad_unsupervised_2017} propose a ``theoretical framework for sequence learning in the cortex'' that applies the Hierarchical Temporal Memory (HTM) sequence learning methodology to the domain of anomaly detection. HTM is based on principles of neuroscience that have been transferred to the area of machine intelligence, resulting in a system that emulates synapses and neurons in a human brain.
The anomaly detector introduced by \citet{ahmad_unsupervised_2017} builds an internal model from spatiotemporal data and continuously evolves during processing. When processing a new observation, the HTM system predicts expected future behavior and compares it to the real observation to get a measure of the reconstruction error. Based on the reconstruction error, the approach probabilistically computes the likelihood of the system being in an anomalous state, and thresholds that likelihood to reach its final decision. As the threshold is not directly applied to the reconstruction error, the number of false alarms because of noise can be significantly reduced.
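The likelihood step can be approximated by modeling the recent distribution of raw errors and thresholding the Gaussian tail probability, as in the following simplified sketch (window size and threshold are illustrative):

\begin{verbatim}
import math
from collections import deque

class AnomalyLikelihood:
    # Thresholds the likelihood of the current error under the recent
    # error distribution rather than the raw error itself, which
    # suppresses false alarms in inherently noisy streams.
    def __init__(self, history=500):
        self.errors = deque(maxlen=history)

    def update(self, raw_error):
        self.errors.append(raw_error)
        n = len(self.errors)
        mean = sum(self.errors) / n
        var = sum((e - mean) ** 2 for e in self.errors) / max(n - 1, 1)
        std = math.sqrt(var) or 1e-9
        z = (raw_error - mean) / std
        tail = 0.5 * math.erfc(z / math.sqrt(2))  # Gaussian tail probability
        return 1.0 - tail                         # likelihood in [0, 1]

al = AnomalyLikelihood()
for e in [0.1] * 200 + [0.9]:      # stable errors, then a spike
    likelihood = al.update(e)
print(likelihood > 0.99999)        # True: flag only highly unlikely states
\end{verbatim}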
In their evaluation based on the open-source NAB scoring system and datasets, \citet{ahmad_unsupervised_2017} show that their approach outperforms almost all of the other evaluated state-of-the-art approaches, especially when a low number of false-negatives is rewarded. However, the HTM approach incurs a significantly higher latency (i.e., time for processing a new observation in real-time) when compared to these same other approaches.
\subsection{Detecting anomalies in time series data via a deep learning algorithm combining wavelets, neural networks and Hilbert transform \citep{kanarachos_detecting_2017}}
The framework proposed by \citet{kanarachos_detecting_2017} transforms univariate time series into multiple signal subcomponents (each on a different time-scale). This so-called ``wavelet transformation'' reduces the influence of noise and enables the framework to evaluate the input at different time-scales. Every subcomponent (``wavelet'') is processed by a dedicated neural network that learns the normal behavior of its respective time-scale. Combining the component networks and integrating them with additional layers then results in a deep neural network model that can reconstruct the expected signal. The final decision of the anomaly detector is based on features extracted by applying a Hilbert transformation to the error signal (i.e., the time series of reconstruction errors), and by thresholding the resulting values with a probabilistic method that optimizes statistical significance and ROC.
Using a neural network enables incremental learning and online processing, as weights can be updated with each new observation (or mini-batch thereof). However, the anomaly detector detects temporal anomalies (i.e., anomalies in a time context) based on signal frequencies and transformations and is, therefore, dependent on having a window of observations available. Furthermore, while incremental learning aids in the evolution of the model and in recognizing emerging patterns, concept drift has not been explicitly accounted for in the approach (e.g., by means of expiration).
\citet{kanarachos_detecting_2017} evaluated their work on two real-life use cases from signal processing (e.g., earthquake detection), and showed that the approach could be applied to different datasets without requiring significant manual tuning or domain knowledge, which was the main goal of the overall work.
\section{Domain-based Approaches}
Domain-based methods compute a decision boundary around the ``normal area'' of a dataset without deriving an explicit generative distribution. Observations that lie outside of this boundary can then be classified as anomalies. While working only on the domain is advantageous and requires relatively little training data, it also means that a domain-based model is easily negatively influenced by outliers in the dataset \citep{pimentel_review_2014}.
Support Vector Machines (SVM) are a popular method for domain-based anomaly detection as they build a decision boundary (a hyperplane) using only data points close to the border (i.e., support vectors), ignoring the data points within the boundary. Working only with support vectors means that SVMs do not depend on knowing the distribution of the training data and can work only with the domain. As SVMs are only applicable to a batch processing context, there has been some research on other domain-based methods that can work with online processing and incremental learning \citep{pimentel_review_2014}.
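For contrast with the streaming methods discussed below, a batch one-class SVM can be sketched with scikit-learn (dataset and parameters are illustrative):

\begin{verbatim}
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(2)
train = rng.normal(loc=0.0, scale=1.0, size=(500, 2))  # normal data only

# nu bounds the fraction of training points allowed outside the boundary;
# the RBF kernel lets the boundary take a non-linear shape.
model = OneClassSVM(kernel="rbf", nu=0.05, gamma="scale").fit(train)

test = np.array([[0.1, -0.2],    # inside the learned boundary
                 [6.0, 6.0]])    # far outside of it
print(model.predict(test))       # [ 1 -1]: -1 marks an anomaly
\end{verbatim}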
\subsection{Isolation Forest \citep{liu_isolation_2008}}
\citet{liu_isolation_2008} propose an anomaly detection approach that isolates observations on a multivariate data stream by recursively partitioning the data space with decision-tree rules. As an anomaly is easier to isolate than normal observations, anomalies tend to require only a few levels of partitioning and should thus be located higher up in the resulting decision tree. The approach exploits this pattern in computing an anomaly score based on the normalized path-length of a new observation. To account for the randomness in tree partitioning, the approach combines a parametrized number of trees in an ensemble and uses the more robust averaged statistics. Additionally, \citet{liu_isolation_2008} apply a subsampling strategy that enables learning on a small part of the overall data (e.g., a subset of 256 observations).
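The technique is available as \texttt{IsolationForest} in scikit-learn; the following sketch mirrors the described ensemble and subsampling settings (the data and the remaining parameters are illustrative):

\begin{verbatim}
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(3)
X = rng.normal(size=(10_000, 4))

# 100 trees, each grown on a random subsample of 256 observations,
# mirroring the subsampling strategy described above.
forest = IsolationForest(n_estimators=100, max_samples=256, random_state=0)
forest.fit(X)

queries = np.vstack([np.zeros(4), np.full(4, 10.0)])
print(forest.predict(queries))        # [ 1 -1]: -1 marks an anomaly
print(forest.score_samples(queries))  # lower score = easier to isolate
\end{verbatim}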
An in-depth evaluation shows that the approach achieves the highest AUC on most of the considered evaluation datasets. The subsampling strategy does not significantly impact accuracy but reduces the processing time of the approach by a large amount, allowing the approach to work faster than benchmarked approaches (on large datasets). As the approach separates training and testing stages, it has to be retrained periodically to account for concept drift in evolving data streams.
\subsection{Data Stream Anomaly Detection through Principal Subspace Tracking \citep{dos_santos_teixeira_data_2010}}
\citet{dos_santos_teixeira_data_2010} introduce a new approach for domain-based anomaly detection that works in a projected subspace of univariate numerical input streams. The proposed approach performs a dimensionality reduction in an incremental way (by estimation) and tracks the number of latent variables needed to explain the observations in the resulting subspace. An increase in the number of required variables indicates the beginning of a new collective anomaly.
Existing approaches most often apply PCA and are not practical in the streaming context, as they require expensive calculations or need to persist all data in memory. Additionally, such methods tend to be parameterized for the number of dimensions in the projection space, which makes it hard to adapt to changes in the stream. The method proposed by \citet{dos_santos_teixeira_data_2010}, which is called FRAHST, automatically estimates the number of dimensions and incrementally estimates the statistical properties of the data using a decay factor, allowing the approach to efficiently perform dimensionality reduction in a dynamic and continuously evolving context.
The advantages of FRAHST are its robustness and its practicable complexity, as well as its low number of parameters (decay rate and desired error). When evaluated against four other algorithms on several real and synthetic datasets, FRAHST is the only technique that consistently identifies anomalies with an F1 score above 80 percent. Additionally, the algorithm has been implemented and tested in the data centers of an internet provider, which demonstrates its applicability in a real context.
\section{Information-theoretic Approaches}
Approaches based on information theory calculate the information content of data with measures like entropy. If a new observation significantly changes the information content of the overall dataset, that observation can, therefore, be considered anomalous. Information-theoretic methods are highly dependent on the choice of the measure for information, as the measure needs to be able to detect the effects of anomalies, which is only possible if that effect is sufficiently large. Additionally, it is often challenging to derive a suitable score from the measured effects. The exact computation of information content is also relatively expensive (i.e., the test phase), which is a critical impediment in the streaming context \citep{pimentel_review_2014, kanarachos_detecting_2017}.
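The measure itself is simple to compute; the following sketch derives the Shannon entropy of the value distribution within a window (the example data anticipates the in-vehicle use case discussed below):

\begin{verbatim}
import math
from collections import Counter

def entropy(window):
    # Shannon entropy (in bits) of the value distribution in a window;
    # low randomness (many repeated values) yields low entropy.
    counts = Counter(window)
    n = len(window)
    return -sum(c / n * math.log2(c / n) for c in counts.values())

normal_traffic = ["speed", "rpm", "brake", "speed", "rpm", "door"]
flooded = ["speed"] * 6            # e.g., a message-flooding attack
print(entropy(normal_traffic))     # higher: varied messages
print(entropy(flooded))            # 0.0: a suspicious drop in randomness
\end{verbatim}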
\subsection{Entropy-Based Anomaly Detection for In-Vehicle Networks \citep{muter_entropy-based_2011}}
As automotive systems are increasingly connected and integrated with external systems, their attack surface grows accordingly. To address this issue, \citet{muter_entropy-based_2011} propose a system that analyzes vehicle network communications at runtime. As vehicles have a very long lifetime but are hard to update, \citet{muter_entropy-based_2011} suggest that an anomaly detection system could be a better solution than traditional signature-based intrusion detection that needs to be updated regularly.
\citet{muter_entropy-based_2011} define their notion of normality using the concept of entropy, as it does not require knowing anything about the data. As \citet{muter_entropy-based_2011} state, ``entropy allows an abstract representation of the randomness of this data.'' While entropy can be high in regular computer networks with high randomness in network communications, the communication in a vehicular network is highly standardized (e.g., with allowed values, value ranges, and frequencies), which results in lower entropy. Because of this, changes in entropy (e.g., a sudden drop due to flooding with many identical messages) can serve as a suitable anomaly detection measure in such a network.
In their evaluation on real vehicles, \citet{muter_entropy-based_2011} find that their approach can successfully identify manually executed attacks in vehicular networks and is more practically applicable than related work. Firstly, it only requires data that represents normal behavior, not explicit specifications about message formats or content restrictions, which makes it easy to transfer and deploy to a variety of vehicles. Additionally, the approach is usable in the low-resource embedded computing context, as it is based on relatively cheap computations like tracking changes in entropy and applying a threshold.
\subsection{Statistical Techniques for Online Anomaly Detection in Data Centers \citep{wang_statistical_2011}}
Current approaches for data center monitoring are often based on assumptions about the distribution of the data and thresholds that are trained offline and kept at a constant level during processing. Therefore, the methods are not well suited to the evolving patterns and bursty network behavior prevalent in data centers. To improve on this, \citet{wang_statistical_2011} introduce a new online processing approach for univariate time series that processes streaming data in windows and examines the window contents using a statistical test. More specifically, the distribution of observations in each window is compared to the distribution of observations in different periods/contexts using a test based on relative entropy (i.e., multinomial goodness-of-fit). When the two distributions do not match well enough, the null hypothesis can be rejected, and the window is marked as anomalous.
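A simplified version of such a window test can be sketched as follows. The sketch uses the standard G-test correspondence between relative entropy and the chi-square distribution; the bin layout and significance level are illustrative, not the paper's exact configuration:

\begin{verbatim}
import numpy as np
from scipy.stats import chi2

def relative_entropy(p, q, eps=1e-12):
    # KL divergence D(p || q) in nats between two histograms.
    p, q = np.asarray(p, float) + eps, np.asarray(q, float) + eps
    p, q = p / p.sum(), q / q.sum()
    return float(np.sum(p * np.log(p / q)))

def window_is_anomalous(window_counts, reference_probs, alpha=0.01):
    # G-test: 2 * N * D(p_hat || p_ref) is asymptotically chi-square
    # distributed with (bins - 1) degrees of freedom under the null.
    n = int(np.sum(window_counts))
    stat = 2.0 * n * relative_entropy(window_counts, reference_probs)
    return stat > chi2.ppf(1.0 - alpha, df=len(window_counts) - 1)

reference = [0.25, 0.25, 0.25, 0.25]     # learned typical distribution
print(window_is_anomalous([26, 24, 25, 25], reference))  # False: fits
print(window_is_anomalous([90, 4, 3, 3], reference))     # True: anomalous
\end{verbatim}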
To account for changing usage patterns of network traffic based on the time-of-day, \citet{wang_statistical_2011} additionally compute the distribution of different temporal contexts (e.g., traffic between 11:00 and 12:00) and compare windows to the matching context. In their evaluation, \citet{wang_statistical_2011} show that the approach performs better than comparable Gaussian approaches, especially if different contexts are evaluated to form an ensemble decision (e.g., the corresponding time bucket, the recent past, and all past). Furthermore, the computations performed by the approach are lightweight in both time and space, as all that is required is the tracking and computation of statistics regarding frequencies of occurrence.
\subsection{Online Anomaly Detection over Big Data Streams \citep{rettig_online_2015}}
\citet{rettig_online_2015} introduce an online anomaly detection approach that detects contextual (temporal) anomalies on multiple data streams. The approach continuously maintains aggregate statistics across windows on a stream (relative entropy) and across streams (correlation coefficient). These separate measures enable the detection of anomalous changes in the behavior of one stream (i.e., if the relative entropy changes significantly), as well as the detection of behavior that is anomalous in the context of a related stream (e.g., abrupt changes causing the correlation to drop). \citet{rettig_online_2015} implemented their approach in a real stream processing system (Kafka and Spark) to showcase their focus on generalization and real-world applicability.
When evaluated on a real telecommunications infrastructure, the approach detects both anomalies caused by users and anomalies through infrastructure failure. Additional experiments with different numbers of processing nodes showed that the approach scales well and can be used to process large amounts of data simultaneously. The overall evaluation is predominantly based on domain knowledge and in-depth manual analysis, which clarifies the reasons for the observed results, but could make the evaluation harder to replicate.