Hasso-Plattner-Institut
Prof. Dr. Felix Naumann
 

06.09.2024

Phillip Wenig successfully defended his Ph.D. dissertation on September 4th, 2024 at the HPI! His work focuses on the topic "Finding, Clustering, and Classifying Anomalies on Large and Multivariate Time Series".

Abstract
Multivariate time series are a form of real-valued sequence data that simultaneously record different time-dependent variables. They originate mostly from multi-sensor setups and serve a variety of important analytical purposes, including the detection of normal and abnormal behavior. Anomalies often occur in individual channels of a time series, but can also be found in the correlation of multiple channels. While effective data mining algorithms exist for the detection of anomalous and structurally conspicuous test recordings, these algorithms do not perform any semantic labelling. As a result, data analysts spend many hours connecting the large amounts of automatically extracted observations to their underlying root causes. The complexity, amount, and variety of extracted time series make this task hard not only for humans, but also for existing algorithms: these algorithms either require training data for supervised learning, cannot deal with varying time series lengths, or suffer from exceptionally long runtimes. To facilitate the analysis of anomalies in very large time series, we investigate three types of algorithms in this dissertation: anomaly detection, clustering, and classification. More precisely, we create an overview of the time series anomaly detection research field and point out shortcomings of published benchmarks. Then, we propose a novel and scalable time series anomaly detector that can find anomalies in the correlations of time series channels and reveal in which channels anomalies occur. To distribute the anomaly detection computation, we develop a novel library for building reactive and distributed algorithms. Moreover, we propose a fast and effective clustering technique for time series with varying lengths and introduce a framework for counteracting extremely skewed data partitions during the distributed training of machine learning algorithms.