Master's Project: The Early Bird - Upstream Change Detection for ML Pipelines

General Information

Teaching staff: Lisa Ehrlinger, Sedir Mohammed, Prof. Felix Naumann
Master’s programs: ITSE, DE, SSE, CS
Weekly Meetings: Tuesday, 15.00-16.00, F-2.11 (Campus II)
Content:
- Group work
- Programming project
- Intermediate and final presentation
- Research report
Extent: 8 SWS / 12 ECTS
Project room: F-1.09 (Campus II)

The Impact of Change in Data on AI

High-quality data is the basis for successful artificial intelligence (AI) systems. One of the most important aspects here is data drift, which describes changes in data over time that often degrade the performance of machine learning (ML) models. So far, data drift is typically investigated only after a model has been applied, to determine whether retraining is necessary due to poor prediction accuracy (Gama et al. 2014). We believe that investigating and detecting changes before the actual prediction task is useful. One example is analytics platforms that offer their data for different downstream tasks. In this scenario, the ML model used is not known a priori. However, annotated information about existing change types in datasets could help ML pipelines decide whether the data is suitable for a specific model.

Change in data can appear in different forms. For example, the measurement unit of a weather sensor might switch from Celsius to Fahrenheit after a firmware update, causing a sudden shift in the data. Other examples include a barely noticeable drift that gradually changes the distribution of a variable, or sudden peaks that occur periodically during specific intervals. Gama et al. (2014) categorize these different types of change into five patterns, as shown in Figure 1: (1) sudden/abrupt, (2) incremental, (3) gradual, (4) recurring concepts, and (5) outliers.

Figure 1: Different patterns of change over time: (1) sudden/abrupt, (2) incremental, (3) gradual, (4) recurring concepts, and (5) outliers.
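To make some of these patterns concrete, the following minimal sketch generates synthetic series for four of them (recurring concepts are omitted for brevity). It is only an illustration and not the data that will be provided in the project: all function names, the baseline of 20 degrees, the noise level, and the change positions are assumptions chosen for this example.

```python
import random


def _progress(i, start, end):
    """Fraction of the transition completed at position i, clipped to [0, 1]."""
    return min(max((i - start) / (end - start), 0.0), 1.0)


def sudden_shift(n=200, change_at=100, seed=0):
    """Sensor readings whose unit switches from Celsius to Fahrenheit."""
    rng = random.Random(seed)
    celsius = [20.0 + rng.gauss(0.0, 1.0) for _ in range(n)]
    # From change_at onwards, the same readings are reported in Fahrenheit.
    return [c if i < change_at else c * 9.0 / 5.0 + 32.0
            for i, c in enumerate(celsius)]


def incremental_drift(n=200, start=50, end=150, delta=5.0, seed=0):
    """The mean moves through intermediate values from 20 to 20 + delta."""
    rng = random.Random(seed)
    return [20.0 + delta * _progress(i, start, end) + rng.gauss(0.0, 1.0)
            for i in range(n)]


def gradual_drift(n=200, start=50, end=150, delta=5.0, seed=0):
    """Old and new mean alternate; the new one becomes increasingly likely."""
    rng = random.Random(seed)
    series = []
    for i in range(n):
        use_new = rng.random() < _progress(i, start, end)
        mean = 20.0 + delta if use_new else 20.0
        series.append(mean + rng.gauss(0.0, 1.0))
    return series


def with_outliers(n=200, positions=(40, 120), magnitude=15.0, seed=0):
    """A stable series with isolated spikes at the given positions."""
    rng = random.Random(seed)
    series = [20.0 + rng.gauss(0.0, 1.0) for _ in range(n)]
    for p in positions:
        series[p] += magnitude
    return series
```

Each function returns a plain list of floats, so the series can be plotted or passed to a detector directly. Fixing the seed keeps the examples reproducible.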

Project Goals

In this project, we will detect changes in the data before an ML model is trained or tested on the data. Following a data-driven approach, we want to (1) detect common change patterns in the data, and (2) annotate the data sets with information about the respective change types, e.g., using histograms that highlight changes. To achieve this goal, we will perform the following tasks:

Model change patterns: In the first task, we will read related work about change patterns and become familiar with their characteristics. We will provide synthetic as well as real-world data with labeled change patterns for investigation.
Detection of change patterns: The second task is to design and implement methods to detect the different change patterns in the data sets provided.
Annotate change information: Bleifuß et al. (2018) already explored the dimensions of data changes, including where, when, and how they occur. The third task is to extend their change cube model so that a data set can be annotated with information about the respective change pattern.
Conduct evaluation: Finally, the proposed methods to detect each change pattern should be evaluated.
Prepare a submission to a top database conference.
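As a rough idea of what the detection and annotation tasks could look like for the simplest pattern, the sketch below finds a sudden shift by comparing the means of two adjacent windows and then attaches before/after histograms to the result. It is a deliberately naive baseline under stated assumptions, not the method to be developed in the project: the function names, the window size, the threshold, and the dictionary used as annotation are all hypothetical, and the annotation format is not the change cube model.

```python
import statistics


def detect_sudden_change(series, window=30, threshold=5.0):
    """Return the index with the strongest mean shift between two adjacent
    windows, or None if no shift reaches `threshold` pooled std deviations."""
    best_index, best_score = None, 0.0
    for i in range(window, len(series) - window + 1):
        before = series[i - window:i]
        after = series[i:i + window]
        pooled_var = (statistics.pvariance(before)
                      + statistics.pvariance(after)) / 2.0
        pooled_std = max(pooled_var ** 0.5, 1e-9)  # guard against zero spread
        score = abs(statistics.fmean(after)
                    - statistics.fmean(before)) / pooled_std
        if score > best_score:
            best_index, best_score = i, score
    return best_index if best_score >= threshold else None


def histogram(values, lo, hi, bins=8):
    """Count values in `bins` equal-width buckets spanning [lo, hi]."""
    counts = [0] * bins
    width = (hi - lo) / bins
    for v in values:
        # The maximum value falls into the last bucket.
        counts[min(int((v - lo) / width), bins - 1)] += 1
    return counts


def annotate(series, window=30, threshold=5.0):
    """Describe a detected sudden change with before/after histograms."""
    index = detect_sudden_change(series, window, threshold)
    if index is None:
        return {"pattern": None}
    lo, hi = min(series), max(series)
    return {
        "pattern": "sudden",
        "index": index,
        "hist_before": histogram(series[:index], lo, hi),
        "hist_after": histogram(series[index:], lo, hi),
    }
```

Calling `annotate(series)` on a list of floats returns either `{"pattern": None}` or a dictionary with the change position and the two histograms. Both histograms share the same bucket boundaries, so a shift in the distribution shows up as mass moving between buckets. The other patterns (incremental, gradual, recurring, outliers) need different detectors, which is the subject of the second task.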

Prerequisites

Programming experience in Python is required for this project.

Initial Related Work

Gama, J., Žliobaitė, I., Bifet, A., Pechenizkiy, M., & Bouchachia, A. (2014). A survey on concept drift adaptation. ACM Computing Surveys (CSUR), 46(4), 1-37. https://doi.org/10.1145/2523813
Bleifuß, T., Bornemann, L., Johnson, T., Kalashnikov, D. V., Naumann, F., & Srivastava, D. (2018). Exploring change: A new dimension of data analytics. Proceedings of the VLDB Endowment, 12(2), 85-98.

Contact

This project will be supervised by Dr. Lisa Ehrlinger, Sedir Mohammed and Prof. Felix Naumann from the Information Systems group. If you have any questions, please do not hesitate to contact us. You are welcome to visit us in the F building, second floor (on Campus II).