High-quality data is the basis for successful artificial intelligence (AI) systems. One of the most important aspects of data quality is data drift, which describes changes in data over time that often degrade the performance of machine learning (ML) models. So far, data drift has typically been investigated after a model has been applied, in order to determine whether retraining is necessary due to poor prediction accuracy (Gama et al. 2014). We believe that investigating and detecting changes before the actual prediction task is useful. One example is analytics platforms that offer their data for different downstream tasks. In this scenario, the ML model used is not known a priori. However, annotated information about the change types present in a dataset could help ML pipelines decide whether the data is suitable for a specific model.
Change in data can take different forms. For example, the measurement unit of a weather sensor might switch from Celsius to Fahrenheit after a firmware update, causing a sudden shift in the data. Other examples include a barely noticeable drift that gradually changes the distribution of a variable, or sudden peaks that occur periodically during specific intervals. Gama et al. (2014) categorize these different types of change into five patterns, as shown in Figure 1: (1) sudden/abrupt, (2) incremental, (3) gradual, (4) recurring concepts, and (5) outliers.
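To make the sudden/abrupt pattern concrete, the weather-sensor example can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names (`simulate_sensor`, `largest_mean_shift`), the baseline of 20 °C with unit Gaussian noise, and the window size are hypothetical choices, and the sliding-window mean comparison is a deliberately naive check, not a detection method taken from Gama et al. (2014).

```python
import random
import statistics


def celsius_to_fahrenheit(celsius):
    """Standard unit conversion: F = C * 9/5 + 32."""
    return celsius * 9.0 / 5.0 + 32.0


def simulate_sensor(n=200, change_at=100, seed=0):
    """Simulate temperature readings whose unit switches at `change_at`.

    Assumption for illustration: the true temperature is about 20 degrees
    Celsius with Gaussian noise; from index `change_at` onward the sensor
    reports the same quantity in Fahrenheit.
    """
    rng = random.Random(seed)
    readings = []
    for t in range(n):
        celsius = 20.0 + rng.gauss(0.0, 1.0)
        if t < change_at:
            readings.append(celsius)
        else:
            readings.append(celsius_to_fahrenheit(celsius))
    return readings


def largest_mean_shift(series, window=20):
    """Return (index, gap) where the means of the two adjacent windows
    before and after `index` differ the most in absolute value."""
    best_index, best_gap = None, 0.0
    for i in range(window, len(series) - window + 1):
        before = statistics.fmean(series[i - window:i])
        after = statistics.fmean(series[i:i + window])
        gap = abs(after - before)
        if gap > best_gap:
            best_index, best_gap = i, gap
    return best_index, best_gap


if __name__ == "__main__":
    index, gap = largest_mean_shift(simulate_sensor())
    print(f"largest mean shift at index {index}, gap {gap:.1f}")
```

A check of this kind only reacts to a large jump in the mean. The incremental, gradual, and recurring patterns change the data more subtly and would likely need different statistics.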