
New book chapter contributed to "Enterprise AI"

Authors

Lisa Ehrlinger and Felix Naumann

Abstract

The predominant shift from model-centric AI to data-centric AI matches the needs of enterprises seeking to use and improve AI applications. Several studies show that, in practical scenarios, data scientists spend a large share of their time searching for and cleaning datasets instead of designing and training machine learning (ML) models. This imbalance is partially caused by the fact that enterprise data is often distributed over different locations, stored in different formats, heterogeneous, and of varying information quality. The form and speed of data ingestion also influence the types of ML models required for analysis. To ensure high-quality results from ML models, one must first ensure that the data is fit for the intended purpose, i.e., that it is of high quality relative to the task at hand.

After introducing general concepts of data quality, this book chapter reviews data quality for enterprise AI from two perspectives: (1) the data scientist perspective, covering concrete methods for data preprocessing to check and improve the quality of a specific dataset, and (2) the management perspective, covering the implementation of a continuous data quality assurance pipeline within an enterprise that accounts for long-term monitoring and root cause analysis. Although the focus is on the first perspective, the second perspective is essential for a comprehensive AI strategy within a company. We conclude this chapter with an outlook on open challenges in the area of data quality for AI for both researchers and practitioners.