Information Systems Group
Hasso Plattner Institute
Office: Prof.-Dr.-Helmert-Straße 2-3 D-14482 Potsdam Room: F-2.05
Tel.: +49 331 5509 274
Supervisor: Prof. Dr. Felix Naumann
We live in an age of technology in which data is the new oil, and much like oil, data must be extracted and refined before it has any practical use. Unlike oil, however, the amount of data generated is enormous and growing exponentially, spurred by surveillance devices generating sensor data, social media platforms, government data portals, medical research projects, and other sources. Unfortunately, data collected from these devices and platforms is often in a raw format, and parsing it without a standardized format introduces many structural inconsistencies, such as invalid characters due to incorrect parsing, column shifts due to incorrect escaping, and inconsistent formatting, all of which cause data manipulation problems. Consequently, data scientists and machine learning engineers spend most of their time on the tedious task of data preparation.
My research aims to provide a data preparation platform that helps end users parse and prepare data correctly by solving the structural problems mentioned above. To this end, I began by examining available data preparation tools and libraries to understand the existing systems and their data preparation features. The survey started with a discovery phase in which we collected more than 100 tools and libraries, which we narrowed down to 42 commercial systems that offer some data preparation capabilities. The study makes several contributions: it proposes broader categories of data preparation, identifies 40 common data preparation tasks, evaluates the state of the art of data preparation tools, and lists the prominent challenges we encountered, which may lead to new research topics.
After examining these tools and libraries and their features, I found that they offer rich functionality, but what interested me most was that they assume pre-processed input files, i.e., files without structural errors. Therefore, I focused my research on improving the pre-processing of raw data files such as txt, csv, and tsv. To improve the structure of a file, we first need to understand the structure of the records in that file. To do so, I developed a system, SURAGH, that finds structural patterns in an input file and classifies rows as clean (well-formed) or problematic (ill-formed) based on a pattern schema.
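To illustrate the general idea of pattern-based row classification, the following is a minimal sketch and not SURAGH's actual algorithm. It assumes a deliberately simple abstraction (runs of letters become `A`, runs of digits become `D`, all other characters are kept) and treats any pattern shared by at least half of the rows as the schema. The function names `abstract_row` and `classify_rows` and the `min_share` threshold are hypothetical and chosen only for this example.

```python
import re
from collections import Counter


def abstract_row(row: str) -> str:
    """Map a raw row to a coarse structural pattern (illustrative abstraction)."""
    pattern = re.sub(r"[A-Za-z]+", "A", row)   # runs of letters -> A
    pattern = re.sub(r"[0-9]+", "D", pattern)  # runs of digits  -> D
    return pattern                             # delimiters and other characters stay as-is


def classify_rows(rows, min_share=0.5):
    """Label each row by whether its pattern belongs to the dominant pattern set.

    min_share is an assumed threshold: a pattern counts as part of the
    schema if at least that fraction of all rows share it.
    """
    patterns = [abstract_row(r) for r in rows]
    counts = Counter(patterns)
    schema = {p for p, c in counts.items() if c / len(rows) >= min_share}
    return [
        (row, "well-formed" if pattern in schema else "ill-formed")
        for row, pattern in zip(rows, patterns)
    ]


rows = [
    "1,Alice,42",
    "2,Bob,37",
    "3,Carol,29",
    "4,Dave,Berlin,51",  # an unescaped value introduces an extra column
]

for row, label in classify_rows(rows):
    print(f"{label:11s} {row}")
```

In this toy input, the first three rows share the pattern `D,A,D`, while the last row abstracts to `D,A,A,D` and is therefore flagged as ill-formed. A real system would need a richer abstraction and a more careful way of selecting the schema than a single frequency threshold.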
I am currently working on a project, TAHARAT, that cleans up the structure of the rows identified as problematic (ill-formed) so that users can load their data efficiently.