Data preparation is the process of transforming data before it is passed to downstream tasks such as data analytics, data cleaning, and machine learning. Raw data often does not meet the requirements of these tasks, so users, from expert data scientists to novice data users, frequently carry out ad-hoc data preparation. Data preparation is reported to be labour-intensive and tedious, accounting for 50% to 80% of the time spent in the whole data lifecycle.
We explore building a data preparation framework with two goals:
- Enable users to rapidly prepare data
- Make scientific experiments repeatable by deriving a suitable data preparation specification
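
To make the second goal concrete, the following is a minimal sketch of what a replayable data preparation specification might look like. It is not the framework's actual design: the operation names (`strip`, `lower`, `fill_missing`), the step format, and the `apply_spec` helper are all hypothetical and chosen only for illustration.

```python
import json

# Hypothetical registry of named preparation operations (illustrative only).
OPERATIONS = {
    "strip": lambda value: value.strip(),
    "lower": lambda value: value.lower(),
    "fill_missing": lambda value, default="": default if value == "" else value,
}


def apply_spec(rows, spec):
    """Replay each step of the specification, in order, on a copy of the rows."""
    prepared = [dict(row) for row in rows]  # leave the raw input untouched
    for step in spec:
        operation = OPERATIONS[step["op"]]
        params = step.get("params", {})
        for row in prepared:
            row[step["column"]] = operation(row[step["column"]], **params)
    return prepared


# The specification is plain data, so it can be stored alongside an experiment.
spec = [
    {"op": "strip", "column": "city"},
    {"op": "lower", "column": "city"},
    {"op": "fill_missing", "column": "city", "params": {"default": "unknown"}},
]

raw_rows = [{"city": "  Paris "}, {"city": ""}]

# Round-trip through JSON to mimic saving and reloading the specification.
restored_spec = json.loads(json.dumps(spec))
prepared_rows = apply_spec(raw_rows, restored_spec)
```

Because the specification is ordinary data rather than ad-hoc code, replaying it on the same raw input should yield the same prepared output, which is the property repeatable experiments rely on.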