For bachelor students, we offer lectures in German on database systems, along with paper- or project-oriented seminars. In a one-year bachelor project, students finalize their studies in cooperation with external partners. For master students, we offer courses on information integration, data profiling, search engines, and information retrieval, complemented by specialized seminars, master projects, and supervised master theses.
Most of our research is conducted in the context of larger research projects, in collaboration across students, across groups, and across universities. We strive to make available most of our data sets and source code.
Data preparation is the process of transforming data before passing it to downstream tasks, such as data analytics, data cleaning, and machine learning. Much data does not meet the requirements of these tasks, so users, both expert data scientists and novice data users, frequently resort to ad-hoc data preparation. Preparing data is reported to be labour-intensive and tedious work that accounts for 50%-80% of the time spent in the whole data lifecycle.
We are exploring how to build a data preparation framework that achieves two goals:
Enable users to rapidly prepare data
Enable repeatability of scientific experiments by deriving suitable data preparation specifications
We propose one taxonomy for metadata and one for preparators. We use the defined metadata and preparators to create standard specifications of data transformations. The complete taxonomies can be found here.
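To illustrate how metadata and preparators could be combined into a specification that supports repeatability, here is a minimal sketch. It is not the project's actual format: the class names `PreparatorStep` and `PreparationSpec`, the preparator names, and the metadata keys are all hypothetical and chosen only for illustration.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Any, Dict, List


@dataclass
class PreparatorStep:
    """One transformation step (hypothetical structure)."""
    preparator: str                      # name of the preparator, e.g. "trim_whitespace"
    column: str                          # column the preparator is applied to
    parameters: Dict[str, Any] = field(default_factory=dict)


@dataclass
class PreparationSpec:
    """File-level metadata plus an ordered list of preparator steps."""
    metadata: Dict[str, Any]
    steps: List[PreparatorStep]

    def to_json(self) -> str:
        # asdict() recursively converts the nested dataclasses to plain dicts.
        return json.dumps(asdict(self), indent=2, sort_keys=True)

    @classmethod
    def from_json(cls, text: str) -> "PreparationSpec":
        raw = json.loads(text)
        steps = [PreparatorStep(**step) for step in raw["steps"]]
        return cls(metadata=raw["metadata"], steps=steps)


# Assumed example values; none of these come from the project's taxonomies.
spec = PreparationSpec(
    metadata={"delimiter": ",", "encoding": "utf-8"},
    steps=[
        PreparatorStep("trim_whitespace", "name"),
        PreparatorStep("fill_missing", "age", {"value": 0}),
    ],
)

# Serializing and reloading yields an equal specification, so the same
# preparation can be replayed later on the same input.
restored = PreparationSpec.from_json(spec.to_json())
```

Storing the steps as plain data rather than as code is one possible design choice: the specification can be archived next to an experiment and re-executed by any engine that understands the preparator names.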
Self-service data preparation enables end users to prepare data by themselves. However, the plethora of possible data preparation steps can overwhelm the user. We introduce a score-based preparator ranking approach to propose preparator candidates in a context-specific manner. To this end, we give scoring functions for a selected set of preparators and outline future work towards a full-fledged data preparation system.
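The idea of score-based preparator ranking can be sketched as follows. This is only an illustration under stated assumptions, not the scoring functions of the project: the three preparators and their scoring heuristics (fraction of affected values in a column) are hypothetical.

```python
from typing import Callable, Dict, List, Optional, Tuple

Column = List[Optional[str]]


def score_fill_missing(column: Column) -> float:
    """Assumed heuristic: fraction of missing (None or empty) values."""
    if not column:
        return 0.0
    missing = sum(1 for v in column if v is None or v == "")
    return missing / len(column)


def score_trim_whitespace(column: Column) -> float:
    """Assumed heuristic: fraction of values with leading or trailing whitespace."""
    if not column:
        return 0.0
    padded = sum(1 for v in column if v is not None and v != v.strip())
    return padded / len(column)


def score_deduplicate(column: Column) -> float:
    """Assumed heuristic: fraction of values that are redundant duplicates."""
    if not column:
        return 0.0
    return 1.0 - len(set(column)) / len(column)


# Hypothetical registry mapping preparator names to their scoring functions.
SCORERS: Dict[str, Callable[[Column], float]] = {
    "fill_missing": score_fill_missing,
    "trim_whitespace": score_trim_whitespace,
    "deduplicate": score_deduplicate,
}


def rank_preparators(
    column: Column,
    scorers: Dict[str, Callable[[Column], float]] = SCORERS,
) -> List[Tuple[str, float]]:
    """Score every preparator on the column and sort by descending score."""
    scored = [(name, scorer(column)) for name, scorer in scorers.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


column = [" a", " b", " c", None, None, "d"]
ranking = rank_preparators(column)
```

For this example column, three of six values carry leading whitespace, two are missing, and one is a redundant duplicate, so `trim_whitespace` is suggested first, followed by `fill_missing` and `deduplicate`. Because each score depends on the column's content, the suggestions change with the data at hand, which is what makes the ranking context-specific.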
We have compiled the related work for the individual research topics in data preparation. The full list can be found here.
For further information on this project please contact Lan Jiang.