Profiling data to determine metadata about a given dataset is a frequent and important task for IT professionals and researchers. It encompasses a vast array of methods to examine datasets and produce metadata. Among the simpler results are statistics, such as the number of null values, a column’s data type, or the most frequent patterns within a column. Metadata that are more difficult to compute involve multiple columns, such as correlations or data dependencies. Research in data dependency discovery has focused on developing efficient algorithms for individual dependency types, such as unique column combinations (UCCs), functional dependencies (FDs), order dependencies (ODs), or inclusion dependencies (INDs). However, many downstream tasks, such as data exploration and query optimization, need information about several types of dependencies at the same time, which requires executing multiple discovery algorithms on a given input dataset.
The goal of this master's thesis is to develop a holistic algorithm that discovers UCCs, FDs, ODs, and INDs simultaneously. A holistic approach for these four types can optimize execution orders, share intermediate results for additional search space pruning, and re-use temporary data structures. This work can build on the entire research history of our chair in data profiling, including individual discovery algorithms for all mentioned data dependencies and corresponding datasets. The work in [1] already shows that a holistic approach is feasible, but this thesis should extend the idea to more dependency types and enable the processing of larger datasets.
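To illustrate how shared intermediate results can prune the search space, consider one well-known relationship: if a column combination is unique, it functionally determines every other column, so the corresponding FD checks can be skipped. The following is a minimal sketch of that idea only, not the algorithm from [1] or the one to be developed in the thesis. The function names (`is_unique`, `holds_fd`, `profile`), the row-as-dictionary layout, and the sample data are assumptions made for illustration, and the sketch reports all valid dependencies rather than only minimal ones.

```python
from itertools import combinations


def is_unique(rows, cols):
    """Return True if no two rows share the same values in cols (a UCC)."""
    seen = set()
    for row in rows:
        key = tuple(row[c] for c in cols)
        if key in seen:
            return False
        seen.add(key)
    return True


def holds_fd(rows, lhs, rhs):
    """Return True if the functional dependency lhs -> rhs holds on rows."""
    mapping = {}
    for row in rows:
        key = tuple(row[c] for c in lhs)
        # setdefault stores the first rhs value seen for this lhs value
        if mapping.setdefault(key, row[rhs]) != row[rhs]:
            return False
    return True


def profile(rows, columns, max_size=2):
    """Naively discover UCCs and FDs together (hypothetical helper).

    Whenever lhs is unique, every FD lhs -> rhs is implied, so the FD
    validation is skipped. Results are not reduced to minimal dependencies.
    """
    uccs, fds, skipped_checks = [], [], 0
    for size in range(1, max_size + 1):
        for lhs in combinations(columns, size):
            unique = is_unique(rows, lhs)
            if unique:
                uccs.append(lhs)
            for rhs in columns:
                if rhs in lhs:
                    continue
                if unique:
                    fds.append((lhs, rhs))  # implied by the UCC
                    skipped_checks += 1
                elif holds_fd(rows, lhs, rhs):
                    fds.append((lhs, rhs))
    return uccs, fds, skipped_checks


# Illustrative sample data (assumption, not from any referenced dataset)
rows = [
    {"id": 1, "zip": "14482", "city": "Potsdam"},
    {"id": 2, "zip": "14482", "city": "Potsdam"},
    {"id": 3, "zip": "10115", "city": "Berlin"},
]
uccs, fds, skipped_checks = profile(rows, ["id", "zip", "city"], max_size=1)
```

On this sample, the uniqueness of `id` makes two FD validations unnecessary. A real holistic algorithm would exploit many more such relationships and would work on compact data structures rather than on raw rows.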
Because the discovery of UCCs, FDs, ODs, and INDs entails a search space that is exponential in the number of columns, the time and space complexity of discovery algorithms is exponential as well. This work must therefore address the scalability of the algorithm so that it can process larger datasets. In comparison to [1], we will focus on the parallelization, caching, and distribution aspects of the algorithm.
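The exponential growth mentioned above can be made concrete by counting the non-empty column combinations of a table, which are the candidates for UCCs: a table with n columns has 2^n - 1 of them. The short sketch below uses hypothetical helper names (`candidate_count`, `enumerate_candidates`) and only enumerates candidates; it does not validate them.

```python
from itertools import combinations


def candidate_count(num_columns):
    """Number of non-empty column combinations for num_columns columns."""
    return 2 ** num_columns - 1


def enumerate_candidates(columns):
    """Yield all non-empty column combinations, smallest first."""
    for size in range(1, len(columns) + 1):
        yield from combinations(columns, size)


# Three columns yield 7 candidates; the count doubles with each added column.
small = list(enumerate_candidates(["A", "B", "C"]))
growth = {n: candidate_count(n) for n in (10, 20, 30)}
```

Each additional column roughly doubles the number of candidates, which is why pruning, parallelization, and distribution matter for wider tables.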
[1]: Ehrlich, Roick, Schulze, Zwiener, Papenbrock, Naumann. Holistic Data Profiling: Simultaneous Discovery of Various Metadata. EDBT. 2016. https://openproceedings.org/2016/conf/edbt/paper-20.pdf
For more information, please contact Youri Kaminsky.