21.06.2017 - Thorsten Papenbrock (external)
Data Profiling – Efficient Discovery of Dependencies
Data profiling is the computer science discipline of analyzing a given dataset for its metadata. The most important types of metadata are arguably inclusion dependencies (INDs), unique column combinations (UCCs), and functional dependencies (FDs). If present, these dependencies help to store, query, change, and understand the data efficiently. Most datasets, however, do not provide their metadata explicitly, so data scientists need to profile them.
In this talk, we discuss a novel, hybrid profiling algorithm for the automatic discovery of functional dependencies in relational instances. FDs are structural metadata that can be used for schema normalization, data integration, data cleansing, and many other data management tasks. Due to the importance of FDs, database research has proposed various algorithms for their discovery. None of these algorithms is, however, able to process datasets of typical real-world size, e.g., datasets with more than 50 attributes and a million records.
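To make the notion of a functional dependency concrete, the following minimal sketch checks whether a set of columns determines another column in a small relation. The function name `fd_holds`, the tuple-based row layout, and the example data are assumptions made for illustration; they are not taken from the talk or from any particular profiling tool.

```python
from typing import Hashable, Sequence


def fd_holds(rows: Sequence[Sequence[Hashable]], lhs: Sequence[int], rhs: int) -> bool:
    """Return True if the columns in `lhs` functionally determine column `rhs`.

    An FD lhs -> rhs holds if any two rows that agree on all lhs columns
    also agree on the rhs column.
    """
    seen = {}  # maps each lhs value combination to the first rhs value observed
    for row in rows:
        key = tuple(row[i] for i in lhs)
        value = row[rhs]
        # setdefault stores the value on first sight and returns the stored one afterwards
        if seen.setdefault(key, value) != value:
            return False  # same lhs values, different rhs value: the FD is violated
    return True


# Hypothetical example relation with columns (zip, city, name)
rows = [
    ("14482", "Potsdam", "Alice"),
    ("14482", "Potsdam", "Bob"),
    ("10115", "Berlin", "Alice"),
]

zip_determines_city = fd_holds(rows, [0], 1)
name_determines_zip = fd_holds(rows, [2], 0)
```

In this example, zip determines city, whereas name does not determine zip because "Alice" appears with two different zip codes. Checking a single candidate is a linear scan; the difficulty of FD discovery lies in the number of candidate column combinations, which grows exponentially with the number of attributes.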
Our algorithm HyFD combines fast approximation and sophisticated validation techniques to efficiently discover all minimal FDs in relational datasets. The hybrid approach not only outperforms all existing discovery algorithms but also scales to much larger datasets. HyFD and further metadata discovery algorithms have been implemented for the Metanome data profiling platform, which is the overall contribution of my PhD thesis.
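The general idea of combining approximation with validation can be illustrated with a deliberately simplified sketch. It is not HyFD itself: it only considers FDs with a single column on the left-hand side, uses a plain prefix of the data as its sample, and does not search for minimal FDs over column combinations. All function names and the example data are hypothetical.

```python
from itertools import combinations, permutations


def refuted_by_sample(sample, num_cols):
    """Approximation step: compare record pairs in a small sample.

    A pair that agrees on column a but differs on column b refutes a -> b.
    """
    refuted = set()
    for r1, r2 in combinations(sample, 2):
        for a, b in permutations(range(num_cols), 2):
            if r1[a] == r2[a] and r1[b] != r2[b]:
                refuted.add((a, b))
    return refuted


def validate(rows, a, b):
    """Validation step: check the candidate a -> b against all records."""
    seen = {}
    for row in rows:
        if seen.setdefault(row[a], row[b]) != row[b]:
            return False
    return True


def discover_unary_fds(rows, sample_size=3):
    """Return all valid single-column FDs (a, b) of a non-empty relation."""
    num_cols = len(rows[0])
    sample = rows[:sample_size]  # assumption: a simple prefix sample
    refuted = refuted_by_sample(sample, num_cols)
    # Only candidates that survived the cheap sample check need full validation
    candidates = [c for c in permutations(range(num_cols), 2) if c not in refuted]
    return sorted(c for c in candidates if validate(rows, *c))


# Hypothetical example relation with columns (zip, city, name)
rows = [
    ("14482", "Potsdam", "Alice"),
    ("14482", "Potsdam", "Bob"),
    ("10115", "Berlin", "Alice"),
    ("10117", "Berlin", "Carol"),
]

fds = discover_unary_fds(rows)
```

Here the sample of three records already refutes four of the six candidates, so only two require a scan of the full relation. One of them (city determines zip) holds on the sample but fails on the complete data, which is why the validation step cannot be skipped.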