Denial constraints (DCs) are the de facto language to specify integrity constraints [1], which have various usages, such as database design, data integration, query optimization, or data cleaning. DCs generalize unique column combinations, functional dependencies, and order dependencies. Each DC defines a set of predicates for which its predicates cannot hold true simultaneously. This expressive power, however, comes with the cost of a very large search space. So, discovering denial constraints (DCs) is computationally expensive. Researchers have developed efficient algorithms to discovery DCs, such as FastDC [2], BFastDC [3], HYDRA [4], or DCFinder [1]. Their application is limited to rather small datasets, though, because executing those algorithms on datasets with around 1 mio. rows and 20 attributes already takes hours [1]. Research for other data dependencies has shown that dynamic parallelization and distribution techniques can decrease the algorithm runtimes enough to make the processing of larger datasets possible [5].
The goal of this master thesis is to develop a distributed algorithm to efficiently discover denial constraints in large datasets.
[1] Eduardo H. M. Pena, Eduardo C. de Almeida, and Felix Naumann. Discovery of Approximate (and Exact) Denial Constraints. PVLDB, 13 (3) (2019). DOI:10.14778/3368289.3368293
[2] Xu Chu, Ihab. F. Ilyas, and Paolo Papotti. Discovering denial constraints. PVLDB, 6 (13) (2013). DOI:10.14778/2536258.2536262
[3] Hai Liu, Dongqing Xiao, Pankaj Didwania, and Mohamed Y. Eltabakh. Exploiting soft and hard correlations in big data query optimization. PVLDB, 9 (12) (2016). DOI:10.14778/2994509.2994519
[4] Tobias Bleifuß, Sebastian Kruse, and Felix Naumann. Efficient denial constraint discovery with Hydra. PVLDB, 11(3) (2017). DOI:10.14778/3157794.3157800
[5] Sebastian Schmidl and Thorsten Papenbrock: Efficient Distributed Discovery of Bidirectional Order Dependencies. The VLDB Journal (2022). DOI:10.1007/s00778-021-00683-4
For more information please contact Youri Kaminsky. Supervision could be in cooperation with Eduardo Pena from Universidade Tecnológica Federal do Paraná (UTFPR) in Toledo, Brasil.