An inclusion dependency between two attribute sets X1,...,Xn and Y1,...,Yn of two relations R and S is defined in the following: For each tuple of values appearing in columns X1,...,Xn of Relation R, there must be an equivalent tuple in columns Y1,...,Yn in relation S. In short this is denoted as R[X1,...,Xn] ⊆ S[Y1,...,Yn].
Inclusion dependencies are an important topic in the research area of data profiling, which aims to extract new metadata for tables. In particular, inclusion dependencies are a prerequisite for foreign keys and thus are a starting point to detect meaningful relationships between tables. Existing work has suggested algorithms for the discovery of inclusion dependencies in databases [1] and webtables [2].
While the discovery of inclusion dependencies has been researched for static databases or webtables, the discovery of inclusion dependencies in historical data remains an open problem. In particular, it should be discussed how the definition of inclusion dependencies for database snapshots (as explained in the first paragraph) should be generalized to table histories and whether the development of the database over time can be used to reduce the search space.
Tasks in this thesis include:
Define temporal inclusion dependencies as candidates for foreign keys between two dynamic tables.
Design of a scalable, distributed algorithm for the discovery of inclusion dependencies. Implementation of the Algorithm in Apache Spark
Creation of a Gold-Standard on webtables
Evaluation on real-world datasets, such as all relational tables in the History of Wikipedia (the dataset is already available)
For more information please contact Prof. Felix Naumann, Tobias Bleifuß or Leon Bornemann.
[1] Papenbrock, Thorsten, et al. "Divide & conquer-based inclusion dependency discovery." Proceedings of the VLDB Endowment 8.7 (2015): 774-785.
[2] Tschirschnitz, Fabian, Thorsten Papenbrock, and Felix Naumann. "Detecting inclusion dependencies on very many tables." ACM Transactions on Database Systems (TODS) 42.3 (2017): 18.
Further Related Work on adapting Dependencies to Temporal Databases:
[3] Jensen, Christian S., Richard T. Snodgrass, and Michael D. Soo. "Extending existing dependency theory to temporal databases." IEEE Transactions on Knowledge and Data Engineering 8.4 (1996): 563-582.
[4] Artale, Alessandro, and Enrico Franconi. "Temporal ER modeling with description logics." International Conference on Conceptual Modeling. Springer, Berlin, Heidelberg, 1999.