SPIDER - Data Profiling
Many real life databases lack sufficient structural information such as foreign keys. These constraints are often not defined due to performance reasons, lacking knowlegde of the data, or due to dirty data, which do not entirely hold the constraints. Thus, we want to detect foreign keys automatically.
This problem is devidable in two steps: First, find all inclusion dependencies, i.e., attributes A and B such that all values of A are included in all values of B. This definition fits the syntactical and automatically testable part of a foreign key constraint. In the second step, we want to find heuristics to filter foreign keys from inclusion dependencies.
We developed our algorithm SPIDER (Single Pass Inclusion DEpendency Recognition) to detect inclusion dependencies over large schemas. The challenge is the quadratic complexity of the problem in the number of attributes. SPIDER sorts and "distincts" all attributes in the database system. Afterwards, it tests all attribute pairs in parallel while reading all values at most once. We showed that SPIDER clearly outperforms previous approaches for detecting inclusion dependencies exactly.
An extension of SPIDER detects partial inclusion dependencies to handle dirty data and detects composite inclusion dependencies to cover composite keys and foreign keys.
SPIDER was developed in the context of the Aladin project.