Hasso-Plattner-Institut
Prof. Dr. Felix Naumann
 

SPIDER - Data Profiling

Many real life databases lack sufficient structural information such as foreign keys. These constraints are often not defined due to performance reasons, lacking knowlegde of the data, or due to dirty data, which do not entirely hold the constraints. Thus, we want to detect foreign keys automatically.

This problem is devidable in two steps: First, find all inclusion dependencies, i.e., attributes A and B such that all values of A are included in all values of B. This definition fits the syntactical and automatically testable part of a foreign key constraint. In the second step, we want to find heuristics to filter foreign keys from inclusion dependencies.

We developed our algorithm SPIDER (Single Pass Inclusion DEpendency Recognition) to detect inclusion dependencies over large schemas. The challenge is the quadratic complexity of the problem in the number of attributes. SPIDER sorts and "distincts" all attributes in the database system. Afterwards, it tests all attribute pairs in parallel while reading all values at most once. We showed that SPIDER clearly outperforms previous approaches for detecting inclusion dependencies exactly.

An extension of SPIDER detects partial inclusion dependencies to handle dirty data and detects composite inclusion dependencies to cover composite keys and foreign keys. 

SPIDER was developed in the context of the Aladin project.

publications

  • Jana Bauckmann, Ulf Leser, Felix Naumann, Véronique Tietz: Efficiently Detecting Inclusion Dependencies. International Conference on Data Engineering (ICDE 2007), Istanbul, Turkey, (poster paper, to appear).
  • Jana Bauckmann, Ulf Leser, Felix Naumann, Joachim Schmid: Data Profiling: Effiziente Fremdschlüsselerkennung mit Aladin. German Information Quality Conference & Workshop, Bad Soden, November 2006.
  • Jana Bauckmann: Efficiently Identifying Inclusion Dependencies in RDBMS. 18. Workshop über Grundlagen von Datenbanken (GI-Workshop), Wittenberg, Juni 2006.
  • Jana Bauckmann, Ulf Leser, Felix Naumann: Efficiently Computing Inclusion Dependencies for Schema Discovery. Workshop InterDB (with ICDE06), Atlanta, April 2006.