Shaabani, Nuhad; Meinel, Christoph
Proceedings of the 20th International Conference on Database Systems for Advanced Applications (DASFAA 2015)
Lecture Notes in Computer Science
Inclusion dependencies within and across databases are important relationships for many applications in anomaly detection, schema (re-)design, query optimization, and data integration. When such dependencies are not available as explicit metadata, scalable and efficient algorithms have to discover them from a given data instance. We introduce a new idea for clustering the attributes of database relations. Based on this idea, we have developed S-indd, an efficient and scalable algorithm for discovering all unary inclusion dependencies in large datasets. S-indd scales in both the number of attributes and the number of rows. We show that previous approaches turn out to be special cases of S-indd. We thoroughly evaluate S-indd's scalability using many datasets with several thousand attributes and up to one million rows. The experiments show that S-indd is up to 11x faster than previous approaches.
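To make the problem concrete: a unary inclusion dependency A ⊆ B holds when every value occurring in attribute A also occurs in attribute B. The sketch below is a naive pairwise baseline written for illustration only; it is not S-indd and does not use its attribute clustering. The function name `unary_inds`, the toy attribute names, and the choice to ignore NULL (`None`) values are assumptions, not details taken from the paper.

```python
from itertools import permutations


def unary_inds(columns):
    """Return all pairs (dep, ref) such that values(dep) is a subset of values(ref).

    `columns` maps an attribute name to an iterable of its values.
    Assumption: None stands for NULL and is ignored.
    """
    # Collapse each attribute to its set of distinct non-null values.
    value_sets = {
        name: {v for v in values if v is not None}
        for name, values in columns.items()
    }
    # Test every ordered pair of distinct attributes: quadratic in the
    # number of attributes, which is the cost scalable algorithms avoid.
    return sorted(
        (dep, ref)
        for dep, ref in permutations(value_sets, 2)
        if value_sets[dep] <= value_sets[ref]
    )


# Hypothetical toy data instance.
columns = {
    "orders.customer_id": [1, 2, 2, 3],
    "customers.id": [1, 2, 3, 4],
    "customers.age": [30, 41, 30, 25],
}
inds = unary_inds(columns)
```

On this toy instance the only dependency found is orders.customer_id ⊆ customers.id, the kind of result that suggests a foreign-key candidate. Checking all attribute pairs this way quickly becomes impractical for thousands of attributes, which is the setting the abstract targets.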