Fast Approximate Discovery of Inclusion Dependencies (bibtex)

Reference:

, "Fast Approximate Discovery of Inclusion Dependencies", in Bernhard Mitschang, Daniela Nicklas, Frank Leymann, Harald Schoning, Melanie Herschel, Jens Teubner, Theo Harder, Oliver Kopp, Matthias Wieland, Eds., Datenbanksysteme fur Business, Technologie und Web (BTW 2017), 17. Fachtagung des GI-Fachbereichs ,,Datenbanken und Informationssysteme (DBIS), 6.-10. Marz 2017, Stuttgart, Germany, Proceedings, vol. P-265 of LNI, pp. 207–226, GI, 2017.

Abstract:

Inclusion dependencies (INDs) are relevant to several data management tasks, such as foreign key detection and data integration, and their discovery is a core concern of data profiling. However, n-ary IND discovery is computationally expensive, so that existing algorithms often perform poorly on complex datasets. To this end, we present FAIDA, the first approximate IND discovery algorithm. FAIDA combines probabilistic and exact data structures to approximate the INDs in relational datasets. In fact, FAIDA guarantees to find all INDs and only with a low probability false positives might occur due to the approximation. This little inaccuracy comes in favor of significantly increased performance, though. In our evaluation, we show that FAIDA scales to very large datasets and outperforms the state-of-the-art algorithm by a factor of up to six in terms of runtime without reporting any false positives. This shows that F strikes a good balance between efficiency and correctness.

Links:

[bibtex] [url]

@InProceedings{KPDFHZZN17FastApproximate,
	AUTHOR = {Kruse, Sebastian and Papenbrock, Thorsten and Dullweber, Christian and Finke, Moritz and Hegner, Manuel and Zabel, Martin and Zöllner, Christian and Naumann, Felix},
	TITLE = {{Fast Approximate Discovery of Inclusion Dependencies}},
	YEAR = {2017},
	BOOKTITLE = {Datenbanksysteme fur Business, Technologie und Web (BTW                2017), 17. Fachtagung des GI-Fachbereichs ,,Datenbanken und Informationssysteme (DBIS), 6.-10. Marz 2017, Stuttgart, Germany, Proceedings},
	VOLUME = {P-265},
	PAGES = {207--226},
	EDITOR = {Mitschang, Bernhard and Nicklas, Daniela and Leymann, Frank and Schoning, Harald and Herschel, Melanie and Teubner, Jens and Harder, Theo and Kopp, Oliver and Wieland, Matthias},
	SERIES = {LNI},
	PUBLISHER = {GI},
	URL = {https://dl.gi.de/20.500.12116/629},
	ABSTRACT = {Inclusion dependencies (INDs) are relevant to several data management tasks, such as foreign key detection and data integration, and their discovery is a core concern of data profiling. However, n-ary IND discovery is computationally expensive, so that existing algorithms often perform poorly on complex datasets. To this end, we present FAIDA, the first approximate IND discovery algorithm. FAIDA combines probabilistic and exact data structures to approximate the INDs in relational datasets. In fact, FAIDA guarantees to find all INDs and only with a low probability false positives might occur due to the approximation. This little inaccuracy comes in favor of significantly increased performance, though. In our evaluation, we show that FAIDA scales to very large datasets and outperforms the state-of-the-art algorithm by a factor of up to six in terms of runtime without reporting any false positives. This shows that F strikes a good balance between efficiency and correctness.}
}