Hasso-Plattner-InstitutSDG am HPI

Dependency Discovery for Data Integration

Data integration aims to combine data from different sources and to provide users with a unified view of these data. This task is as challenging as it is valuable. In my thesis, I propose algorithms for dependency discovery that provide information for data integration. I focus on inclusion dependencies (INDs) in general and on a special form called conditional inclusion dependencies (CINDs): (i) INDs enable the discovery of structure in a given schema. (ii) INDs and CINDs support the discovery of cross-references or links between schemas. In this talk, I motivate my approaches using the domain of life sciences data sources and give an overview of the contributions of the thesis. Further, I present the SPIDER algorithm for IND discovery in detail. SPIDER analyzes large data sources up to an order of magnitude faster than previous approaches.
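
To make the two notions concrete, the following is a minimal sketch of what a unary IND and a CIND assert. It is a naive set-based check for illustration only and not the SPIDER algorithm presented in the talk. All column names, values, and function names (`unary_inds`, `conditional_ind_holds`) are hypothetical and chosen to resemble a life sciences cross-reference table.

```python
from itertools import permutations


def unary_inds(columns):
    """Return all pairs (dep, ref) where every non-null value of dep occurs in ref.

    Naive pairwise subset test; illustrative only, not SPIDER.
    """
    value_sets = {
        name: {v for v in values if v is not None}
        for name, values in columns.items()
    }
    return sorted(
        (dep, ref)
        for dep, ref in permutations(value_sets, 2)
        if value_sets[dep] <= value_sets[ref]
    )


def conditional_ind_holds(rows, dep_attr, ref_values, condition):
    """Check the inclusion only for the rows that satisfy the condition."""
    return all(row[dep_attr] in ref_values for row in rows if condition(row))


# Hypothetical columns: a protein table and a cross-reference table.
columns = {
    "protein.id": ["P1", "P2", "P3"],
    "xref.protein_id": ["P1", "P3", "P3"],
    "xref.db": ["PDB", "PDB", "GO"],
}
inds = unary_inds(columns)

# Hypothetical cross-references: only rows with db == "PDB" point into PDB.
xref_rows = [
    {"db": "PDB", "accession": "1ABC"},
    {"db": "GO", "accession": "GO:0001"},
]
pdb_ids = {"1ABC", "2XYZ"}

plain_ind = conditional_ind_holds(xref_rows, "accession", pdb_ids, lambda r: True)
cind = conditional_ind_holds(
    xref_rows, "accession", pdb_ids, lambda r: r["db"] == "PDB"
)
```

In this toy data, the inclusion of `xref.accession` in the PDB identifiers fails over all rows but holds once restricted to rows with `db == "PDB"`, which is the kind of link between schemas that a CIND is meant to capture. The pairwise test compares every column with every other column and therefore does not scale to large sources.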