In this master project, we will follow an entire research cycle from problem inception and literature research to algorithm development and, finally, to evaluation. Together, we will prepare a research article and submit it to an international conference.
Our goal is to develop an algorithm that leverages different types of integrity constraints to detect correspondences between the elements of two or more relational schemas. We start the project with a literature search. Afterwards, we decide on and implement (or reuse) baseline approaches to schema matching. We will then design and develop our novel COSMA algorithm and evaluate it against these baselines. We will consider both the quality and the efficiency of the approaches and conduct additional experiments that show the strengths and weaknesses of the developed algorithm.
The basic idea for our COSMA approach consists of the following steps (but of course we are also open to other ideas):
- Discovery of constraints within the individual databases using the data profiling tool Metanome.
- Building a hypergraph per database in which every node corresponds to an attribute and multiple attributes are connected by a hyperedge if they co-occur in a constraint. Additionally, each hyperedge is labeled with the type of its constraint.
- Matching the hypergraphs, either with an existing algorithm or with a newly developed solution.
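The hypergraph construction in the second step can be sketched as follows. This is a minimal illustration, not the COSMA algorithm itself: the constraint input format (a label plus a set of attribute names), the `Hyperedge` type, and the functions `build_hypergraph` and `node_signature` are hypothetical names chosen for this example, and the constraints are written by hand rather than read from Metanome output.

```python
from typing import FrozenSet, Iterable, NamedTuple, Set, Tuple


class Hyperedge(NamedTuple):
    """A labeled hyperedge: the constraint type plus the attributes it covers."""
    label: str
    attributes: FrozenSet[str]


def build_hypergraph(
    attributes: Iterable[str],
    constraints: Iterable[Tuple[str, Iterable[str]]],
) -> Tuple[Set[str], Set[Hyperedge]]:
    """Build one hypergraph for one database.

    Every attribute becomes a node; every constraint becomes a hyperedge that
    connects the attributes occurring in it and carries the constraint type.
    """
    nodes = set(attributes)
    edges: Set[Hyperedge] = set()
    for label, attrs in constraints:
        attr_set = frozenset(attrs)
        unknown = attr_set - nodes
        if unknown:
            raise ValueError(f"constraint refers to unknown attributes: {sorted(unknown)}")
        edges.add(Hyperedge(label, attr_set))
    return nodes, edges


def node_signature(node: str, edges: Iterable[Hyperedge]) -> Tuple[Tuple[str, int], ...]:
    """Name-independent description of a node: the sorted (label, size) pairs
    of all hyperedges the node participates in."""
    return tuple(sorted((e.label, len(e.attributes)) for e in edges if node in e.attributes))


# Hand-written example constraints (assumed input, not actual Metanome results).
nodes_a, edges_a = build_hypergraph(
    ["id", "zip", "city"],
    [("UCC", ["id"]), ("FD", ["zip", "city"])],
)
nodes_b, edges_b = build_hypergraph(
    ["A1", "A2", "A3"],
    [("UCC", ["A1"]), ("FD", ["A2", "A3"])],
)

for node in sorted(nodes_a):
    print(node, node_signature(node, edges_a))
for node in sorted(nodes_b):
    print(node, node_signature(node, edges_b))
```

In this toy example, `id` and `A1` receive the same signature although their names share nothing, which hints at how constraint structure could guide a matcher. The signature is only a simple starting point: it cannot tell `zip` and `city` apart, since an undirected hyperedge does not record the direction of a functional dependency.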
We start with a simple scenario, where the schemas to be compared consist of the exact same set of tables and attributes and we thus focus on the detection of one-to-one correspondences between attributes. However, even such a scenario can be quite challenging if the attribute names are cryptic (e.g., "A1" or "xyz") or missing. Moreover, if the values of the matching attributes are encoded using different units of measurement (e.g., metric vs. imperial system) or vocabularies (e.g., {'bachelor', 'master', 'phd'} vs. {0, 1, 2}), they cannot be directly compared using traditional instance-based schema matchers.
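To illustrate the vocabulary problem, the following sketch compares two columns that encode the same information with different vocabularies. It assumes a very simple instance-based similarity, the Jaccard overlap of the distinct values; real instance-based matchers are more elaborate, and the column names `degree_a` and `degree_b` are hypothetical.

```python
from typing import Hashable, Iterable


def jaccard(a: Iterable[Hashable], b: Iterable[Hashable]) -> float:
    """Jaccard similarity of the distinct values of two columns."""
    set_a, set_b = set(a), set(b)
    if not set_a and not set_b:
        return 0.0
    return len(set_a & set_b) / len(set_a | set_b)


# Same semantics, different vocabularies.
degree_a = ["bachelor", "master", "phd", "master"]
degree_b = [0, 1, 2, 1]

print(jaccard(degree_a, degree_b))            # 0.0: no shared values at all
print(len(set(degree_a)), len(set(degree_b)))  # 3 3: same number of distinct values
```

The value overlap is zero although the columns correspond, whereas structural properties such as the number of distinct values, and likewise the constraints the attribute participates in, are unaffected by the encoding.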
Depending on the project's progress, we can extend the scope step by step. Potential extensions are:
- Integration of similarities between attributes from different schemas, calculated using traditional schema matching approaches.
- Integration of constraints across different schemas (e.g., inclusion dependencies).
- Detection of one-to-many and many-to-many correspondences (e.g., by merging nodes within the hypergraph).
- Consideration of approximate constraints (i.e., constraints that are not always valid).
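For the last extension, one possible way to quantify how far a constraint is from being valid is sketched below for functional dependencies. It computes the fraction of rows that would have to be removed for the dependency to hold exactly (this measure is often called the g3 error in the data profiling literature). The function name `fd_violation_ratio`, the row representation as dictionaries, and the sample data are assumptions made for this example.

```python
from collections import Counter, defaultdict
from typing import Dict, Hashable, Sequence

Row = Dict[str, Hashable]


def fd_violation_ratio(rows: Sequence[Row], lhs: Sequence[str], rhs: str) -> float:
    """Fraction of rows that must be removed so that lhs -> rhs holds exactly."""
    if not rows:
        return 0.0
    # For each left-hand-side value combination, count the right-hand-side values.
    groups: Dict[tuple, Counter] = defaultdict(Counter)
    for row in rows:
        key = tuple(row[attr] for attr in lhs)
        groups[key][row[rhs]] += 1
    # In every group, the rows with the most frequent right-hand-side value can stay.
    kept = sum(max(counts.values()) for counts in groups.values())
    return (len(rows) - kept) / len(rows)


# Hypothetical sample data: one of four rows contradicts zip -> city.
rows = [
    {"zip": "14482", "city": "Potsdam"},
    {"zip": "14482", "city": "Potsdam"},
    {"zip": "14482", "city": "Berlin"},
    {"zip": "10115", "city": "Berlin"},
]

print(fd_violation_ratio(rows, ["zip"], "city"))  # 0.25
```

A threshold on this ratio could then decide whether a dependency is treated as an approximate constraint and added as a hyperedge; the choice of threshold is left open here.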