Project members

Felix Naumann
Ulf Leser
Jana Bauckmann
Benjamin Emde

Aladin (ALmost Automatic Data Integration) is a project for automatically integrating databases in the life sciences domain. Unlike most integration projects, Aladin is data-centric rather than schema-centric. The reason is that life science databases come in various representations and with varying, sometimes poor, schema quality. The same term (e.g., protein) used in two different data sources therefore does not reliably mean that both sources define it in the same way. Furthermore, attributes are often represented as strings, even when they hold numerical values.

One fundamental aspect of Aladin is based on our experience with life sciences databases: such databases typically contain one major class of data (gene, protein, sequence, etc.) with structured annotations. We call the relation modelling the major class of data the primary relation and the relations modelling the annotations secondary relations. The identifiers of the major objects are called accession numbers, and they are used as link targets when databases refer to each other. We therefore want to utilize the primary relation and the accession numbers for automatic database integration.

Integration Steps

The proposed integration process consists of five steps, shown in the upper figure. In the first step, the data source to be integrated is imported into a relational format. In our experience with the life sciences domain, publicly available import methods exist for almost all known data sources. In the remaining cases, a quick-and-dirty parser is sufficient for Aladin.

The second and third steps identify the primary relation and the secondary relations. First, we look for attributes that could serve as accession numbers. In our experience, this means that their values are at least four characters long, contain at least one character, and do not differ in length by more than 20%. Second, we search the data for unary inclusion dependencies (INDs; see the Spider algorithm) and, after applying some filtering heuristics, use them as foreign keys.
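The two checks described above can be illustrated with a minimal Python sketch. It rests on several assumptions that are not stated in this description: "at least one character" is read as "at least one non-digit character", the 20% rule is read as the gap between the shortest and the longest value relative to the longest, and the inclusion test is a naive all-pairs set comparison rather than the Spider algorithm. All function names and the example data are hypothetical.

```python
from itertools import permutations


def is_accession_candidate(values):
    """Apply the three value heuristics to the string values of one attribute."""
    values = [v for v in values if v is not None]
    if not values:
        return False
    lengths = [len(v) for v in values]
    # Heuristic 1: every value is at least four characters long.
    if min(lengths) < 4:
        return False
    # Heuristic 2 (assumed reading): no value consists of digits only.
    if any(v.isdigit() for v in values):
        return False
    # Heuristic 3 (assumed reading): shortest and longest value differ by at most 20%.
    return max(lengths) - min(lengths) <= 0.2 * max(lengths)


def unary_inds(columns):
    """Return all (dependent, referenced) attribute pairs with value-set inclusion.

    `columns` maps a (relation, attribute) pair to its list of values.
    This is a naive quadratic check, not the Spider algorithm.
    """
    value_sets = {
        name: {v for v in vals if v is not None} for name, vals in columns.items()
    }
    return [
        (dep, ref)
        for dep, ref in permutations(value_sets, 2)
        if value_sets[dep] and value_sets[dep] <= value_sets[ref]
    ]


# Hypothetical example data for one imported source.
columns = {
    ("protein", "acc"): ["P12345", "Q67890", "O43210"],
    ("protein", "length"): ["120", "4500", "87"],
    ("annotation", "protein_acc"): ["P12345", "P12345", "Q67890"],
}
candidates = [name for name, vals in columns.items() if is_accession_candidate(vals)]
inds = unary_inds(columns)
```

Note that the sketch omits the filtering heuristics mentioned above, so every detected inclusion is reported, including ones that would not be meaningful foreign keys.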

We identify a relation as the primary relation if (i) it contains an accession number candidate and (ii) the number of INDs referencing any attribute of this relation is maximal compared to the other relations. All non-primary relations are identified as secondary relations.
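As an illustration of this rule, the following Python sketch picks the primary relation from a set of accession number candidates and a list of INDs. It is a sketch under assumptions, not the project's implementation: the maximum in condition (ii) is taken among the relations that satisfy condition (i), every IND counts equally, and ties are broken alphabetically. The function name and the example relations are hypothetical.

```python
from collections import Counter


def primary_relation(accession_candidates, inds):
    """Return the eligible relation referenced by the most INDs, or None.

    accession_candidates: set of (relation, attribute) pairs that passed
        the accession number heuristics.
    inds: list of ((dep_relation, dep_attr), (ref_relation, ref_attr)) pairs.
    """
    # Condition (ii): count the INDs that reference any attribute of a relation.
    incoming = Counter(ref_relation for _dep, (ref_relation, _attr) in inds)
    # Condition (i): only relations with an accession number candidate qualify.
    eligible = sorted({relation for relation, _attr in accession_candidates})
    if not eligible:
        return None
    # max() keeps the first maximum, so ties resolve alphabetically (assumption).
    return max(eligible, key=lambda relation: incoming[relation])


# Hypothetical example: two relations have candidates, "protein" is referenced most.
candidates = {("protein", "acc"), ("annotation", "protein_acc")}
inds = [
    (("annotation", "protein_acc"), ("protein", "acc")),
    (("feature", "protein_acc"), ("protein", "acc")),
    (("feature", "annotation_id"), ("annotation", "id")),
]
relations = {"protein", "annotation", "feature"}
primary = primary_relation(candidates, inds)
secondary = relations - {primary}
```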

Whereas the three preceding steps operate within a single source at a time (intra-source), the fourth and fifth steps work at the inter-source level. In the fourth step we find cross-references to objects in other data sources, and in the fifth we find duplicates, i.e., objects representing the same real-world object. These steps let us filter redundant information and combine complementary information, and they conclude the integration process.
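This description does not say how cross-references are detected. One plausible reading, given that accession numbers serve as link targets between databases, is to look for attributes in one source whose values largely coincide with the accession numbers of another source. The Python sketch below shows that idea only; the function name, the `min_overlap` threshold, and the example data are hypothetical, and duplicate detection is not covered.

```python
def cross_reference_candidates(source_columns, target_accessions, min_overlap=0.8):
    """Return attributes whose distinct values mostly match foreign accession numbers.

    source_columns: maps (relation, attribute) to a list of string values.
    target_accessions: accession numbers of another source's primary relation.
    min_overlap: hypothetical threshold, not a value from the project description.
    """
    targets = set(target_accessions)
    result = {}
    for name, values in source_columns.items():
        distinct = {v for v in values if v is not None}
        if not distinct:
            continue
        # Fraction of this attribute's distinct values found among the targets.
        overlap = len(distinct & targets) / len(distinct)
        if overlap >= min_overlap:
            result[name] = overlap
    return result


# Hypothetical example: a structure database that refers to a protein database.
structure_columns = {
    ("structure", "protein_ref"): ["P12345", "Q67890", "P12345"],
    ("structure", "title"): ["kinase", "lysozyme"],
}
protein_accessions = ["P12345", "Q67890", "O43210"]
links = cross_reference_candidates(structure_columns, protein_accessions)
```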

Architecture

The architecture of Aladin is shown in the lower figure. Viewed from bottom to top, the lower part reflects the integration steps, namely data import and the algorithms that work within a source (intra-source) and across sources (inter-source).

We plan to provide three modes of accessing the integrated data: browsing, searching, and querying. Browsing allows users to jump from object to object via different kinds of links (even across data sources). Searching provides a full-text search over all stored data as well as a restricted search. Querying offers full SQL queries on the schemata as imported.

References

Alexandra Rostin, Oliver Albrecht, Jana Bauckmann, Felix Naumann, Ulf Leser
A Machine Learning Approach to Foreign Key Discovery.
12th International Workshop on the Web and Databases (WebDB 2009), Providence, Rhode Island.
Jana Bauckmann
Automatically Integrating Life Science Data Sources.
VLDB 2007 PhD Workshop, Vienna, Austria.
Jana Bauckmann, Ulf Leser, Felix Naumann, Véronique Tietz
Efficiently Detecting Inclusion Dependencies.
International Conference on Data Engineering (ICDE 2007), Istanbul, Turkey (poster paper, extended version available as technical report).
Jana Bauckmann, Ulf Leser, Felix Naumann, Joachim Schmid
Data Profiling: Effiziente Fremdschlüsselerkennung mit Aladin.
German Information Quality Conference & Workshop (GIQMC 2006), Bad Soden, Germany, November 2006.
Jana Bauckmann
Efficiently Identifying Inclusion Dependencies in RDBMS.
18. Workshop über Grundlagen von Datenbanken.
Jana Bauckmann, Ulf Leser, Felix Naumann
Efficiently Computing Inclusion Dependencies for Schema Discovery.
Second International Workshop on Database Interoperability (InterDB'06) (with ICDE06), Atlanta.
Ulf Leser, Felix Naumann
(Almost) Hands-Off Information Integration for the Life Sciences.
Conference on Innovative Database Research (CIDR 2005), Asilomar, CA.
