Hasso-Plattner-Institut
Prof. Dr. Felix Naumann
 

Project members

Aladin (ALmost Automatic Data Integration) is a project for automatically integrating databases in the life sciences domain. Unlike most integration projects, Aladin takes a data-centric rather than a schema-centric approach. The reason is that life science databases come in many representations, with schemas of varying and sometimes poor quality. The same term (e.g. protein) used in two data sources therefore does not reliably mean that both sources define it in the same way. Furthermore, attributes are often stored as strings, even when they hold numerical values.

A fundamental assumption of Aladin stems from our experience with life sciences databases: they typically contain one major class of data (gene, protein, sequence, etc.) together with structured annotations. We call the relation modelling the major class of data the primary relation and the relations modelling the annotations secondary relations. The identifiers of the major objects are called accession numbers, and they serve as link targets when databases refer to each other. We therefore want to exploit the primary relation and the accession numbers for automatic database integration.

Integration Steps

The proposed integration process consists of five steps, shown in the upper figure. In the first step, the data source to be integrated is imported into a relational format. In our experience with the life sciences domain, import methods are publicly available for almost all known data sources. In the remaining cases, a quick-and-dirty parser is sufficient for Aladin.

The second and third steps identify the primary relation and the secondary relations. First, we look for attributes that could serve as accession numbers. Based on our experience, this means that their values are at least four characters long, contain at least one non-digit character, and do not differ in length by more than 20%. Second, we search the data for unary inclusion dependencies (INDs; see the Spider algorithm) and, after applying some filtering heuristics, use them as foreign keys.
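The two checks described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not Aladin's implementation: the function names and parameters are hypothetical, the non-digit reading of the character rule and the way the 20% length difference is measured (relative to the longest value) are assumptions, and the inclusion dependency test is a naive set comparison rather than the Spider algorithm, with no filtering heuristics applied.

```python
def is_accession_candidate(values, min_len=4, max_len_spread=0.2):
    """Heuristic test whether a column could hold accession numbers.

    `values` is the list of values of one attribute (hypothetical input format).
    """
    values = [str(v) for v in values if v is not None]
    if not values:
        return False
    lengths = [len(v) for v in values]
    # Rule 1: every value is at least four characters long.
    if min(lengths) < min_len:
        return False
    # Rule 2 (assumed reading): every value contains a non-digit character,
    # i.e. no value consists of digits only.
    if any(v.isdigit() for v in values):
        return False
    # Rule 3 (assumed measure): the shortest and longest value differ by at
    # most 20% of the longest value's length.
    return (max(lengths) - min(lengths)) / max(lengths) <= max_len_spread


def unary_inds(columns):
    """Naive unary IND detection: A is included in B if every value of A
    also occurs in B. `columns` maps an attribute name to its values.

    This compares all attribute pairs directly; it is not the Spider algorithm.
    """
    value_sets = {name: set(vals) for name, vals in columns.items()}
    return [
        (dep, ref)
        for dep, dep_vals in value_sets.items()
        for ref, ref_vals in value_sets.items()
        if dep != ref and dep_vals and dep_vals <= ref_vals
    ]
```

For example, a column holding `["P12345", "Q9H0H5", "O43521"]` passes the candidate test, whereas a purely numeric column such as `["1234", "5678"]` does not. The pairwise comparison in `unary_inds` is quadratic in the number of attributes, which is acceptable for a sketch but is exactly the cost that a dedicated IND algorithm is meant to avoid.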

We identify a relation as the primary relation if (i) it contains an accession number candidate and (ii) the number of INDs referencing any attribute of this relation is maximal compared with the other relations. All other relations are identified as secondary relations.
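The selection rule above can be illustrated with a short Python sketch. The function name and the input format are hypothetical. Two further assumptions are made: the maximum is taken only over relations that contain an accession number candidate, and ties are broken alphabetically by relation name. The text does not specify either detail.

```python
from collections import Counter


def find_primary_relation(accession_candidates, inds):
    """Pick the primary relation of one data source.

    accession_candidates: set of (relation, attribute) pairs that passed
        the accession number heuristics.
    inds: iterable of ((dep_relation, dep_attribute),
                       (ref_relation, ref_attribute)) pairs, each meaning
        that the dependent attribute is included in the referenced one.
    Returns the relation name, or None if there is no candidate.
    """
    # Condition (i): the relation contains an accession number candidate.
    relations_with_candidate = {rel for rel, _ in accession_candidates}
    if not relations_with_candidate:
        return None

    # Condition (ii): count INDs that reference any attribute of a relation.
    incoming = Counter(ref_rel for _, (ref_rel, _) in inds)

    # Assumption: maximise over candidate relations only; sorting makes
    # the tie-break deterministic (alphabetical order).
    return max(sorted(relations_with_candidate), key=lambda rel: incoming[rel])


candidates = {("protein", "acc"), ("feature", "id")}
inds = [
    (("feature", "protein_acc"), ("protein", "acc")),
    (("comment", "protein_acc"), ("protein", "acc")),
    (("comment", "feature_id"), ("feature", "id")),
]
primary = find_primary_relation(candidates, inds)
```

In this made-up example, `protein` is referenced by two INDs and `feature` by one, so `protein` is chosen as the primary relation and the remaining relations would be treated as secondary.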

Whereas the three preceding steps each operate within a single source (intra-source), the fourth and fifth steps work at the inter-source level. The fourth step finds cross-references to objects in other data sources, and the fifth step finds duplicates, i.e., objects representing the same real-world object. These steps allow us to filter redundant information and to combine complementary information, and they conclude the integration process.

Architecture

The architecture of Aladin is shown in the lower figure. Read from bottom to top, its lower part reflects the integration steps, namely data import followed by the intra-source and inter-source algorithms.

We plan to provide three modes of accessing the integrated data: browsing, searching, and querying. Browsing lets users jump from object to object via different kinds of links (even across data sources). Searching provides a full-text search over all stored data as well as a restricted search. Querying offers full SQL queries on the schemata as imported.

References