Hasso-Plattner-Institut
Prof. Dr. Felix Naumann
 

25.07.2012

Three short papers accepted at CIKM 2012

The 21st ACM International Conference on Information and Knowledge Management (CIKM 2012) will be held from October 29 to November 2, 2012, in Maui, Hawaii, USA.

 

Discovering Conditional Inclusion Dependencies
Jana Bauckmann, Ziawasch Abedjan, Heiko Müller, Ulf Leser and Felix Naumann

LINDA: Distributed Web-of-Data-Scale Entity Matching
Christoph Böhm, Gerard de Melo, Felix Naumann and Gerhard Weikum

Reconciling Ontologies and the Web of Data
Ziawasch Abedjan, Johannes Lorey and Felix Naumann 

 

Abstracts:

Discovering Conditional Inclusion Dependencies
Data dependencies, or integrity constraints, are used to improve the quality of a database schema, to optimize queries, and to ensure consistency in a database. In recent years, conditional dependencies have been introduced to analyze and improve data quality. In short, a conditional dependency is a dependency with a limited scope defined by conditions over one or more attributes; only the matching part of the instance must adhere to the dependency. In this paper we focus on conditional inclusion dependencies (CINDs). We generalize the definition of CINDs, distinguishing covering and complete conditions. We present a new use case for such CINDs, showing their value for solving complex data quality tasks. Further, we define quality measures for conditions inspired by precision and recall. We propose efficient algorithms that identify covering and complete conditions conforming to given quality thresholds. Our algorithms choose not only the condition values but also the condition attributes automatically. Finally, we show that our approach efficiently provides meaningful and helpful results for our use case.
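The quality measures mentioned in the abstract are inspired by precision and recall. As a purely illustrative sketch (not the paper's formal definitions), the following Python snippet scores one candidate condition "cond_attr = cond_value" for an inclusion dependency between a dependent attribute of R and a referenced value set S; all identifiers (condition_quality, dep_attr, cond_attr) are hypothetical.

# Illustrative sketch only: scoring a single candidate condition (cond_attr = cond_value)
# for a conditional inclusion dependency on R[dep_attr]. The measures below mimic
# precision and recall but are assumptions, not the paper's formal definitions.
def condition_quality(r_tuples, s_values, dep_attr, cond_attr, cond_value):
    selected = [t for t in r_tuples if t[cond_attr] == cond_value]   # tuples matching the condition
    included = [t for t in r_tuples if t[dep_attr] in s_values]      # tuples satisfying the inclusion

    # precision-like: of the tuples the condition selects, how many satisfy the inclusion?
    precision = (sum(1 for t in selected if t[dep_attr] in s_values) / len(selected)
                 if selected else 0.0)
    # recall-like: of all tuples satisfying the inclusion, how many does the condition cover?
    recall = (sum(1 for t in included if t[cond_attr] == cond_value) / len(included)
              if included else 0.0)
    return precision, recall

# Toy example: the condition type = 'store' exactly captures the tuples whose
# city values appear in the referenced set, so both measures are 1.0.
r = [{"city": "Berlin", "type": "store"},
     {"city": "Potsdam", "type": "store"},
     {"city": "Nowhere", "type": "web"}]
s = {"Berlin", "Potsdam"}
print(condition_quality(r, s, dep_attr="city", cond_attr="type", cond_value="store"))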

LINDA: Distributed Web-of-Data-Scale Entity Matching
Linked Data has emerged as a powerful new way of interconnecting structured data sources on the Web to express information about entities and their relationships. In practice, however, the cross-linkage between Linked Data sources is not nearly as extensive as one would hope for. In this paper, we formalize the task of automatically creating "sameAs" links to connect equivalent entities across data sources in a globally consistent manner. Our LINDA (Linked Data Alignment) algorithm, provided in a multi-core as well as a distributed version, achieves this link generation by accounting for joint evidence of a match rather than considering potential links individually. Unlike previous approaches, we thus consider the entire Linked Data Web simultaneously. Our algorithm iteratively processes a judiciously constructed graph of weighted candidate links. A series of experiments confirms that our system scales beyond the size of the Billion Triple Challenge dataset and delivers highly accurate results despite the vast heterogeneity of the Linked Data Web.
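The abstract's key idea is that candidate "sameAs" links are judged jointly rather than in isolation. The following Python sketch is a hypothetical illustration of that idea on a small in-memory graph of weighted candidate links; it is not the LINDA algorithm itself, and all names and parameters (greedy_joint_matching, boost, threshold) are assumptions.

import heapq

# Hypothetical sketch: accept candidate sameAs links greedily, but let every
# accepted link raise the evidence of related candidates instead of scoring
# each link in isolation. Not the LINDA algorithm, only the underlying idea.
def greedy_joint_matching(candidates, related, boost=0.1, threshold=0.5):
    # candidates: {(e1, e2): weight} of candidate sameAs links
    # related: {(e1, e2): [neighbouring candidate pairs]} whose evidence
    #          should increase when (e1, e2) is accepted
    weights = dict(candidates)
    heap = [(-w, pair) for pair, w in weights.items()]   # max-heap via negation
    heapq.heapify(heap)

    matched = {}       # entity -> chosen partner (one accepted link per entity)
    accepted = set()

    while heap:
        neg_w, pair = heapq.heappop(heap)
        e1, e2 = pair
        if -neg_w != weights[pair]:
            continue                      # stale entry, a boosted copy exists
        if weights[pair] < threshold:
            break                         # all remaining candidates are too weak
        if e1 in matched or e2 in matched:
            continue                      # would conflict with an accepted link
        matched[e1], matched[e2] = e2, e1
        accepted.add(pair)

        # joint evidence: accepting this link strengthens related candidates
        for other in related.get(pair, []):
            if other in weights and other not in accepted:
                weights[other] += boost
                heapq.heappush(heap, (-weights[other], other))

    return accepted

In the paper's setting this kind of joint reasoning is of course carried out at Web scale in a multi-core or distributed fashion, rather than with a single in-memory priority queue as in this toy sketch.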

Reconciling Ontologies and the Web of Data
To integrate Linked Open Data, which originates from various and heterogeneous sources, the use of well-defined ontologies is essential. However, oftentimes the utilization of these ontologies by data publishers differs from the intended application envisioned by ontology engineers. This may lead to unspecified properties being used ad hoc as predicates in RDF triples, or it may result in infrequent usage of specified properties. These mismatches impede the goals and propagation of the Web of Data, as data consumers face difficulties when trying to discover and integrate domain-specific information. In this work, we identify and classify common misusage patterns by employing frequency analysis and rule mining. Based on this analysis, we introduce an algorithm to propose suggestions for a data-driven ontology re-engineering workflow, which we evaluate on two large-scale RDF datasets.
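As a rough illustration of the frequency analysis mentioned in the abstract (the rule-mining part is omitted), the following Python sketch compares the predicates actually used in an RDF dataset with the properties declared in an ontology; the function name, the property IRIs, and the rarity threshold are made-up examples, not part of the paper.

from collections import Counter

# Illustrative sketch: flag predicates that are used in the data but never
# declared in the ontology (ad-hoc usage), and declared properties that are
# (almost) never used. Threshold and names are assumptions.
def property_usage_report(triples, declared_properties, rare_threshold=0.001):
    counts = Counter(p for _, p, _ in triples)      # predicate usage frequencies
    total = sum(counts.values()) or 1               # avoid division by zero

    undeclared = {p for p in counts if p not in declared_properties}
    rarely_used = {p for p in declared_properties
                   if counts.get(p, 0) / total < rare_threshold}
    return undeclared, rarely_used

# Toy example: ex:nickName is used ad hoc, foaf:mbox is declared but unused.
triples = [("ex:e1", "foaf:name", '"Alice"'),
           ("ex:e1", "ex:nickName", '"Al"'),
           ("ex:e2", "foaf:name", '"Bob"')]
ontology = {"foaf:name", "foaf:mbox"}
print(property_usage_report(triples, ontology))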