Hasso-Plattner-Institut
  
Prof. Dr. Felix Naumann
  
 

Arvid Heise

Former PhD student

Email: Arvid Heise

Research Activities

  • Cloud Computing
  • Parallel and Declarative Data Cleansing
  • MapReduce with Hadoop

Publications

Integrating Open Government Data with Stratosphere for more Transparency

Arvid Heise, Felix Naumann
Web Semantics: Science, Services and Agents on the World Wide Web, vol. 14(1):45-56, 2012

DOI: 10.1016/j.websem.2012.02.002

Abstract:

Governments are increasingly publishing their data to enable organizations and citizens to browse and analyze the data. However, the heterogeneity of this Open Government Data hinders meaningful search, analysis, and integration and thus limits the desired transparency. In this article, we present the newly developed data integration operators of the Stratosphere parallel data analysis framework to overcome the heterogeneity. With declaratively specified queries, we demonstrate the integration of well-known government data sources and other large open data sets at technical, structural, and semantic levels. Furthermore, we publish the integrated data on the Web in a form that enables users to discover relationships between persons, government agencies, funds, and companies. The evaluation shows that linking person entities of different data sets results in a good precision of 98.3% and a recall of 95.2%. Moreover, the integration of large data sets scales well on up to eight machines.

Keywords:

Data integration, data cleansing, schema mapping, record linkage, data fusion, parallel query processing, map-reduce

BibTeX file

@article{HeiseNaumann2012,
author = { Arvid Heise and Felix Naumann },
title = { Integrating Open Government Data with Stratosphere for more Transparency },
journal = { Web Semantics: Science, Services and Agents on the World Wide Web },
year = { 2012 },
volume = { 14 },
number = { 1 },
pages = { 45--56 },
abstract = { Governments are increasingly publishing their data to enable organizations and citizens to browse and analyze the data. However, the heterogeneity of this Open Government Data hinders meaningful search, analysis, and integration and thus limits the desired transparency. In this article, we present the newly developed data integration operators of the Stratosphere parallel data analysis framework to overcome the heterogeneity. With declaratively specified queries, we demonstrate the integration of well-known government data sources and other large open data sets at technical, structural, and semantic levels. Furthermore, we publish the integrated data on the Web in a form that enables users to discover relationships between persons, government agencies, funds, and companies. The evaluation shows that linking person entities of different data sets results in a good precision of 98.3% and a recall of 95.2%. Moreover, the integration of large data sets scales well on up to eight machines. },
keywords = { Data integration, data cleansing, schema mapping, record linkage, data fusion, parallel query processing, map-reduce },
publisher = { Elsevier },
issn = { 1570-8268 },
priority = { 0 }
}

Copyright Notice

This material is presented to ensure timely dissemination of scholarly and technical work. Copyright and all rights therein are retained by authors or by other copyright holders. All persons copying this information are expected to adhere to the terms and constraints invoked by each author's copyright. In most cases, these works may not be reposted without the explicit permission of the copyright holder.

last change: Fri, 17 Apr 2015 11:19:37 +0200