Hasso-Plattner-Institut
  
Prof. Dr. Felix Naumann
  
 

Stratosphere

Stratosphere is a joint DFG project conducted by the Technische Universität Berlin, Humboldt Universität Berlin, and the Hasso-Plattner-Institut. It explores how the elasticity of clouds can be exploited for processing analytic queries massively in parallel. Unlike most traditional DBMS, Stratosphere inherently supports text-based and semi-structured data.

Official Project Site

The sub-projects at HPI focus on data quality improvements of linked open data, efficient and scalable data profiling, and knowledge discovery.

Data Cleansing

We defined the declarative data cleansing language Meteor, and we implement the underlying basic operations and develop cost estimations for them. Furthermore, we provide test data sets and example queries to evaluate the efficiency and effectiveness of the data cleansing process.

Data Profiling

Detecting dependencies in ever-growing amounts of data has a high computational complexity. One way to cope with this complexity is to distribute the computational work among multiple interconnected computers. However, most existing data profiling algorithms are designed to run on a single machine rather than for parallel execution on computer clusters. Therefore, we research distributed modifications of existing algorithms as well as new algorithms that can be executed efficiently on computer clusters and that scale out with the number of cluster nodes.
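
As an illustration of the general idea, and not of the project's actual algorithms, the following minimal Python sketch checks a single candidate functional dependency (lhs -> rhs) by hash-partitioning the rows on the left-hand side, validating each partition independently, and combining the partial results. All function names are hypothetical, and a thread pool merely stands in for the nodes of a cluster.

```python
from concurrent.futures import ThreadPoolExecutor


def partition_by_key(rows, lhs, num_partitions):
    """Hash-partition rows on the LHS value, so that all rows sharing
    an LHS value end up in the same partition."""
    partitions = [[] for _ in range(num_partitions)]
    for row in rows:
        partitions[hash(row[lhs]) % num_partitions].append(row)
    return partitions


def fd_holds_locally(partition, lhs, rhs):
    """Check lhs -> rhs within one partition: every LHS value must map
    to exactly one RHS value."""
    seen = {}
    for row in partition:
        if seen.setdefault(row[lhs], row[rhs]) != row[rhs]:
            return False
    return True


def fd_holds(rows, lhs, rhs, num_partitions=4):
    """Validate lhs -> rhs by checking all partitions in parallel.
    The threads are an assumption standing in for cluster nodes."""
    partitions = partition_by_key(rows, lhs, num_partitions)
    with ThreadPoolExecutor(max_workers=num_partitions) as pool:
        results = pool.map(lambda p: fd_holds_locally(p, lhs, rhs), partitions)
        return all(results)


rows = [
    {"zip": "14482", "city": "Potsdam"},
    {"zip": "10115", "city": "Berlin"},
    {"zip": "10117", "city": "Berlin"},
    {"zip": "14482", "city": "Potsdam"},
]
print(fd_holds(rows, "zip", "city"))  # True: each zip maps to one city
print(fd_holds(rows, "city", "zip"))  # False: Berlin maps to two zips
```

Because rows with equal left-hand-side values always land in the same partition, no violation can span two partitions, so the local checks need no communication beyond the final conjunction. Discovering all dependencies of a data set is considerably harder, since the number of candidates grows exponentially with the number of columns.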

Knowledge Discovery

Driven by applications such as social media analytics, Web search, advertising, recommendation, mobile sensing, genomic sequencing, and astronomical observations, the need for scalable learning, mining, and knowledge discovery methods is steadily growing. Often the challenge is to automatically process and analyze terabytes of evolving data. Extracting value from such data (e.g., understanding the underlying structure and making predictions) before it becomes outdated is a major concern. Therefore, the goal is to make such applications scalable on top of Stratosphere.

Please contact Felix Naumann, Toni Grütze (Knowledge Discovery on Stratosphere), or Sebastian Kruse (Data Profiling on Stratosphere) for further questions.


Publications

CohEEL: Coherent and Efficient Named Entity Linking through Random Walks

Gruetze, Toni and Kasneci, Gjergji and Zuo, Zhe and Naumann, Felix
Web Semantics: Science, Services and Agents on the World Wide Web, vol. 37(C):75–89, March 2016
http://dx.doi.org/10.1016/j.websem.2016.03.001

DOI: 10.1016/j.websem.2016.03.001

Abstract:

In recent years, the ever-growing amount of documents on the Web as well as in digital libraries led to a considerable increase of valuable textual information about entities. Harvesting entity knowledge from these large text collections is a major challenge. It requires the linkage of textual mentions within the documents with their real-world entities. This process is called entity linking. Solutions to this entity linking problem have typically aimed at balancing the rate of linking correctness (precision) and the linking coverage rate (recall). While entity links in texts could be used to improve various Information Retrieval tasks, such as text summarization, document classification, or topic-based clustering, the linking precision is the decisive factor. For example, for topic-based clustering a method that produces mostly correct links would be more desirable than a high-coverage method that leads to more but also more uncertain clusters. We propose an efficient linking method that uses a random walk strategy to combine a precision-oriented and a recall-oriented classifier in such a way that a high precision is maintained, while recall is elevated to the maximum possible level without affecting precision. An evaluation on three datasets with distinct characteristics demonstrates that our approach outperforms seminal work in the area and shows higher precision and time performance than the most closely related state-of-the-art methods.

Keywords:

Entity Linking, Named Entity Disambiguation, Random Walk, Machine Learning

BibTeX file

@article{Gruetze2016,
author = { Gruetze, Toni and Kasneci, Gjergji and Zuo, Zhe and Naumann, Felix },
title = { CohEEL: Coherent and Efficient Named Entity Linking through Random Walks },
journal = { Web Semantics: Science, Services and Agents on the World Wide Web },
year = { 2016 },
volume = { 37 },
number = { C },
pages = { 75--89 },
month = { 3 },
abstract = { In recent years, the ever-growing amount of documents on the Web as well as in digital libraries led to a considerable increase of valuable textual information about entities. Harvesting entity knowledge from these large text collections is a major challenge. It requires the linkage of textual mentions within the documents with their real-world entities. This process is called entity linking. Solutions to this entity linking problem have typically aimed at balancing the rate of linking correctness (precision) and the linking coverage rate (recall). While entity links in texts could be used to improve various Information Retrieval tasks, such as text summarization, document classification, or topic-based clustering, the linking precision is the decisive factor. For example, for topic-based clustering a method that produces mostly correct links would be more desirable than a high-coverage method that leads to more but also more uncertain clusters. We propose an efficient linking method that uses a random walk strategy to combine a precision-oriented and a recall-oriented classifier in such a way that a high precision is maintained, while recall is elevated to the maximum possible level without affecting precision. An evaluation on three datasets with distinct characteristics demonstrates that our approach outperforms seminal work in the area and shows higher precision and time performance than the most closely related state-of-the-art methods. },
keywords = { Entity Linking, Named Entity Disambiguation, Random Walk, Machine Learning },
url = { http://dx.doi.org/10.1016/j.websem.2016.03.001 },
publisher = { Elsevier B.V. },
issn = { 1570-8268 },
priority = { 0 }
}


last change: Fri, 12 Aug 2016 17:23:06 +0200