Hasso-Plattner-Institut
Prof. Dr. Felix Naumann
 

29.06.2012

3 Papers accepted at VLDB Workshops

DBRank 2012 – 6th International Workshop on Ranking in Databases, in conjunction with VLDB 2012

Scalable Similarity Search with Dynamic Similarity Measures

Martin Köppelmann, Dustin Lange, Claudia Lehmann, Marika Marszalkowski, Felix Naumann, Peter Retzlaff, Sebastian Stange, and Lea Voget

Similarity search on structured data assumes some similarity measure on the data, often a combination of individual measures per attribute. Users of a similarity search system may have different requirements for the similarity measure; the individual measures can be combined in many different ways, ranging from a simple weighted sum of the similarities with varying weights to much more complex machine learning techniques. Previous approaches to similarity search work only with static similarity measures. Because the created similarity indexes rely on these static measures, any similarity query is restricted to the predefined measure.
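As a rough illustration of this idea (not taken from the paper), the following Python sketch combines per-attribute similarity measures into a query-specific overall measure using a simple weighted sum; the attributes, measures, and weights are invented for the example.

```python
# Illustrative sketch: combining per-attribute similarity measures into a
# query-specific overall measure (here a weighted sum with per-query weights).

def jaccard(a, b):
    """Token-set Jaccard similarity between two strings."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 1.0

# One similarity measure per attribute of the structured records.
attribute_measures = {
    "name": jaccard,
    "city": lambda a, b: 1.0 if a.lower() == b.lower() else 0.0,
}

def combined_similarity(query, record, weights):
    """Weighted sum of the per-attribute similarities; the weights (and thus
    the overall measure) may differ from query to query."""
    return sum(
        weights[attr] * measure(query[attr], record[attr])
        for attr, measure in attribute_measures.items()
    )

q = {"name": "Jon Smith", "city": "Berlin"}
r = {"name": "John Smith", "city": "berlin"}
print(combined_similarity(q, r, {"name": 0.7, "city": 0.3}))
```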

In this paper, we present the DySim algorithm, a novel approach that answers similarity queries with query-specific configurations of similarity measures: for any query, users may define an arbitrary, including non-metric, overall similarity measure that is based on the similarities of attribute values. This freedom makes it possible to provide different search experiences to different user groups. Our approach creates a similarity index for each individual attribute. We then dynamically generate a query plan using positive ad-hoc sample results and apply a filter-and-refine approach to quickly retrieve the result list. We evaluated our approach on a large real-world data set. Compared to a baseline algorithm, the DySim algorithm needs about one third fewer comparisons to retrieve data, at the cost of only a very minor decline in recall.
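The sketch below illustrates the general filter-and-refine pattern with per-attribute indexes; it is a strong simplification and not the DySim implementation. The token index, the stand-in similarity function, and the threshold are all assumptions made for the example; in practice the refine step would apply a user-defined combined measure such as the one sketched above.

```python
# Illustrative filter-and-refine sketch: per-attribute inverted indexes yield
# candidates cheaply, and only the candidates are scored with the full,
# query-specific (possibly non-metric) measure.

from collections import defaultdict

def build_token_index(records, attr):
    """Inverted index from attribute tokens to record ids (the filter step)."""
    index = defaultdict(set)
    for rid, rec in enumerate(records):
        for token in rec[attr].lower().split():
            index[token].add(rid)
    return index

def search(query, records, indexes, similarity, threshold):
    # Filter: every record sharing at least one token with the query on any
    # indexed attribute becomes a candidate.
    candidates = set()
    for attr, index in indexes.items():
        for token in query[attr].lower().split():
            candidates |= index.get(token, set())
    # Refine: apply the expensive overall measure only to the candidates.
    results = [(rid, similarity(query, records[rid])) for rid in candidates]
    return [(rid, s) for rid, s in results if s >= threshold]

records = [
    {"name": "John Smith", "city": "Berlin"},
    {"name": "Jane Doe", "city": "Potsdam"},
]
indexes = {attr: build_token_index(records, attr) for attr in ("name", "city")}

# Stand-in overall measure; in practice, the user's query-specific combination.
sim = lambda q, r: sum(0.5 for a in ("name", "city") if q[a].lower() == r[a].lower())

print(search({"name": "John Smith", "city": "Berlin"}, records, indexes, sim, 0.5))
```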

This paper is the result of a master project at HPI.

 

BigData – 2nd International Workshop on End-to-end Management of Big Data, in conjunction with VLDB 2012

Meteor/Sopremo: An Extensible Query Language and Operator Model

Arvid Heise, Astrid Rheinlaender, Marcus Leich, Ulf Leser, and Felix Naumann

Recently, quite a few query and scripting languages for Map-Reduce-based systems have been developed to ease the formulation of complex data analysis tasks. However, existing tools mainly provide basic operators for rather simple analyses, such as aggregating or filtering. Analytical functionality for advanced applications, such as data cleansing or information extraction, can only be embedded in user-defined functions, whose semantics are hidden from the query compiler and optimizer. In this paper, we present a language that treats application-specific functions as first-class operators, so that operator semantics can be evaluated and exploited for optimization at compile time.

We present Sopremo, a semantically rich operator model, and Meteor, an extensible query language that is grounded in Sopremo. Sopremo also provides a programming framework that allows users to easily develop and integrate extensions with their respective operators and instantiations. Meteor's syntax is operator-oriented and uses a JSON-like data model to support applications that analyze semi-structured and unstructured data. Meteor queries are translated into data flow programs of operator instantiations, i.e., concrete implementations of the involved Sopremo operators. Using a real-world example, we show how operators from different applications can be combined to write complex analytical queries.
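To convey the underlying idea only, here is a toy Python sketch of operators that expose properties to a compiler instead of hiding their logic in opaque UDFs. It is not Meteor syntax and not the Sopremo programming framework; the operator properties, the reordering rule, and the example data are all invented for illustration.

```python
# Toy sketch of "first-class operators": each operator declares properties
# that an optimizer could exploit, rather than being an opaque UDF.

from dataclasses import dataclass
from typing import Callable, Iterable

@dataclass
class Operator:
    name: str
    apply: Callable[[Iterable[dict]], list]
    selective: bool = False   # drops records, so it is worth running early
    expensive: bool = False   # e.g. information extraction or data cleansing

def optimize(plan):
    """Toy rule: run cheap, selective operators before expensive ones."""
    return sorted(plan, key=lambda op: (op.expensive, not op.selective))

def run(plan, data):
    for op in optimize(plan):
        data = op.apply(data)
    return data

plan = [
    Operator("extract_entities",
             lambda rs: [dict(r, entities=r["text"].split()) for r in rs],
             expensive=True),
    Operator("filter_english",
             lambda rs: [r for r in rs if r["lang"] == "en"],
             selective=True),
]
data = [{"text": "big data", "lang": "en"}, {"text": "grosse Daten", "lang": "de"}]
print(run(plan, data))
```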

 

QDB – 10th International Workshop on Quality in Databases, in conjunction with VLDB 2012

Automatic Blocking Key Selection for Duplicate Detection based on Unigram Combinations

Tobias Vogel and Felix Naumann

Duplicate detection is the process of identifying multiple but different representations of the same real-world objects, which typically involves a large number of comparisons. Partitioning is a well-known technique to avoid many unnecessary comparisons. However, partitioning keys are usually handcrafted, which is tedious, and the keys are often poorly chosen.

We propose a technique to automatically find suitable blocking keys for a dataset equipped with a gold standard. We then show how to re-use those blocking keys for datasets from similar domains that lack a gold standard. Blocking keys are created based on unigrams, which we extend with length hints for further improvement. Blocking key creation is accompanied by several comprehensive experiments on large artificial and real-world datasets.
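For readers unfamiliar with blocking, the following Python sketch shows the basic mechanism assumed here: a blocking key built from unigrams (single characters) of attribute values groups records into blocks, and only within-block pairs are compared; a tiny gold standard is used to check how many true duplicate pairs survive the blocking. The key definition, records, and gold standard are invented for the example and do not reproduce the paper's selection algorithm.

```python
# Illustrative sketch of unigram-based blocking for duplicate detection.

from collections import defaultdict
from itertools import combinations

def blocking_key(record, key_def):
    """key_def is a list of (attribute, character position) pairs."""
    return tuple(record[attr][pos] if pos < len(record[attr]) else ""
                 for attr, pos in key_def)

def build_blocks(records, key_def):
    blocks = defaultdict(list)
    for rid, rec in enumerate(records):
        blocks[blocking_key(rec, key_def)].append(rid)
    return blocks

def candidate_pairs(blocks):
    """Only records within the same block are compared."""
    return {pair for ids in blocks.values() for pair in combinations(ids, 2)}

records = [
    {"first": "anna",  "last": "schmidt"},
    {"first": "ana",   "last": "schmidt"},   # duplicate of record 0
    {"first": "bernd", "last": "mueller"},
]
gold_duplicates = {(0, 1)}
key_def = [("first", 0), ("last", 0)]        # first character of each attribute

pairs = candidate_pairs(build_blocks(records, key_def))
print("comparisons:", len(pairs),
      "recall:", len(pairs & gold_duplicates) / len(gold_duplicates))
```

With a gold standard, candidate key definitions could be scored by exactly this trade-off: few candidate comparisons while keeping the recall over the known duplicate pairs high.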