Our group includes PostDocs, PhD students, and student assistants, and is headed by Prof. Felix Naumann. If you are interested in joining our team, please contact Felix Naumann.
For bachelor students we offer German-language lectures on database systems in addition to paper- or project-oriented seminars. Within a one-year bachelor project, students finalize their studies in cooperation with external partners. For master students we offer courses on information integration, data profiling, and information retrieval, complemented by specialized seminars and master projects, and we advise master theses.
Most of our research is conducted in the context of larger research projects, in collaboration across students, across groups, and across universities. We strive to make available most of our datasets and source code.
Abstract: Machine-based clustering yields fuzzy results. For example, when detecting duplicates in a dataset, different tools might end up with different clusterings. Eventually, a decision needs to be made that defines which records are in the same cluster, i.e., are duplicates. Such a definitive result is called a Consensus Clustering. It can be created by evaluating the clustering attempts against each other and having human experts resolve only the disagreements. Yet, there can be different consensus clusterings, depending on the choice of disagreements presented to the human expert. In particular, they may require a different number of manual inspections. We present a set of strategies to select the smallest set of manual inspections to arrive at a consensus clustering and evaluate their efficiency on a set of real-world and synthetic datasets.
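As a rough illustration of the starting point described in this abstract, the following minimal sketch lists the record pairs on which two clusterings disagree. It is not the selection strategy from the paper; the function names and the toy record IDs are assumptions made for this example.

```python
from itertools import combinations

def duplicate_pairs(clustering):
    """Return all record pairs that a clustering places in the same cluster."""
    pairs = set()
    for cluster in clustering:
        # Sorting makes each pair canonical, e.g. ("r1", "r2") and never ("r2", "r1").
        pairs.update(combinations(sorted(cluster), 2))
    return pairs

def disagreements(clustering_a, clustering_b):
    """Pairs that exactly one of the two clusterings considers duplicates."""
    return duplicate_pairs(clustering_a) ^ duplicate_pairs(clustering_b)

# Hypothetical output of two duplicate detection tools on records r1..r4.
tool_1 = [{"r1", "r2", "r3"}, {"r4"}]
tool_2 = [{"r1", "r2"}, {"r3", "r4"}]

# Only these pairs would be candidates for manual inspection.
to_inspect = sorted(disagreements(tool_1, tool_2))
print(to_inspect)
```

Both tools agree on the pair (r1, r2), so it does not need to be shown to an expert. Which of the remaining pairs to present, and in which order, is the question the strategies in the paper address.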
Reach for Gold: An Annealing Standard to Evaluate Duplicate Detection Results. Vogel, Tobias; Heise, Arvid; Draisbach, Uwe; Lange, Dustin; Naumann, Felix in JDIQ (2014). 5(1-2)
Abstract: Duplicate detection is the process of identifying multiple but different representations of the same real-world objects, which typically involves a large number of comparisons. Partitioning is a well-known technique to avoid many unnecessary comparisons. However, partitioning keys are usually handcrafted, which is tedious, and the keys are often poorly chosen. We propose a technique to find suitable blocking keys automatically for a dataset equipped with a gold standard. We then show how to re-use those blocking keys for datasets from similar domains lacking a gold standard. Blocking keys are created based on unigrams, which we extend with length-hints for further improvement. Blocking key creation is accompanied by several comprehensive experiments on large artificial and real-world datasets.
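The idea of unigram-based blocking keys with length hints can be sketched as follows. This is a simplified illustration under stated assumptions, not the algorithm from the paper: a key is assumed here to be a list of (attribute, character position) pairs, the length hint is assumed to be the length of one attribute value, and all names and sample records are hypothetical.

```python
from collections import defaultdict
from itertools import combinations

def blocking_key(record, positions, length_hint_attr=None):
    """Build a key from single characters (unigrams) at given attribute positions."""
    parts = []
    for attr, pos in positions:
        value = record.get(attr, "")
        parts.append(value[pos].lower() if pos < len(value) else "_")
    if length_hint_attr is not None:
        # Assumed form of a length hint: append the length of one attribute value.
        parts.append(str(len(record.get(length_hint_attr, ""))))
    return "".join(parts)

def candidate_pairs(records, positions, length_hint_attr=None):
    """Compare only records that share a blocking key."""
    blocks = defaultdict(list)
    for rid, rec in records.items():
        blocks[blocking_key(rec, positions, length_hint_attr)].append(rid)
    pairs = set()
    for ids in blocks.values():
        pairs.update(combinations(sorted(ids), 2))
    return pairs

def pairs_completeness(candidates, gold_pairs):
    """Fraction of gold-standard duplicate pairs that survive blocking."""
    return len(candidates & gold_pairs) / len(gold_pairs)

records = {
    1: {"name": "Smith", "city": "Berlin"},
    2: {"name": "Smyth", "city": "Berlin"},
    3: {"name": "Meyer", "city": "Potsdam"},
    4: {"name": "Maier", "city": "Potsdam"},
    5: {"name": "Schmidt", "city": "Bonn"},
}
gold = {(1, 2), (3, 4)}  # hypothetical gold standard
key = [("name", 0), ("city", 0)]

plain = candidate_pairs(records, key)
hinted = candidate_pairs(records, key, length_hint_attr="name")
print(len(plain), pairs_completeness(plain, gold))
print(len(hinted), pairs_completeness(hinted, gold))
```

On this toy data the length hint reduces the number of comparisons without losing a gold-standard pair. In general a length hint can also separate true duplicates whose values differ in length, which is why candidate keys need to be evaluated against a gold standard.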
Instance-based "one-to-some" Assignment of Similarity Measures to Attributes. Vogel, Tobias; Naumann, Felix (2011).
Abstract: Data quality is a key factor for economic success. It is usually defined as a set of properties of data, such as completeness, accessibility, relevance, and conciseness. The latter includes the absence of multiple representations of the same real-world objects. To avoid such duplicates, there is a wide range of commercial products and customized self-coded software. These programs can be quite expensive both in acquisition and maintenance. In particular, small and medium-sized companies cannot afford these tools. Moreover, it is difficult to set up and tune all necessary parameters in these programs. Recently, web-based applications for duplicate detection have emerged. However, they are not easy to integrate into the local IT landscape and require much manual configuration effort. With DAQS (Data Quality as a Service) we present a novel approach to support duplicate detection. The approach features (1) minimal required user interaction and (2) self-configuration for the provided input data. To this end, each data cleansing task is classified to find out which metadata is available. Next, similarity measures are automatically assigned to the provided records’ attributes and a duplicate detection process is carried out. In this paper we introduce a novel matching approach, called one-to-some or 1:k assignment, to assign similarity measures to attributes. We performed an extensive evaluation on a large training corpus and ten test datasets of address data and achieved promising results.
Projektseminar "Similarity Search Algorithms". Lange, Dustin; Vogel, Tobias; Draisbach, Uwe; Naumann, Felix in Datenbank-Spektrum (2011). 11(1) 51–57.
POSR: A Comprehensive System for Aggregating and Using Web Services (demo). AbuJarour, Mohammed; Craculeac, Mircea; Menge, Falko; Vogel, Tobias; Schwarz, Jan-Felix (2009).
Abstract: Recently, the number of public Web Services has been constantly increasing. Nevertheless, consuming Web Services as an end-user is not straightforward, because creating a suitable user interface for consuming a Web Service requires much effort. In this work, we introduce a novel approach where user interface fragments for consuming Web Services are generated automatically, and aggregated and customized by end-users to match their preferences. Users can collaboratively improve the auto-generated user interfaces and share them among each other. Our three main sources of Web Services are explicit registration, automatic identification and collecting over the Web, as well as extraction and generation from existing web applications. We validated our approach by implementing it as a comprehensive system coined “Posr”.
Encapsulating Multi-stepped Web Forms as Web Services. Vogel, Tobias; Kaufer, Frank; Naumann, Felix (2009). 488–497.
Abstract: HTML forms are the predominant interface between users and web applications. Many of these applications display a sequence of multiple forms on separate pages, for instance to book a flight or order a DVD. We introduce a method to wrap these multi-stepped forms and offer their individual functionality as a single consolidated Web Service. This Web Service in turn maps input data to the individual forms in the correct order. Such consolidation better enables operation of the forms by applications and provides a simpler interface for human users. To this end we analyze the HTML code and sample user interaction of each page and infer the internal model of the application. A particular challenge is to map semantically equivalent fields across multiple forms and choose meaningful labels for them. Web Service output is parsed from the resulting HTML page. Experiments on different multi-stepped web forms show the feasibility and usefulness of our approach.
Master's Theses
<a href="http://www.hpi.uni-potsdam.de/fileadmin/user_upload/fachgebiete/naumann/arbeiten/Thema_Masterarbeit.pdf">Duplicate Detection Across Structured And Unstructured Data</a> - David Sonnabend <br>
Duplicate Detection with CrowdSourcing (e.g. Amazon's Mechanical Turk) - David Wenzel