- Stefan George, Matthias Kohnen, Philipp Langer, Tobias Metzke, Patrick Schulze : Assigning Global Relevance Scores to DBpedia Facts
Knowledge bases have become ubiquitous assets in today's Web. They provide access to billions of statements about real-world entities derived from governmental, institutional, product-oriented, bibliographic, biochemical, and various general-purpose datasets. The sheer amount of statements that can be retrieved for a given entity calls for ranking techniques that return the most salient statements as top results.
We analyse and compare various ranking strategies; some of them synergistically combine complementary aspects such as frequency and inverse frequency with structural features, while others rely on authority-based measures (e.g., PageRank) and Web-based co-occurrence statistics for entity pairs. A user-based evaluation of all approaches has been conducted on the popular DBpedia knowledge base with statistics derived from an indexed version of the ClueWeb corpus.
- Thorsten Papenbrock : Progressive Duplicate Detection
As an integral part of data cleansing, duplicate detection is the process of identifying multiple representations of the same real-world entities. In today's businesses, duplicate detection processes need to resolve ever larger datasets in ever shorter time, which causes their efficiency to decrease constantly. Therefore, this thesis presents a novel, progressive duplicate detection workflow. The proposed workflow significantly increases the efficiency of finding duplicates if the execution time is limited. Each component of this workflow maximizes the gain of the overall process within the time available by reporting most results much earlier than a traditional approach. To measure the progressive performance of our algorithms, we also propose a novel quality measure that evaluates the entire runtime behavior of a duplicate detection algorithm. Finally, our experiments show that the progressive algorithms can double the efficiency over time compared to a traditional duplicate detection approach.
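The abstract leaves the workflow's components unspecified, but the core idea of progressiveness can be illustrated with a progressive variant of the well-known sorted-neighborhood method: record pairs are compared in order of increasing distance in a sort order, so near neighbors (the most likely duplicates) are reported first and a limited comparison budget is spent where it pays off most. The string records, identity sort key, and similarity threshold below are illustrative assumptions, not the thesis's actual components:

```python
from difflib import SequenceMatcher

def progressive_pairs(records, key=lambda r: r):
    """Yield record pairs ordered by their distance in the sort order,
    so close neighbors (likely duplicates) are emitted first."""
    order = sorted(range(len(records)), key=lambda i: key(records[i]))
    for dist in range(1, len(records)):           # grow the window step by step
        for pos in range(len(records) - dist):
            yield records[order[pos]], records[order[pos + dist]]

def find_duplicates(records, threshold=0.8, budget=None):
    """Report duplicate pairs early; stop when the comparison budget runs out."""
    results, comparisons = [], 0
    for a, b in progressive_pairs(records):
        if budget is not None and comparisons >= budget:
            break
        comparisons += 1
        if SequenceMatcher(None, a, b).ratio() >= threshold:
            results.append((a, b))
    return results
```

Even with a small budget, the pairs compared first are those adjacent in the sort order, which is where traditional blocking approaches also expect duplicates to cluster.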
- Sven Viehmeier : Incremental Data Profiling
Data profiling subsumes different tasks that obtain statistical and structural information about a dataset. This information is of great value regarding business applications, data management, data integration, and data quality. Existing solutions do not consider changes to the database. However, databases in real-world scenarios change frequently. Hence, constantly up-to-date information is required to make correct decisions. Although existing solutions are highly optimized, they still require a huge amount of time to recalculate the information after a change. In contrast, incremental algorithms update data profiling information more efficiently by utilizing previous results and information about the change. In this thesis, we present the Min-Large algorithm, an incremental solution to update median values. Furthermore, we present the incremental Ecnum algorithm that updates previously discovered unique column combinations. In experiments, our incremental algorithms outperform existing non-incremental solutions by an order of magnitude on various datasets.
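The abstract does not detail the Min-Large algorithm itself; as an illustration of the general principle of incremental maintenance, the standard two-heap technique updates a median in O(log n) per insertion instead of re-sorting the whole column after every change:

```python
import heapq

class IncrementalMedian:
    """Maintain the median of a column under insertions without recomputation.
    (Illustrative two-heap technique, not the thesis's Min-Large algorithm.)"""

    def __init__(self):
        self.lo = []   # max-heap (values negated) holding the smaller half
        self.hi = []   # min-heap holding the larger half

    def insert(self, x):
        if self.lo and x > -self.lo[0]:
            heapq.heappush(self.hi, x)
        else:
            heapq.heappush(self.lo, -x)
        # rebalance so that len(lo) == len(hi) or len(lo) == len(hi) + 1
        if len(self.lo) > len(self.hi) + 1:
            heapq.heappush(self.hi, -heapq.heappop(self.lo))
        elif len(self.hi) > len(self.lo):
            heapq.heappush(self.lo, -heapq.heappop(self.hi))

    def median(self):
        if len(self.lo) == len(self.hi):
            return (-self.lo[0] + self.hi[0]) / 2
        return -self.lo[0]
```

The same pattern — keep an auxiliary structure from the previous run and apply only the delta — underlies incremental profiling in general.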
- Florian Thomas : Optimization of Rule-Based Duplicate Detection
Duplicate detection denotes methods and algorithms for finding multiple representations of the same real-world object. The classic approach to detecting duplicates is the pairwise comparison of records using similarity functions. Thresholds then decide whether a pair of records constitutes a duplicate or a non-duplicate. To increase quality, case distinctions are sometimes necessary, which, however, make the similarity functions very complex. Therefore, rule-based approaches to duplicate detection have been developed, which allow complex domain-specific conditions to be formulated as rules. The execution order of the rules can have a considerable influence on the runtime of the duplicate detection.
Experiments show that, in the worst case, the runtime is twice as high as with the runtime-optimal order. This thesis therefore presents a computation algorithm that enables the determination of an optimal rule order, computed step by step from the properties of the rules and the characteristics of the dataset. The experiments show that, for various datasets, the algorithm yields rule orders that are nearly optimal with respect to runtime.
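The thesis's stepwise computation algorithm is not given in the abstract; as an illustration of why rule order matters, a classic heuristic for short-circuiting filters orders them by the ratio of evaluation cost to the probability that the rule decides the pair. Both quantities would be estimated from the dataset; the rule names and numbers below are made up for the sketch:

```python
def order_rules(rules):
    """Order short-circuiting rules to minimize expected runtime per pair.
    Each rule is (name, cost, p_decides), where p_decides is the estimated
    probability that the rule settles the comparison. For independent rules,
    sorting by cost / p_decides is the classic optimal ordering."""
    return sorted(rules, key=lambda r: r[1] / r[2])

def expected_cost(rules):
    """Expected evaluation cost per record pair for a given rule order."""
    total, p_undecided = 0.0, 1.0
    for _, cost, p in rules:
        total += p_undecided * cost      # a rule runs only if still undecided
        p_undecided *= (1 - p)
    return total

rules = [("exact-match", 1.0, 0.30),      # cheap, rarely decisive
         ("phonetic-key", 2.0, 0.50),
         ("edit-distance", 10.0, 0.90)]   # expensive, usually decisive
```

With these (hypothetical) estimates, the cheap rules run first and the expensive edit-distance rule is only reached for pairs the cheap rules could not settle.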
- Maximilian Jenders : Analyzing and Predicting Viral Tweets
Twitter and other microblogging services have become indispensable sources of information in today's web. Understanding the main factors that make certain pieces of information spread quickly in these platforms can be decisive for the analysis of opinion formation and many other opinion mining tasks.
This paper addresses important questions concerning the spread of information on Twitter. What makes Twitter users retweet a tweet? Is it possible to predict whether a tweet will become "viral", i.e., will be frequently retweeted? To answer these questions we provide an extensive analysis of a wide range of tweet and user features regarding their influence on the spread of tweets. The most impactful features are chosen to build a learning model that predicts viral tweets with high accuracy. All experiments are performed on a real-world dataset, extracted through a public Twitter API based on user IDs from the TREC 2011 microblog corpus.
- Wei Wang : Similarity Query Processing Algorithms: Use of Enumeration and Divide and Conquer Techniques
In this talk, I'll first give a brief overview of the research conducted by my group at UNSW, and then focus on recent advances in efficient similarity query processing methods.
A similarity query finds all the objects in the database that are similar to the query object according to a given similarity/distance function and a threshold. It plays a fundamental role in many areas such as data integration, pattern recognition, and biological or chemical informatics.
A key algorithmic challenge is how to execute similarity queries efficiently, and decades of research have contributed many elegant solutions. The objective of this talk is to provide an introduction to, and a categorization of, similarity query algorithms based on enumeration and divide-and-conquer ideas, with an emphasis on Hamming and edit distance functions.
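For edit distance, one standard efficiency technique in this family is threshold-aware dynamic programming: with threshold k, only a band of width 2k+1 around the diagonal of the DP matrix needs to be computed, and the computation can stop early once every cell in a row exceeds k. This is a sketch of that well-known cutoff idea, not of the specific algorithms covered in the talk:

```python
def within_edit_distance(a, b, k):
    """Return True iff the edit distance between a and b is at most k,
    computing only a diagonal band of the DP matrix (threshold cutoff)."""
    if abs(len(a) - len(b)) > k:
        return False                      # length filter: cannot be within k
    INF = k + 1                           # any value > k means "out of reach"
    prev = [j if j <= k else INF for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [INF] * (len(b) + 1)
        if i <= k:
            curr[0] = i
        for j in range(max(1, i - k), min(len(b), i + k) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            curr[j] = min(prev[j] + 1,        # deletion
                          curr[j - 1] + 1,    # insertion
                          prev[j - 1] + cost) # match / substitution
        prev = curr
        if min(prev) > k:                 # early termination: band exhausted
            return False
    return prev[len(b)] <= k
```

Instead of O(|a|·|b|) cells, only O(k·min(|a|,|b|)) cells are filled, which is what makes large-scale similarity joins with small thresholds feasible.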
- Johannes Lorey : Detecting SPARQL Query Templates for Data Prefetching
Publicly available Linked Data repositories provide a multitude of information. By utilizing SPARQL, Web sites and services can consume this data and present it in a user-friendly form, e.g., in mash-ups. To gather RDF triples for this task, machine agents typically issue similarly structured queries with recurring patterns against the SPARQL endpoint. These queries usually differ only in a small number of individual triple pattern parts, such as resource labels or literals in objects. We present an approach to detect such recurring patterns in queries and introduce the notion of query templates, which represent clusters of similar queries exhibiting these recurrences. We describe a matching algorithm to extract query templates and illustrate the benefits of prefetching data by utilizing these templates. Finally, we comment on the applicability of our approach using results from real-world SPARQL query logs.
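The matching algorithm itself is not given in the abstract; the basic idea of collapsing similarly structured queries into a template can be sketched by replacing the varying triple pattern parts (literals and object resources) with placeholders and clustering queries by the resulting template string. This regex-based sketch is a deliberate simplification; a real implementation would operate on parsed SPARQL:

```python
import re
from collections import defaultdict

def to_template(query):
    """Replace literals and IRIs in object position with a placeholder,
    so structurally identical queries collapse to one template.
    (Simplified regex-based sketch, not a full SPARQL parser.)"""
    query = re.sub(r'"[^"]*"(@\w+|\^\^\S+)?', '?VAR', query)   # literals
    query = re.sub(r'<[^>]*>(\s*[.}])', r'?VAR\1', query)      # object IRIs
    return query

def cluster_queries(queries):
    """Group queries that share the same template."""
    clusters = defaultdict(list)
    for q in queries:
        clusters[to_template(q)].append(q)
    return clusters
```

Queries that differ only in a label or literal then land in the same cluster, and the shared template identifies what a prefetcher could retrieve in one batch.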
- Johannes Lorey : Caching and Prefetching Strategies for SPARQL Queries
Linked Data repositories offer a wealth of structured facts, useful for a wide array of application scenarios. However, retrieving this data using SPARQL queries yields a number of challenges, such as limited endpoint capabilities and availability, or high latency for connecting to it. To cope with these challenges, we argue that it is advantageous to cache data that is relevant for future information needs. However, instead of only retaining results of previously issued queries, we aim at retrieving data that is potentially interesting for subsequent requests in advance. To this end, we present different methods to modify the structure of a query so that the altered query can be used to retrieve additional related information. We evaluate these approaches by applying them to requests found in real-world SPARQL query logs.
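One simple way to modify a query so that it retrieves additional related data is relaxation: turn a constant into a fresh variable, so a single issued query also covers sibling requests that differ only in that constant. The sketch below relaxes literals at the string level; the variable naming and regex handling are illustrative assumptions, not the paper's actual method:

```python
import re

def relax_query(query):
    """Replace each literal with a fresh variable and add it to the
    projection, so the relaxed query prefetches results for all requests
    that differ only in that literal. (String-level sketch; a real
    implementation would rewrite the parsed query.)"""
    counter = [0]
    def fresh_var(match):
        counter[0] += 1
        return f"?relaxed{counter[0]}"
    body = re.sub(r'"[^"]*"(@\w+|\^\^\S+)?', fresh_var, query)
    new_vars = " ".join(f"?relaxed{i + 1}" for i in range(counter[0]))
    if not new_vars:
        return body
    return re.sub(r'SELECT\s+', f'SELECT {new_vars} ', body, count=1)
```

The cache can then answer the original query, and any later query matching the same relaxed form, from the prefetched result set.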
- Maximilian Schneider, Martin Kreichgauer, Thorben Lindhauer, Sebastian Kruse, Nils Rethmeier, Fabian Tschirnitz, Robin Schreiber, Toni Mattis, Wiradarma Pratama, Uwe Hartmann : Advanced Recommendation Techniques
While the information available on the Internet increases constantly, recommendation techniques become more important in order to efficiently find similar information. In a Master's seminar, students were given an overview of state-of-the-art recommendation techniques and methods for dealing with large and sparse user-item matrices. The students then implemented and presented scalable recommendation algorithms, which were evaluated on a large-scale dataset extracted from various social news sites.
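The seminar's algorithms are not specified; a minimal example of working with a sparse user-item matrix is item-based collaborative filtering, where item-item cosine similarities are computed from only the co-rated entries rather than the full matrix. The dictionary-of-dictionaries representation below is an illustrative choice:

```python
from math import sqrt
from collections import defaultdict

def item_cosine_similarities(ratings):
    """Item-item cosine similarities from a sparse user -> item -> rating
    mapping. Only co-rated item pairs are touched, exploiting sparsity."""
    dot = defaultdict(float)    # dot products of co-rated item pairs
    norm = defaultdict(float)   # squared norms of item rating vectors
    for items in ratings.values():
        for i, ri in items.items():
            norm[i] += ri * ri
            for j, rj in items.items():
                if i < j:
                    dot[(i, j)] += ri * rj
    return {(i, j): v / (sqrt(norm[i]) * sqrt(norm[j]))
            for (i, j), v in dot.items()}
```

Items never rated by a common user get no entry at all, which is exactly what keeps the computation tractable on large, sparse datasets.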