Proceedings of the 20th ACM Conference on Information and Knowledge Management, CIKM 2011, Glasgow, UK, October 24-28, 2011
Efficient Similarity Search: Arbitrary Similarity Measures, Arbitrary Composition
Dustin Lange and Felix Naumann
Abstract. Given a (large) set of objects and a query, similarity search aims to find all objects similar to the query. A frequent approach is to define a set of base similarity measures for the different aspects of the objects, and to build light-weight similarity indexes on these measures. To determine the overall similarity of two objects, the results of these base measures are composed, e.g., using simple aggregates or more involved machine learning techniques. We propose the first solution to this search problem that does not place any restrictions on the similarity measures, the composition technique, or the data set size.
We define the query plan optimization problem to determine the best query plan using the similarity indexes. A query plan must choose which individual indexes to access and which thresholds to apply. The plan result should be as complete as possible within some cost threshold. We propose the approximative top neighborhood algorithm, which determines a near-optimal plan while significantly reducing the amount of candidate plans to be considered. An exact version of the algorithm determines the optimal solution. Evaluation on real-world data indicates that both versions clearly outperform a complete search of the query plan space.
Frequency-aware Similarity Measures: Why Arnold Schwarzenegger is Always a Duplicate
Dustin Lange and Felix Naumann
Abstract. Measuring the similarity of two records is a challenging problem, but necessary for fundamental tasks, such as duplicate detection and similarity search. By exploiting frequencies of attribute values, many similarity measures can be improved: In a person table with U.S. citizens, Arnold Schwarzenegger is a very rare name. If we find several Arnold Schwarzeneggers in it, it is very likely that these are duplicates. We are then less strict when comparing other attribute values, such as birth date or address.
We put this intuition to use by partitioning compared record pairs according to frequencies of attribute values. For example, we could create three partitions from our data: Partition 1 contains all pairs with rare names, Partition 2 all pairs with medium frequent names, and Partition 3 all pairs with frequent names. For each partition, we learn a different similarity measure: we apply machine learning techniques to combine a set of base similarity measures into an overall similarity measure. To determine a good partitioning, we compare different partitioning strategies. We achieved best results with a novel algorithm inspired by genetic programming.
We evaluate our approach on real-world data sets from a large credit rating agency and from a bibliography database. We show that our learning approach works well for logistic regression, SVM, and decision trees with significant improvements over (i) learning models that ignore frequencies and (ii) frequency-enriched models without partitioning.
Advancing the Discovery of Unique Column Combinations
Ziawasch Abedjan and Felix Naumann
Abstract. Unique column combinations of a relational database table are sets of columns that contain only unique values. Discovering such combinations is a fundamental research problem and has many different data management and knowledge discovery applications. Existing discovery algorithms are either brute force or have a high memory load and can thus be applied only to small datasets or samples. In this paper, the well-known Gordian algorithm and "Apriori-based" algorithms are compared and analyzed for further optimization. We greatly improve the Apriori algorithms through efficient candidate generation and statistics-based pruning methods. A hybrid solution HCA-Gordian combines the advantages of Gordian and our new algorithm HCA, and it outperforms all previous work in many situations.
Trained Trigger Language Model for Sentence Retrieval in QA: Bridging the Vocabulary Gap
Saeedeh Momtazi and Dietrich Klakow
Abstract. We propose a novel language model for sentence retrieval in Question Answering (QA) systems called trained trigger language model. This model addresses the word mismatch and the data sparsity problems which are the two major problems of information retrieval. Different techniques such as the translation model and class model also proposed to overcome the same problems. Our trained trigger language model, however, uses another approach to this aim and is shown to outperform the exiting models. The proposed model captures pairs of trigger and target words while training on a large corpus. The word pairs are extracted based on both unsupervised and supervised approaches. While in unsupervised models, a large corpus of raw text is used to extract trigger and target terms, supervised models require a corpus of questions and answer sentences. We introduce different notions of triggering for unsupervised and supervised approaches. In addition, we study the impact of corpus size and domain for a supervised model. All notions of trained trigger model are finally used in a language model-based sentence retrieval framework. Our experiments on TREC QA collection verify that the proposed model significantly improves the sentence retrieval performance compared to the state-of-the-art translation model and class model.
Black Swan: Augmenting Statistics with Event Data
Johannes Lorey, Felix Naumann, Benedikt Forchhammer, Andrina Mascher, Peter Retzlaff, Armin ZamaniFarahani, Soeren Discher, Cindy Faehnrich, Stefan Lemme, Thorsten Papenbrock, Robert Christoph Peschel, Stephan Richter, Thomas Stening, Sven Viehmeier
Abstract. A large number of statistical indicators (GDP, life expectancy, income, etc.) collected over long periods of time as well as data on historical events (wars, earthquakes, elections, etc.) are published on the World Wide Web. By augmenting statistical outliers with relevant historical occurrences, we provide a means to observe (and predict) the influence and impact of events. The vast amount and size of available data sets enable the detection of recurring connections between classes of events and statistical outliers with the help of association rule mining. The results of this analysis are published at www.blackswanevents.org and can be explored interactively.
Context and Target Configurations for Mining RDF Data
(poster at SMER Workshop)
Ziawasch Abedjan and Felix Naumann
Abstract. Association rule mining has been widely studied in the context of basket analysis and sale recommendations . In fact, the concept can be applied to any domain with many items or events in which interesting relationships can be inferred from co-occurrence of those items or events in existing subsets (transactions). The increasing amount of Linked Open Data (LOD) in the World Wide Web raises new opportunities and challenges for the data mining community . LOD is often represented in the Resource Description Framework (RDF) data model. In RDF, data is represented by a triple structure consisting of subject, predicate, and object (SPO). Each triple represents a statement/fact. We propose an approach that applies association rule mining at statement level by introducing the concept of mining congurations.