-
Bleifuß, T., Bornemann, L., Kalashnikov, D.V., Naumann, F., Srivastava, D.: DBChEx: Interactive Exploration of Data and Schema Change. Proceedings of the Conference on Innovative Data Systems Research (CIDR) (2019).
-
Risch, J., Stoll, A., Ziegele, M., Krestel, R.: hpiDEDIS at GermEval 2019: Offensive Language Identification using a German BERT model. Proceedings of the 15th Conference on Natural Language Processing (KONVENS). pp. 403-408. German Society for Computational Linguistics & Language Technology, Erlangen, Germany (2019).
Pre-training language representations on large text corpora, for example with BERT, has recently been shown to achieve impressive performance at a variety of downstream NLP tasks. So far, applying BERT to offensive language identification for German-language texts failed due to the lack of pre-trained, German-language models. In this paper, we fine-tune a BERT model that was pre-trained on 12 GB of German texts to the task of offensive language identification. This model significantly outperforms our baselines and achieves a macro F1 score of 76% on coarse-grained, 51% on fine-grained, and 73% on implicit/explicit classification. We analyze the strengths and weaknesses of the model and derive promising directions for future work.
-
Kruse, S., Kaoudi, Z., Quiané-Ruiz, J.-A., Chawla, S., Naumann, F., Contreras-Rojas, B.: Optimizing Cross-Platform Data Movement. Proceedings of the International Conference on Data Engineering (ICDE) (2019).
-
Jiang, L., Vitagliano, G., Naumann, F.: A Scoring-based Approach for Data Preparator Suggestion. Presented at the (2019).
Self-service data preparation enables end users to prepare data by themselves. However, the plethora of possible data preparation steps can overwhelm the user. We introduce a score-based preparator ranking approach to propose preparator candidates in a context-specific manner. To this end, we give scoring functions for a selected set of preparators and outline future work towards a full-fledged data preparation system.
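To illustrate the general idea of score-based preparator ranking, the following sketch ranks a few preparator candidates for a column by simple context scores; the preparator names and scoring functions are hypothetical stand-ins, not the ones defined in the paper.
```python
# Hypothetical sketch of score-based preparator ranking (not the paper's scoring functions).
from typing import Callable, Dict, List

Column = List[str]

def score_split_attribute(col: Column) -> float:
    # Reward columns whose values consistently contain a delimiter.
    return sum(("," in v or ";" in v) for v in col) / max(len(col), 1)

def score_remove_missing(col: Column) -> float:
    # Reward columns with many empty or placeholder values.
    return sum(v.strip() in {"", "NULL", "N/A"} for v in col) / max(len(col), 1)

def score_trim_whitespace(col: Column) -> float:
    # Reward columns with leading or trailing whitespace.
    return sum(v != v.strip() for v in col) / max(len(col), 1)

PREPARATORS: Dict[str, Callable[[Column], float]] = {
    "SplitAttribute": score_split_attribute,
    "RemoveMissing": score_remove_missing,
    "TrimWhitespace": score_trim_whitespace,
}

def suggest(col: Column, top_k: int = 2) -> List[str]:
    # Rank preparators by their context-specific score for this column.
    return sorted(PREPARATORS, key=lambda name: PREPARATORS[name](col), reverse=True)[:top_k]

column = [" Alice,Smith", "Bob,Jones ", "N/A", "Carol,Lee"]
print(suggest(column))  # ['SplitAttribute', 'TrimWhitespace']
```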
-
Dürsch, F., Stebner, A., Windheuser, F., Fischer, M., Friedrich, T., Strelow, N., Bleifuß, T., Harmouch, H., Jiang, L., Papenbrock, T., Naumann, F.: Inclusion Dependency Discovery: An Experimental Evaluation of Thirteen Algorithms. Proceedings of the International Conference on Information and Knowledge Management (CIKM) (2019).
-
Schirmer, P., Papenbrock, T., Kruse, S., Naumann, F., Hempfing, D., Mayer, T., Neuschäfer-Rube, D.: DynFD: Functional Dependency Discovery in Dynamic Datasets. Proceedings of the International Conference on Extending Database Technology (EDBT). pp. 253-264 (2019).
Functional dependencies (FDs) support various tasks for the management of relational data, such as schema normalization, data cleaning, and query optimization. However, while existing FD discovery algorithms regard only static datasets, many real-world datasets are constantly changing – and with them their FDs. Unfortunately, the computational hardness of FD discovery prohibits a continuous re-execution of those existing algorithms with every change of the data. To this end, we propose DynFD, the first algorithm to discover and maintain functional dependencies in dynamic datasets. Whenever the inspected dataset changes, DynFD evolves its FDs rather than recalculating them. For this to work efficiently, we propose indexed data structures along with novel and efficient update operations. Our experiments compare DynFD's incremental mode of operation to the repeated re-execution of existing, static algorithms. They show that DynFD can maintain the FDs of dynamic datasets over an order of magnitude faster than its static counterparts.
-
Risch, J., Krestel, R.: Toxic Comment Detection in Online Discussions. In: Agarwal, B., Nayak, R., Mittal, N., and Patnaik, S. (eds.) Deep Learning Based Approaches for Sentiment Analysis. Springer (2019).
With the exponential growth in the use of social media networks such as Twitter, Facebook, and many others, an astronomical amount of big data has been generated. A substantial amount of this user-generated data is in the form of text, such as reviews, tweets, and blogs, which provides numerous challenges as well as opportunities for NLP (Natural Language Processing) researchers to discover meaningful information for various applications. Sentiment analysis is the study that analyses people's opinions and sentiments towards entities present in the text, such as products, services, persons, and organisations. Sentiment analysis and opinion mining are among the most popular and interesting research problems. In recent years, deep learning approaches have emerged as powerful computational models and have shown significant success in dealing with massive amounts of data in unsupervised settings. Deep learning is revolutionary because it offers an effective way of learning representations and allows a system to learn features automatically from data without the need to design them explicitly. Deep learning algorithms such as deep autoencoders, convolutional and recurrent neural networks (CNNs, RNNs), Long Short-Term Memory (LSTM), and Generative Adversarial Networks (GANs) have been reported to provide significantly improved results in various natural language processing tasks, including sentiment analysis.
-
Naumann, F.: The relational database management systems genealogy. In: Brodie, M.L. (ed.) Making Databases Work. pp. 173-179. ACM / Morgan & Claypool (2019).
-
Pena, E.H.M., de Almeida, E.C., Naumann, F.: Discovery of Approximate (and Exact) Denial Constraints. PVLDB. 13, (2019).
Maintaining data consistency is known to be hard. Recent approaches have relied on integrity constraints to deal with the problem - correct and complete constraints naturally work towards data consistency. State-of-the-art data cleaning frameworks have used the formalism known as denial constraint (DC) to handle a wide range of real-world constraints. Each DC expresses a relationship between predicates that indicate which combinations of attribute values are inconsistent. The design of DCs, however, must keep pace with the complexity of data and applications. The alternative to designing DCs by hand is automatically discovering DCs from data, which is computationally expensive due to the large search space of DCs. To tackle this challenging task, we present a novel algorithm to efficiently discover DCs: DCFinder. The algorithm combines data structures called position list indexes with techniques based on predicate selectivity to efficiently validate DC candidates. Because the available data often contain errors, DCFinder is especially designed to discover approximate DCs, i.e., DCs that may partially hold. Our experimental evaluation uses real and synthetic datasets, and shows that DCFinder outperforms all the existing approximate DC discovery algorithms.
-
Risch, J., Krestel, R.: Domain-specific word embeddings for patent classification. Data Technologies and Applications. 53, 108-122 (2019).
Patent offices and other stakeholders in the patent domain need to classify patent applications according to a standardized classification scheme. To examine the novelty of an application it can then be compared to previously granted patents in the same class. Automatic classification would be highly beneficial, because of the large volume of patents and the domain-specific knowledge needed to accomplish this costly manual task. However, a challenge for the automation is patent-specific language use, such as special vocabulary and phrases. To account for this language use, we present domain-specific pre-trained word embeddings for the patent domain. We train our model on a very large dataset of more than 5 million patents and evaluate it at the task of patent classification. To this end, we propose a deep learning approach based on gated recurrent units for automatic patent classification built on the trained word embeddings. Experiments on a standardized evaluation dataset show that our approach increases average precision for patent classification by 17 percent compared to state-of-the-art approaches. In this paper, we further investigate the model’s strengths and weaknesses. An extensive error analysis reveals that the learned embeddings indeed mirror patent-specific language use. The imbalanced training data and underrepresented classes are the most difficult remaining challenge.
-
Risch, J., Krestel, R.: Measuring and Facilitating Data Repeatability in Web Science. Datenbank-Spektrum. 19, 117-126 (2019).
Accessible and reusable datasets are a necessity to accomplish repeatable research. This requirement poses a problem particularly for web science, since scraped data comes in various formats and can change due to the dynamic character of the web. Further, usage of web data is typically restricted by copyright-protection or privacy regulations, which hinder publication of datasets. To alleviate these problems and reach what we define as “partial data repeatability”, we present a process that consists of multiple components. Researchers need to distribute only a scraper and not the data itself to comply with legal limitations. If a dataset is re-scraped for repeatability after some time, the integrity of different versions can be checked based on fingerprints. Moreover, fingerprints are sufficient to identify what parts of the data have changed and how much. We evaluate an implementation of this process with a dataset of 250 million online comments collected from five different news discussion platforms. We re-scraped the dataset after pausing for one year and show that less than ten percent of the data has actually changed. These experiments demonstrate that providing a scraper and fingerprints enables recreating a dataset and supports the repeatability of web science experiments.
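As a rough illustration of the fingerprinting idea (not the authors' implementation), the sketch below reduces each scraped record to a stable hash and compares the fingerprint sets of the original and the re-scraped dataset to quantify how much has changed.
```python
# Illustrative fingerprinting sketch (not the paper's implementation).
import hashlib
import json
from typing import Dict, Iterable

def fingerprint(record: Dict) -> str:
    # Serialize with sorted keys so the hash does not depend on key order.
    canonical = json.dumps(record, sort_keys=True, ensure_ascii=False)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

def changed_fraction(original: Iterable[Dict], rescraped: Iterable[Dict]) -> float:
    old = {fingerprint(r) for r in original}
    new = {fingerprint(r) for r in rescraped}
    if not old:
        return 0.0
    return 1.0 - len(old & new) / len(old)

comments_v1 = [{"id": 1, "text": "great article"}, {"id": 2, "text": "I disagree"}]
comments_v2 = [{"id": 1, "text": "great article"}, {"id": 2, "text": "[deleted]"}]
print(f"{changed_fraction(comments_v1, comments_v2):.0%} of records changed")  # 50%
```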
-
Jiang, L., Naumann, F.: Holistic Primary Key and Foreign Key Detection. Journal of Intelligent Information Systems. (2019).
Primary keys (PKs) and foreign keys (FKs) are important elements of relational schemata in various applications, such as query optimization and data integration. However, in many cases, these constraints are unknown or not documented. Detecting them manually is time-consuming and even infeasible in large-scale datasets. We study the problem of discovering primary keys and foreign keys automatically and propose an algorithm to detect both, namely Holistic Primary Key and Foreign Key Detection (HoPF). PKs and FKs are subsets of the sets of unique column combinations (UCCs) and inclusion dependencies (INDs), respectively, for which efficient discovery algorithms are known. Using score functions, our approach is able to effectively extract the true PKs and FKs from the vast sets of valid UCCs and INDs. Several pruning rules are employed to speed up the procedure. We evaluate precision and recall on three benchmarks and two real-world datasets. The results show that our method is able to retrieve on average 88% of all primary keys, and 91% of all foreign keys. We compare the performance of our algorithm with two baseline approaches that both assume the existence of primary keys.
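As a toy illustration of scoring a foreign-key candidate (HoPF's actual score functions and pruning rules are more elaborate), one might combine value-inclusion coverage with column-name similarity:
```python
# Toy foreign-key candidate scoring (illustrative; not HoPF's actual score functions).
from difflib import SequenceMatcher

def fk_score(fk_values, pk_values, fk_name, pk_name):
    # How many distinct FK values are contained in the PK column (IND coverage)?
    coverage = len(set(fk_values) & set(pk_values)) / max(len(set(fk_values)), 1)
    # How similar are the column names?
    name_sim = SequenceMatcher(None, fk_name.lower(), pk_name.lower()).ratio()
    return 0.7 * coverage + 0.3 * name_sim  # weights chosen arbitrarily

orders_customer_id = [1, 2, 2, 3]
customers_id = [1, 2, 3, 4]
print(round(fk_score(orders_customer_id, customers_id, "customer_id", "id"), 2))
```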
-
Kellermeier, T., Repke, T., Krestel, R.: Mining Business Relationships from Stocks and News. MIDAS@ECML-PKDD. (2019).
In today’s modern society and global economy, decision making processes are increasingly supported by data. Especially in financial businesses it is essential to know about how the players in our global or national market are connected. In this work we compare different approaches for creating company relationship graphs. In our evaluation we see similarities in relationships extracted from Bloomberg and Reuters business news and correlations in historic stock market data.
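A minimal sketch of the stock-market side of such a comparison, using made-up tickers and prices: compute daily returns and connect companies whose return series are strongly correlated.
```python
# Sketch: derive relationship edges from correlated daily stock returns (made-up data).
import numpy as np

prices = {  # hypothetical closing prices over five days
    "ACME": [100, 102, 101, 105, 107],
    "GLOBEX": [50, 51, 50, 53, 54],
    "INITECH": [200, 198, 202, 199, 203],
}

def daily_returns(series):
    p = np.asarray(series, dtype=float)
    return np.diff(p) / p[:-1]

tickers = list(prices)
edges = []
for i, a in enumerate(tickers):
    for b in tickers[i + 1:]:
        corr = float(np.corrcoef(daily_returns(prices[a]), daily_returns(prices[b]))[0, 1])
        if corr > 0.8:  # threshold chosen arbitrarily
            edges.append((a, b, round(corr, 2)))
print(edges)  # e.g. an edge between ACME and GLOBEX, whose returns move together
```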
-
Jain, N., Krestel, R.: Who is Mona L.? Identifying Mentions of Artworks in Historical Archives. Lecture Notes in Computer Science, Springer. 11799, 115-122 (2019).
Named entity recognition (NER) plays an important role in many information retrieval tasks, including automatic knowledge graph construction. Most NER systems are typically limited to a few common named entity types, such as person, location, and organization. However, for cultural heritage resources, such as art historical archives, the recognition of titles of artworks as named entities is of high importance. In this work, we focus on identifying mentions of artworks, e.g. paintings and sculptures, from historical archives. Current state-of-the-art NER tools are unable to adequately identify artwork titles due to the particular difficulties presented by this domain. The scarcity of training data for NER for cultural heritage poses further hindrances. To mitigate this, we propose a semi-supervised approach to create high-quality training data by leveraging existing cultural heritage resources. Our experimental evaluation shows significant improvement in NER performance for artwork titles as compared to a baseline approach.
-
Draisbach, U., Christen, P., Naumann, F.: Transforming Pairwise Duplicates to Entity Clusters for High Quality Duplicate Detection. ACM Journal on Data and Information Quality (JDIQ). (2019).
-
Repke, T., Krestel, R., Edding, J., Hartmann, M., Hering, J., Kipping, D., Schmidt, H., Scordialo, N., Zenner, A.: Beacon in the Dark: A System for Interactive Exploration of Large Email Corpora. Proceedings of the International Conference on Information and Knowledge Management (CIKM). pp. 1-4. ACM (2018).
Emails play a major role in today's business communication, documenting not only work but also decision making processes. The large amount of heterogeneous data in these email corpora renders manual investigations by experts infeasible. Auditors or journalists, e.g., who are looking for irregular or inappropriate content or suspicious patterns, are in desperate need of computer-aided exploration tools to support their investigations. We present our Beacon system for the exploration of such corpora at different levels of detail. A distributed processing pipeline combines text mining methods and social network analysis to augment the already semi-structured nature of emails. The user interface ties into the resulting cleaned and enriched dataset. For the interface design we identify three objectives expert users have: gain an initial overview of the data to identify leads to investigate, understand the context of the information at hand, and have meaningful filters to iteratively focus onto a subset of emails. To this end we make use of interactive visualisations for rearranging and aggregating the extracted information to reveal salient patterns.
-
van Aken, B., Risch, J., Krestel, R., Löser, A.: Challenges for Toxic Comment Classification: An In-Depth Error Analysis. Proceedings of the 2nd Workshop on Abusive Language Online (co-located with EMNLP). pp. 33-42 (2018).
Toxic comment classification has become an active research field with many recently proposed approaches. However, while these approaches address some of the task's challenges, others remain unsolved, and directions for further research are needed. To this end, we compare different approaches on a new, large comment dataset and propose an ensemble that outperforms all individual models. Further, we validate our findings on a second dataset. The results of the ensemble enable us to perform an extensive error analysis, which reveals open challenges for state-of-the-art methods and directions for future research. These challenges include missing paradigmatic context and inconsistent dataset labels.
-
Risch, J., Krestel, R.: Learning Patent Speak: Investigating Domain-Specific Word Embeddings. Proceedings of the Thirteenth International Conference on Digital Information Management (ICDIM). pp. 63-68 (2018).
A patent examiner needs domain-specific knowledge to classify a patent application according to its field of invention. Standardized classification schemes help to compare a patent application to previously granted patents and thereby check its novelty. Due to the large volume of patents, automatic patent classification would be highly beneficial to patent offices and other stakeholders in the patent domain. However, a challenge for the automation of this costly manual task is the patent-specific language use. To facilitate this task, we present domain-specific pre-trained word embeddings for the patent domain. We trained our model on a very large dataset of more than 5 million patents to learn the language use in this domain. We evaluated the quality of the resulting embeddings in the context of patent classification. To this end, we propose a deep learning approach based on gated recurrent units for automatic patent classification built on the trained word embeddings. Experiments on a standardized evaluation dataset show that our approach increases average precision for patent classification by 17 percent compared to state-of-the-art approaches.
-
Risch, J., Krebs, E., Löser, A., Riese, A., Krestel, R.: Fine-Grained Classification of Offensive Language. Proceedings of GermEval (co-located with KONVENS). pp. 38-44 (2018).
Social media platforms receive massive amounts of user-generated content that may include offensive text messages. In the context of the GermEval task 2018, we propose an approach for fine-grained classification of offensive language. Our approach comprises a Naive Bayes classifier, a neural network, and a rule-based approach that categorize tweets. In addition, we combine the approaches in an ensemble to overcome weaknesses of the single models. We cross-validate our approaches with regard to macro-average F1-score on the provided training dataset.
-
Risch, J., Garda, S., Krestel, R.: Book Recommendation Beyond the Usual Suspects: Embedding Book Plots Together with Place and Time Information. Proceedings of the 20th International Conference On Asia-Pacific Digital Libraries (ICADL). pp. 227-239 (2018).
Content-based recommendation of books and other media is usually based on semantic similarity measures. While metadata can be compared easily, measuring the semantic similarity of narrative literature is challenging. Keyword-based approaches are biased to retrieve books of the same series or do not retrieve any results at all in sparser libraries. We propose to represent plots with dense vectors to foster semantic search for similar plots even if they do not have any words in common. Further, we propose to embed plots, places, and times in the same embedding space. Thereby, we allow arithmetics on these aspects. For example, a book with a similar plot but set in a different, user-specified place can be retrieved. We evaluate our findings on a set of 16,000 book synopses that spans literature from 500 years and 200 genres and compare our approach to a keyword-based baseline.
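The described embedding arithmetic can be sketched as follows, with tiny made-up vectors instead of the trained embeddings: subtract a book's place vector, add a different place, and retrieve the nearest plot.
```python
# Toy sketch of plot/place embedding arithmetic (3-d vectors made up for illustration).
import numpy as np

def nearest(query, candidates):
    def cos(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
    return max(candidates, key=lambda title: cos(query, candidates[title]))

plots = {  # hypothetical embeddings of book synopses
    "Sea adventure set in London": np.array([0.9, 0.1, 0.8]),
    "Sea adventure set in Tokyo": np.array([0.9, 0.8, 0.1]),
    "Castle romance set in Tokyo": np.array([0.1, 0.8, 0.2]),
}
places = {"London": np.array([0.0, 0.1, 0.9]), "Tokyo": np.array([0.0, 0.9, 0.1])}

# "A book with a similar plot, but set in Tokyo instead of London":
query = plots["Sea adventure set in London"] - places["London"] + places["Tokyo"]
print(nearest(query, plots))  # -> "Sea adventure set in Tokyo"
```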
-
Pietrangelo, A., Simonini, G., Bergamaschi, S., Naumann, F., Koumarelas, I.: Towards Progressive Search-driven Entity Resolution. SEBD (2018).
Keyword-search systems for databases aim to answer a user query composed of a few terms with a ranked list of records. They are powerful and easy-to-use data exploration tools for a wide range of contexts. For instance, given a product database gathered by scraping e-commerce websites, these systems enable even non-technical users to explore the item set (e.g., to check whether it contains certain products or not, or to discover the price of an item). However, if the database contains dirty records (i.e., incomplete and duplicated records), a preprocessing step to clean the data is required. One fundamental data cleaning step is Entity Resolution, i.e., the task of identifying and fusing together all the records that refer to the same real-world entity. This task is typically executed on the whole data, independently of: (i) the portion of the entities that a user may indicate through keywords, and (ii) the order priority that a user might express through an order by clause. This paper describes a first step to solve the problem of progressive search-driven Entity Resolution: resolving all the entities described by a user through a handful of keywords, progressively (according to an order by clause). We discuss the features of our method, named SearchER, and showcase some examples of keyword queries on two real-world datasets obtained with a demonstrative prototype that we have built.
-
Loster, M., Repke, T., Krestel, R., Naumann, F., Ehmueller, J., Feldmann, B., Maspfuhl, O.: The Challenges of Creating, Maintaining and Exploring Graphs of Financial Entities. Proceedings of the Fourth International Workshop on Data Science for Macro-Modeling (DSMM 2018). ACM (2018).
The integration of a wide range of structured and unstructured information sources into a uniformly integrated knowledge base is an important task in the financial sector. As an example, modern risk analysis methods can benefit greatly from an integrated knowledge base, building in particular a dedicated, domain-specific knowledge graph. Knowledge graphs can be used to gain a holistic view of the current economic situation so that systemic risks can be identified early enough to react appropriately. The use of this graphical structure thus allows the investigation of many financial scenarios, such as the impact of corporate bankruptcy on other market participants within the network. In this particular scenario, the links between the individual market participants can be used to determine which companies are affected by a bankruptcy and to what extent. We took these considerations as a motivation to start the development of a system capable of constructing and maintaining a knowledge graph of financial entities and their relationships. The envisioned system generates this particular graph by extracting and combining information from both structured data sources such as Wikidata and DBpedia, as well as from unstructured data sources such as newspaper articles and financial filings. In addition, the system should incorporate proprietary data sources, such as financial transactions (structured) and credit reports (unstructured). The ultimate goal is to create a system that recognizes financial entities in structured and unstructured sources, links them with the information of a knowledge base, and then extracts the relations expressed in the text between the identified entities. The constructed knowledge base can be used to construct the desired knowledge graph. Our system design consists of several components, each of which addresses a specific subproblem. To this end, Figure 1 gives a general overview of our system and its subcomponents.
-
Loster, M., Hegner, M., Naumann, F., Leser, U.: Dissecting Company Names using Sequence Labeling. Proceedings of the Conference "Lernen, Wissen, Daten, Analysen". pp. 227-238 (2018).
Understanding the inherent structure of company names by identifying their constituent parts yields valuable insights that can be leveraged by other tasks, such as named entity recognition, data cleansing, or deduplication. Unfortunately, segmenting company names poses a hard problem due to their high structural heterogeneity. Besides obvious elements, such as the core name or legal form, company names often contain additional elements, such as personal and location names, abbreviations, and other unexpected elements. While others have addressed the segmentation of person names, we are the first to address the segmentation of the more complex company names. We present a solution to the problem of automatically labeling the constituent name parts and their semantic role within German company names. To this end we propose and evaluate a collection of novel features used with a conditional random field classifier. In identifying the constituent parts of company names we achieve an accuracy of 84%, while classifying the colloquial names resulted in an F1 measure of 88%.
-
Loster, M., Naumann, F., Ehmueller, J., Feldmann, B.: CurEx: A System for Extracting, Curating, and Exploring Domain-Specific Knowledge Graphs from Text. Proceedings of the ACM International Conference on Information and Knowledge Management. pp. 1883-1886. ACM (2018).
The integration of diverse structured and unstructured information sources into a unified, domain-specific knowledge base is an important task in many areas. A well-maintained knowledge base enables data analysis in complex scenarios, such as risk analysis in the financial sector or investigating large data leaks, such as the Paradise or Panama papers. Both the creation of such knowledge bases and their continuous maintenance and curation involve many complex tasks and considerable manual effort. With CurEx, we present a modular system that allows structured and unstructured data sources to be integrated into a domain-specific knowledge base. In particular, we (i) enable the incremental improvement of each individual integration component; (ii) enable the selective generation of multiple knowledge graphs from the information contained in the knowledge base; and (iii) provide two distinct user interfaces tailored to the needs of data engineers and end-users respectively. The former has curation capabilities and controls the integration process, whereas the latter focuses on the exploration of the generated knowledge graph.
-
Bunk, S., Krestel, R.: WELDA: Enhancing Topic Models by Incorporating Local Word Contexts. Joint Conference on Digital Libraries (JCDL 2018). ACM, Fort Worth, Texas, USA (2018).
The distributional hypothesis states that similar words tend to have similar contexts in which they occur. Word embedding models exploit this hypothesis by learning word vectors based on the local context of words. Probabilistic topic models on the other hand utilize word co-occurrences across documents to identify topically related words. Due to their complementary nature, these models define different notions of word similarity, which, when combined, can produce better topical representations. In this paper we propose WELDA, a new type of topic model, which combines word embeddings (WE) with latent Dirichlet allocation (LDA) to improve topic quality. We achieve this by estimating topic distributions in the word embedding space and exchanging selected topic words via Gibbs sampling from this space. We present an extensive evaluation showing that WELDA cuts runtime by at least 30% while outperforming other combined approaches with respect to topic coherence and for solving word intrusion tasks.
-
Exeler, C., Graber, M., Junge, T., Ramson, S., Ramson, C., Tschirschnitz, F., Naumann, F.: Piggyback Profiling: Enhancing Query Results with Metadata. Lernen. Wissen. Daten. Analysen. (LWDA) (2018).
-
Risch, J., Krestel, R.: Delete or not Delete? Semi-Automatic Comment Moderation for the Newsroom. Proceedings of the First Workshop on Trolling, Aggression and Cyberbullying (co-located with COLING). pp. 166-176 (2018).
Comment sections of online news providers have enabled millions to share and discuss their opinions on news topics. Today, moderators ensure respectful and informative discussions by deleting not only insults, defamation, and hate speech, but also unverifiable facts. This process has to be transparent and comprehensive in order to keep the community engaged. Further, news providers have to make sure to not give the impression of censorship or dissemination of fake news. Yet manual moderation is very expensive and becomes more and more unfeasible with the increasing amount of comments. Hence, we propose a semi-automatic, holistic approach, which includes comment features but also their context, such as information about users and articles. For evaluation, we present experiments on a novel corpus of 3 million news comments annotated by a team of professional moderators.
-
Risch, J., Krestel, R.: Aggression Identification Using Deep Learning and Data Augmentation. Proceedings of the First Workshop on Trolling, Aggression and Cyberbullying (co-located with COLING). pp. 150-158 (2018).
Social media platforms allow users to share and discuss their opinions online. However, a minority of user posts is aggressive, thereby hindering respectful discussion, and, at an extreme level, is liable to prosecution. The automatic identification of such harmful posts is important, because it can support the costly manual moderation of online discussions. Further, the automation allows unprecedented analyses of discussion datasets that contain millions of posts. This system description paper presents our submission to the First Shared Task on Aggression Identification. We propose to augment the provided dataset to increase the number of labeled comments from 15,000 to 60,000. Thereby, we introduce linguistic variety into the dataset. As a consequence of the larger amount of training data, we are able to train a special deep neural net, which generalizes especially well to unseen data. To further boost the performance, we combine this neural net with three logistic regression classifiers trained on character and word n-grams, and hand-picked syntactic features. This ensemble is more robust than the individual single models. Our team named “Julian” achieves an F1-score of 60% on both English datasets, 63% on the Hindi Facebook dataset, and 38% on the Hindi Twitter dataset.
-
Repke, T., Krestel, R.: Bringing Back Structure to Free Text Email Conversations with Recurrent Neural Networks. 40th European Conference on Information Retrieval (ECIR 2018). Springer, Grenoble, France (2018).
Email communication plays an integral part in everybody's life nowadays. Especially for business emails, extracting and analysing these communication networks can reveal interesting patterns of processes and decision making within a company. Fraud detection is another application area where precise detection of communication networks is essential. In this paper we present an approach based on recurrent neural networks to untangle email threads originating from forward and reply behaviour. We further classify parts of emails into 2 or 5 zones to capture not only header and body information but also greetings and signatures. We show that our deep learning approach outperforms state-of-the-art systems based on traditional machine learning and hand-crafted rules. Besides using the well-known Enron email corpus for our experiments, we additionally created a new annotated email benchmark corpus from Apache mailing lists.
-
Berti-Equille, L., Harmouch, H., Naumann, F., Novelli, N., Thirumuruganathan, S.: Discovery of Genuine Functional Dependencies from Relational Data with Missing Values. Proceedings of the VLDB Endowment (PVLDB). pp. 880-892 (2018).
Functional dependencies (FDs) play an important role in maintaining data quality. They can be used to enforce data consistency and to guide repairs over a database. In this work, we investigate the problem of missing values and its impact on FD discovery. When using existing FD discovery algorithms, some genuine FDs could not be detected precisely due to missing values or some non-genuine FDs can be discovered even though they are caused by missing values with a certain NULL semantics. We define a notion of genuineness and propose algorithms to compute the genuineness score of a discovered FD. This can be used to identify the genuine FDs among the set of all valid dependencies that hold on the data. We evaluate the quality of our method over various real-world and semi-synthetic datasets with extensive experiments. The results show that our method performs well for relatively large FD sets and is able to accurately capture genuine FDs.
-
Risch, J., Krestel, R.: My Approach = Your Apparatus? Entropy-Based Topic Modeling on Multiple Domain-Specific Text Collections. Proceedings of the 18th ACM/IEEE Joint Conference on Digital Libraries (JCDL). pp. 283-292 (2018).
Comparative text mining extends from genre analysis and political bias detection to the revelation of cultural and geographic differences, through to the search for prior art across patents and scientific papers. These applications use cross-collection topic modeling for the exploration, clustering, and comparison of large sets of documents, such as digital libraries. However, topic modeling on documents from different collections is challenging because of domain-specific vocabulary. We present a cross-collection topic model combined with automatic domain term extraction and phrase segmentation. This model distinguishes collection-specific and collection-independent words based on information entropy and reveals commonalities and differences of multiple text collections. We evaluate our model on patents, scientific papers, newspaper articles, forum posts, and Wikipedia articles. In comparison to state-of-the-art cross-collection topic modeling, our model achieves up to 13% higher topic coherence, up to 4% lower perplexity, and up to 31% higher document classification accuracy. More importantly, our approach is the first topic model that ensures disjunct general and specific word distributions, resulting in clear-cut topic representations.
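The entropy criterion can be illustrated with a small sketch using hypothetical counts: a word spread evenly across collections has high entropy and is treated as collection-independent, whereas a word concentrated in one collection is collection-specific.
```python
# Sketch: entropy of a word's distribution over collections (illustrative only).
import math

def entropy(counts):
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

# Occurrences of two words in (patents, papers, news), hypothetical counts.
print(entropy([100, 95, 105]))  # ~1.58 bits: evenly spread, collection-independent
print(entropy([120, 3, 2]))     # ~0.28 bits: concentrated, collection-specific
```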
-
Repke, T., Krestel, R.: Topic-aware Network Visualisation to Explore Large Email Corpora. International Workshop on Big Data Visual Exploration and Analytics (BigVis). CEUR-WS.org (2018).
Nowadays, more and more large datasets exhibit an intrinsic graph structure. While there exist special graph databases to handle ever increasing amounts of nodes and edges, visualising this data becomes infeasible quickly with growing data. In addition, looking at its structure is not sufficient to get an overview of a graph dataset. Indeed, visualising additional information about nodes or edges without cluttering the screen is essential. In this paper, we propose an interactive visualisation for social networks that positions individuals (nodes) on a two-dimensional canvas such that communities defined by social links (edges) are easily recognisable. Furthermore, we visualise topical relatedness between individuals by analysing information about social links, in our case email communication. To this end, we utilise document embeddings, which project the content of an email message into a high dimensional semantic space and graph embeddings, which project nodes in a network graph into a latent space reflecting their relatedness.
-
Lazaridou, K., Gruetze, T., Naumann, F.: Where in the World Is Carmen Sandiego? Detecting Person Locations via Social Media Discussions. Proceedings of the ACM Conference on Web Science. ACM (2018).
In today's social media, news often spread faster than in mainstream media, along with additional context and aspects about the current affairs. Consequently, users in social networks are up-to-date with the details of real-world events and the involved individuals. Examples include crime scenes and potential perpetrator descriptions, public gatherings with rumors about celebrities among the guests, rallies by prominent politicians, concerts by musicians, etc. We are interested in the problem of tracking persons mentioned in social media, namely detecting the locations of individuals by leveraging the online discussions about them. Existing literature focuses on the well-known and more convenient problem of user location detection in social media, mainly as the location discovery of the user profiles and their messages. In contrast, we track individuals with text mining techniques, regardless whether they hold a social network account or not. We observe what the community shares about them and estimate their locations. Our approach consists of two steps: firstly, we introduce a noise filter that prunes irrelevant posts using a recursive partitioning technique. Secondly, we build a model that reasons over the set of messages about an individual and determines his/her locations. In our experiments, we successfully trace the last U.S. presidential candidates through millions of tweets published from November 2015 until January 2017. Our results outperform previously introduced techniques and various baselines.
-
Ambroselli, C., Risch, J., Krestel, R., Loos, A.: Prediction for the Newsroom: Which Articles Will Get the Most Comments? Proceedings of the 16th Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL). pp. 193-199. ACL, New Orleans, Louisiana, USA (2018).
The overwhelming success of the Web and mobile technologies has enabled millions to share their opinions publicly at any time. But the same success also endangers this freedom of speech due to closing down of participatory sites misused by individuals or interest groups. We propose to support manual moderation by proactively drawing the attention of our moderators to article discussions that most likely need their intervention. To this end, we predict which articles will receive a high number of comments. In contrast to existing work, we enrich the article with metadata, extract semantic and linguistic features, and exploit annotated data from a foreign language corpus. Our logistic regression model improves F1-scores by over 80% in comparison to state-of-the-art approaches.
-
Abedjan, Z., Golab, L., Naumann, F., Papenbrock, T.: Data Profiling. Morgan & Claypool Publishers (2018).
-
Sadiq, S., Dasu, T., Dong, X.L., Freire, J., Ilyas, I.F., Link, S., Miller, R.J., Naumann, F., Zhou, X., Srivastava, D.: Data Quality – The Role of Empiricism. SIGMOD Record. 46, 35-43 (2018).
-
Agrawal, D., Chawla, S., Kaoudi, Z., Kruse, S., Quiané-Ruiz, J.A., Contreras-Rojas, B., Elmagarmid, A., Idris, Y., Lucas, J., Mansour, E., Ouzzani, M., Papotti, P., Tang, N., Thirumuruganathan, S., Troudi, A.: RHEEM: Enabling Cross-Platform Data Processing - May The Big Data Be With You! -. Proceedings of the VLDB Endowment (PVLDB). 11, (2018).
Solving business problems increasingly requires going beyond the limits of a single data processing platform (platform for short), such as Hadoop or a DBMS. As a result, organizations typically perform tedious and costly tasks to juggle their code and data across different platforms. Addressing this pain and achieving automatic cross-platform data processing is quite challenging: finding the most efficient platform for a given task requires quite good expertise for all the available platforms. We present Rheem, a general-purpose cross-platform data processing system that decouples applications from the underlying platforms. It not only determines the best platform to run an incoming task, but also splits the task into subtasks and assigns each subtask to a specific platform to minimize the overall cost (e.g., runtime or monetary cost). It features (i) a robust interface to easily compose data analytic tasks; (ii) a novel cost-based optimizer able to find the most efficient platform in almost all cases; and (iii) an executor to efficiently orchestrate tasks over different platforms. As a result, it allows users to focus on the business logic of their applications rather than on the mechanics of how to compose and execute them. Using different real-world applications with Rheem, we demonstrate how cross-platform data processing can accelerate performance by more than one order of magnitude compared to single-platform data processing.
-
Koumarelas, I., Kroschk, A., Mosley, C., Naumann, F.: Experience: Enhancing Address Matching with Geocoding and Similarity Measure Selection. J. Data and Information Quality. 10, 8:1-8:16 (2018).
Given a query record, record matching is the problem of finding database records that represent the same real-world object. In the easiest scenario, a database record is completely identical to the query. However, in most cases, problems do arise, for instance, as a result of data errors or data integrated from multiple sources or received from restrictive form fields. These problems are usually difficult, because they require a variety of actions, including field segmentation, decoding of values, and similarity comparisons, each requiring some domain knowledge. In this article, we study the problem of matching records that contain address information, including attributes such as Street-address and City. To facilitate this matching process, we propose a domain-specific procedure to, first, enrich each record with a more complete representation of the address information through geocoding and reverse-geocoding and, second, to select the best similarity measure for each address attribute, which finally helps the classifier achieve the best F-measure. We report on our experience in selecting geocoding services and discovering similarity measures for a concrete but common industry use-case.
-
Bleifuß, T., Bornemann, L., Johnson, T., Kalashnikov, D.V., Naumann, F., Srivastava, D.: Exploring Change - A New Dimension of Data Analytics. Proceedings of the VLDB Endowment (PVLDB). 12, 85-98 (2018).
Data and metadata in datasets experience many different kinds of change. Values are inserted, deleted or updated; rows appear and disappear; columns are added or repurposed, etc. In such a dynamic situation, users might have many questions related to changes in the dataset, for instance which parts of the data are trustworthy and which are not? Users will wonder: How many changes have there been in the recent minutes, days or years? What kind of changes were made at which points of time? How dirty is the data? Is data cleansing required? The fact that data changed can hint at different hidden processes or agendas: a frequently crowd-updated city name may be controversial; a person whose name has been recently changed may be the target of vandalism; and so on. We show various use cases that benefit from recognizing and exploring such change. We envision a system and methods to interactively explore such change, addressing the variability dimension of big data challenges. To this end, we propose a model to capture change and the process of exploring dynamic data to identify salient changes. We provide exploration primitives along with motivational examples and measures for the volatility of data. We identify technical challenges that need to be addressed to make our vision a reality, and propose directions of future work for the data management community.
-
Kruse, S., Naumann, F.: Efficient Discovery of Approximate Dependencies. Proceedings of the VLDB Endowment. 11, 759-772 (2018).
See abstract for errata
Functional dependencies (FDs) and unique column combinations (UCCs) form a valuable ingredient for many data management tasks, such as data cleaning, schema recovery, and query optimization. Because these dependencies are unknown in most scenarios, their automatic discovery has been well researched. However, existing methods mostly discover only exact dependencies, i.e., those without violations. Real-world dependencies, in contrast, are frequently approximate due to data exceptions, ambiguities, or data errors. This relaxation to approximate dependencies renders their discovery an even harder task than the already challenging exact dependency discovery. To this end, we propose the novel and highly efficient algorithm Pyro to discover both approximate FDs and approximate UCCs. Pyro combines a separate-and-conquer search strategy with sampling-based guidance that quickly detects dependency candidates and verifies them. In our broad experimental evaluation, Pyro outperforms existing discovery algorithms by a factor of up to 33, scales to larger datasets, and at the same time requires the least main memory.

Errata / Corrigendum for "Efficient Discovery of Approximate Dependencies", Sebastian Kruse and Felix Naumann, Proceedings of the VLDB Endowment 11 (7), 759-772. Readers of the paper have pointed out a few minor errors, which we document here to ease the understanding of the algorithm.
Erratum 1) In Section 5.1, the PLI for "Last name" should read {{1, 4}, {3, 5}}.
Erratum 2) In Section 5.3, Example 4, the tuple pairs (t1, t3), (t1, t5), and (t2, t3) should yield the agree set sample AS = {({}, 1), ({First_name, Town}, 1), ({ZIP}, 1)}.
Erratum 3) In Section 5.3, the example AUCC error of the attribute combination A1...An should be 0.0099.
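As background for the erratum on position list indexes (PLIs): a PLI groups the row positions of each value and keeps only clusters of size greater than one. The sketch below is illustrative (the column values are made up so that the clusters match Erratum 1); Pyro's actual data structures are more involved.
```python
# Background sketch: building a position list index (PLI) for one column.
# Illustrative only; the column values are made up so the clusters match Erratum 1.
from collections import defaultdict

def pli(column):
    positions = defaultdict(list)
    for row_id, value in enumerate(column, start=1):
        positions[value].append(row_id)
    # Keep only clusters with more than one row; singleton clusters carry no evidence.
    return [cluster for cluster in positions.values() if len(cluster) > 1]

last_names = ["Miller", "Smith", "Jones", "Miller", "Jones"]
print(pli(last_names))  # [[1, 4], [3, 5]]
```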
-
Bornemann, L., Bleifuß, T., Kalashnikov, D., Naumann, F., Srivastava, D.: Data Change Exploration using Time Series Clustering. Datenbank-Spektrum. 18, 1-9 (2018).
Analysis of static data is one of the best studied research areas. However, data changes over time. These changes may reveal patterns or groups of similar values, properties, and entities. We study changes in large, publicly available data repositories by modelling them as time series and clustering these series by their similarity. In order to perform change exploration on real-world data we use the publicly available revision data of Wikipedia Infoboxes and weekly snapshots of IMDB. The changes to the data are captured as events, which we call change records. In order to extract temporal behavior we count changes in time periods and propose a general transformation framework that aggregates groups of changes to numerical time series of different resolutions. We use these time series to study different application scenarios of unsupervised clustering. Our explorative results show that changes made to collaboratively edited data sources can help find characteristic behavior, distinguish entities or properties and provide insight into the respective domains.
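A minimal sketch of the aggregation and clustering steps on hypothetical change records: count changes per entity and week to obtain time series, then assign each normalized series to the nearer of two seed centroids (a crude stand-in for a proper clustering algorithm).
```python
# Sketch: aggregate change records into weekly time series and group them (made-up data).
import numpy as np

changes = [("Berlin", 0), ("Berlin", 0), ("Berlin", 2), ("Paris", 0),
           ("Paris", 2), ("Movie_X", 1), ("Movie_X", 1), ("Movie_X", 1)]
weeks = 3

series = {}
for entity, week in changes:
    series.setdefault(entity, np.zeros(weeks))[week] += 1

# Normalize each series and assign it to the nearer of two seed centroids
# (a crude stand-in for a proper clustering algorithm such as k-means).
vectors = {entity: s / s.sum() for entity, s in series.items()}
centroid_a, centroid_b = vectors["Berlin"], vectors["Movie_X"]
for entity, v in vectors.items():
    cluster = "A" if np.linalg.norm(v - centroid_a) <= np.linalg.norm(v - centroid_b) else "B"
    print(entity, v, "->", cluster)  # Berlin and Paris land in A, Movie_X in B
```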
-
Kruse, S., Hahn, D., Walter, M., Naumann, F.: Metacrate: Organize and Analyze Millions of Data Profiles. Proceedings of the International Conference on Information and Knowledge Management (CIKM). pp. 2483-2486. ACM (2017).
-
Lazaridou, K., Krestel, R., Naumann, F.: Identifying Media Bias by Analyzing Reported Speech. International Conference on Data Mining. IEEE (2017).
Media analysis can reveal interesting patterns in the way newspapers report the news and how these patterns evolve over time. One example pattern is the quoting choices that media make, which could be used as bias indicators. Media slant can be expressed both with the choice of reporting an event, e.g. a person's statement, but also with the words used to describe the event. Thus, automatic discovery of systematic quoting patterns in the news could illustrate to the readers the media's beliefs, such as political preferences. In this paper, we aim to discover political media bias by demonstrating systematic patterns of reporting speech in two major British newspapers. To this end, we analyze news articles from 2000 to 2015. By taking into account different kinds of bias, such as selection, coverage and framing bias, we show that the quoting patterns of newspapers are predictable.
-
Maschler, F., Niephaus, F., Risch, J.: Real or Fake? Large-Scale Validation of Identity Leaks. 47. Jahrestagung der Gesellschaft für Informatik (INFORMATIK). pp. 2437-2448 (2017).
On the Internet, criminal hackers frequently leak identity data on a massive scale. Subsequent criminal activities, such as identity theft and misuse, put Internet users at risk. Leak checker services enable users to check whether their personal data has been made public. However, automatic crawling and identification of leak data is error-prone for different reasons. Based on a dataset of more than 180 million leaked identity records, we propose a software system that identifies and validates identity leaks to improve leak checker services. Furthermore, we present a proficient assessment of leak data quality and typical characteristics that distinguish valid and invalid leaks.
-
Zuo, Z., Loster, M., Krestel, R., Naumann, F.: Uncovering Business Relationships: Context-sensitive Relationship Extraction for Difficult Relationship Types. Proceedings of the Conference "Lernen, Wissen, Daten, Analysen" (LWDA) (2017).
This paper establishes a semi-supervised strategy for extracting various types of complex business relationships from textual data by using only a few manually provided company seed pairs that exemplify the target relationship. Additionally, we offer a solution for determining the direction of asymmetric relationships, such as “ownership of”. We improve the reliability of the extraction process by using a holistic pattern identification method that classifies the generated extraction patterns. Our experiments show that we can accurately and reliably extract new entity pairs occurring in the target relationship by using as few as five labeled seed pairs.
-
Harmouch, H., Naumann, F.: Cardinality Estimation: An Experimental Survey. Proceedings of the VLDB Endowment (PVLDB). pp. 499-512 (2017).
Data preparation and data profiling comprise many basic and complex tasks to analyze a dataset at hand and extract metadata, such as data distributions, key candidates, and functional dependencies. Among the most important types of metadata is the number of distinct values in a column, also known as the zeroth-frequency moment. Cardinality estimation itself has been an active research topic in the past decades due to its many applications. The aim of this paper is to review the literature of cardinality estimation and to present a detailed experimental study of twelve algorithms, scaling far beyond the original experiments. First, we outline and classify approaches to solve the problem of cardinality estimation: we describe their main ideas, error guarantees, advantages, and disadvantages. Our experimental survey then compares the performance of all twelve cardinality estimation algorithms. We evaluate the algorithms' accuracy, runtime, and memory consumption using synthetic and real-world datasets. Our results show that different algorithms excel in different categories, and we highlight their trade-offs.
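As background, one classical sketch-based estimator can be written in a few lines: the K-Minimum-Values (KMV) estimator, shown below in simplified form (illustrative only; not one of the twelve implementations benchmarked in the survey, and without the bounded-memory bookkeeping a real sketch would use).
```python
# Background sketch: K-Minimum-Values (KMV) cardinality estimator.
# Illustrative only; a real sketch keeps just the k smallest hashes instead of hashing
# everything into memory.
import hashlib

def kmv_estimate(values, k=256):
    def h(v):  # hash a value to a float in (0, 1]
        digest = hashlib.md5(str(v).encode()).hexdigest()
        return (int(digest, 16) + 1) / 2**128
    hashes = sorted({h(v) for v in values})
    if len(hashes) <= k:  # fewer than k distinct values: return the exact count
        return len(hashes)
    return int((k - 1) / hashes[k - 1])  # the k-th smallest hash determines the estimate

data = [i % 10_000 for i in range(200_000)]  # 10,000 distinct values
print(kmv_estimate(data))  # roughly 10,000
```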
-
Papenbrock, T., Naumann, F.: A Hybrid Approach for Efficient Unique Column Combination Discovery. Proceedings of the conference on Database Systems for Business, Technology, and Web (BTW). pp. 195-204 (2017).
Unique column combinations (UCCs) are groups of attributes in relational datasets that contain no value-entry more than once. Hence, they indicate keys and serve data management tasks, such as schema normalization, data integration, and data cleansing. Because the unique column combinations of a particular dataset are usually unknown, UCC discovery algorithms have been proposed to find them. All previous such discovery algorithms are, however, inapplicable to datasets of typical real-world size, e.g., datasets with more than 50 attributes and a million records. We present the hybrid discovery algorithm HyUCC, which uses the same discovery techniques as the recently proposed functional dependency discovery algorithm HyFD: A hybrid combination of fast approximation techniques and efficient validation techniques. With it, the algorithm discovers all minimal unique column combinations in a given dataset. HyUCC does not only outperform all existing approaches, it also scales to much larger datasets.
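The validation side of UCC discovery boils down to checking whether a column combination contains a duplicate value combination; a naive check (not HyUCC's optimized validation) looks like this:
```python
# Naive uniqueness check for a column combination (not HyUCC's optimized validation).
def is_ucc(rows, columns):
    seen = set()
    for row in rows:
        key = tuple(row[c] for c in columns)
        if key in seen:
            return False  # the value combination occurs more than once
        seen.add(key)
    return True

rows = [
    {"first": "Ada", "last": "Lovelace", "city": "London"},
    {"first": "Ada", "last": "Byron", "city": "London"},
    {"first": "Alan", "last": "Turing", "city": "London"},
]
print(is_ucc(rows, ["first"]))          # False: "Ada" appears twice
print(is_ucc(rows, ["first", "last"]))  # True: the combination is unique
```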
-
Abedjan, Z., Golab, L., Naumann, F.: Data Profiling (tutorial). Proceedings of the International Conference on Management of Data (SIGMOD) (2017).
-
Krestel, R., Risch, J.: How Do Search Engines Work? A Massive Open Online Course with 4000 Participants. Proceedings of the Conference Lernen, Wissen, Daten, Analysen. pp. 259-271 (2017).
Massive Open Online Courses (MOOCs) have introduced a new form of education. With thousands of participants per course, lecturers are confronted with new challenges in the teaching process. In this paper, we describe how we conducted an introductory information retrieval course for participants from all ages and educational backgrounds. We analyze different course phases and compare our experiences with regular on-site information retrieval courses at university.
-
Repke, T., Loster, M., Krestel, R.: Comparing Features for Ranking Relationships Between Financial Entities Based on Text. Proceedings of the 3rd International Workshop on Data Science for Macro--Modeling with Financial and Economic Datasets. pp. 12:1-12:2. ACM, New York, NY, USA (2017).
Evaluating the credibility of a company is an important and complex task for financial experts. When estimating the risk associated with a potential asset, analysts rely on large amounts of data from a variety of different sources, such as newspapers, stock market trends, and bank statements. Finding relevant information, such as relationships between financial entities, in mostly unstructured data is a tedious task and examining all sources by hand quickly becomes infeasible. In this paper, we propose an approach to rank extracted relationships based on text snippets, such that important information can be displayed more prominently. Our experiments with different numerical representations of text have shown that an ensemble of methods performs best on the labelled data provided for the FEIII Challenge 2017.
-
Risch, J., Krestel, R.: What Should I Cite? Cross-Collection Reference Recommendation of Patents and Papers. Proceedings of the International Conference on Theory and Practice of Digital Libraries (TPDL). pp. 40-46 (2017).
Research results manifest in large corpora of patents and scientific papers. However, both corpora lack a consistent taxonomy and references across different document types are sparse. Therefore, and because of contrastive, domain-specific language, recommending similar papers for a given patent (or vice versa) is challenging. We propose a hybrid recommender system that leverages topic distributions and key terms to recommend related work despite these challenges. As a case study, we evaluate our approach on patents and papers of two fields: medical and computer science. We find that topic-based recommenders complement term-based recommenders for documents with collection-specific language and increase mean average precision by up to 23%. As a result of our work, publications from both corpora form a joint digital library, which connects academia and industry.
-
Loster, M., Zuo, Z., Naumann, F., Maspfuhl, O., Thomas, D.: Improving Company Recognition from Unstructured Text by using Dictionaries. Proceedings of the International Conference on Extending Database Technology. pp. 610-619 (2017).
While named entity recognition is a much addressed research topic, recognizing companies in text is of particular difficulty. Company names are extremely heterogeneous in structure: a given company can be referenced in many different ways, and their names include person names, locations, acronyms, numbers, and other unusual tokens. Further, instead of using the official company name, quite different colloquial names are frequently used by the general public. We present a machine learning (CRF) system that reliably recognizes organizations in German texts. In particular, we construct and employ various dictionaries, regular expressions, text context, and other techniques to improve the results. In our experiments we achieved a precision of 91.11% and a recall of 78.82%, showing significant improvement over related work. Using our system we were able to extract 263,846 company mentions from a corpus of 141,970 newspaper articles.
-
Gruetze, T., Krestel, R., Lazaridou, K., Naumann, F.: What was Hillary Clinton doing in Katy, Texas? Proceedings of the 26th International Conference on World Wide Web, WWW 2017, Perth, Australia, 3-7 April, 2017. ACM (2017).
During the last presidential election in the United States of America, Twitter drew a lot of attention. This is because many leading persons and organizations, such as U.S. president Donald J. Trump, showed a strong affection to this medium. In this work we neglect the political contents and opinions shared on Twitter and focus on the question: Can we determine and track the physical location of the presidential candidates based on posts in the Twittersphere?
-
Kruse, S., Papenbrock, T., Dullweber, C., Finke, M., Hegner, M., Zabel, M., Zöllner, C., Naumann, F.: Fast Approximate Discovery of Inclusion Dependencies. Proceedings of the conference on Database Systems for Business, Technology, and Web (BTW). pp. 207-226 (2017).
-
Bleifuß, T., Johnson, T., Kalashnikov, D.V., Naumann, F., Shkapenyuk, V., Srivastava, D.: Enabling Change Exploration (Vision). Proceedings of the Fourth International Workshop on Exploratory Search in Databases and the Web (ExploreDB). pp. 1-3 (2017).
Data and metadata suffer many different kinds of change: values are inserted, deleted or updated, entities appear and disappear, properties are added or re-purposed, etc. Explicitly recognizing, exploring, and evaluating such change can alert to changes in data ingestion procedures, can help assess data quality, and can improve the general understanding of the dataset and its behavior over time. We propose a data model-independent framework to formalize such change. Our change-cube enables exploration and discovery of such changes to reveal dataset behavior over time.
-
Papenbrock, T., Naumann, F.: Data-driven Schema Normalization. Proceedings of the International Conference on Extending Database Technology (EDBT). pp. 342-353 (2017).
Ensuring Boyce-Codd Normal Form (BCNF) is the most popular way to remove redundancy and anomalies from datasets. Normalization to BCNF forces functional dependencies (FDs) into keys and foreign keys, which eliminates duplicate values and makes data constraints explicit. Despite being well researched in theory, converting the schema of an existing dataset into BCNF is still a complex, manual task, especially because the number of functional dependencies is huge and deriving keys and foreign keys is NP-hard. In this paper, we present a novel normalization algorithm called Normalize, which uses discovered functional dependencies to normalize relational datasets into BCNF. Normalize is entirely data-driven, which means that redundancy is removed only where it can be observed, and it is (semi-)automatic, which means that a user may or may not intervene in the normalization process. The algorithm introduces an efficient method for calculating the closure over sets of functional dependencies and novel features for choosing appropriate constraints. Our evaluation shows that Normalize can process millions of FDs within a few minutes and that the constraint selection techniques support the construction of meaningful relations during normalization.
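The closure computation at the heart of such key derivation follows the textbook fixpoint procedure: starting from an attribute set, repeatedly add the right-hand sides of all FDs whose left-hand sides are already covered. The Python sketch below shows this standard procedure on made-up FDs; it is not Normalize's optimized implementation.

    # Textbook attribute-set closure under a set of functional dependencies.
    def closure(attributes, fds):
        """attributes: iterable of attribute names; fds: list of (lhs_set, rhs_set)."""
        result = set(attributes)
        changed = True
        while changed:
            changed = False
            for lhs, rhs in fds:
                if lhs <= result and not rhs <= result:
                    result |= rhs  # lhs is covered, so rhs is implied
                    changed = True
        return result

    fds = [({"A"}, {"B"}), ({"B", "C"}, {"D"})]
    print(closure({"A", "C"}, fds))  # {'A', 'B', 'C', 'D'}: {A, C} determines all attributes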
-
Tschirschnitz, F., Papenbrock, T., Naumann, F.: Detecting Inclusion Dependencies on Very Many Tables. ACM Transactions on Database Systems (TODS). 42, 18:1-18:29 (2017).
-
Heller, D., Krestel, R., Ohler, U., Vingron, M., Marsico, A.: ssHMM: Extracting Intuitive Sequence-Structure Motifs from High-Throughput RNA-Binding Protein Data. Nucleic Acids Research. 45, 11004--11018 (2017).
RNA-binding proteins (RBPs) play an important role in RNA post-transcriptional regulation and recognize target RNAs via sequence-structure motifs. The extent to which RNA structure influences protein binding in the presence or absence of a sequence motif is still poorly understood. Existing RNA motif finders either take the structure of the RNA only partially into account, or employ models which are not directly interpretable as sequence-structure motifs. We developed ssHMM, an RNA motif finder based on a hidden Markov model (HMM) and Gibbs sampling which fully captures the relationship between RNA sequence and secondary structure preference of a given RBP. Compared to previous methods which output separate logos for sequence and structure, it directly produces a combined sequence-structure motif when trained on a large set of sequences. ssHMM’s model is visualized intuitively as a graph and facilitates biological interpretation. ssHMM can be used to find novel bona fide sequence-structure motifs of uncharacterized RBPs, such as the one presented here for the YY1 protein. ssHMM reaches a high motif recovery rate on synthetic data, it recovers known RBP motifs from CLIP-Seq data, and scales linearly with the input size, being considerably faster than MEMERIS and RNAcontext on large datasets while being on par with GraphProt. It is freely available on GitHub and as a Docker image.
-
Bleifuß, T., Kruse, S., Naumann, F.: Efficient Denial Constraint Discovery with Hydra. Proceedings of the VLDB Endowment (PVLDB). 11, 311-323 (2017).
Denial constraints (DCs) are a generalization of many other integrity constraints (ICs) widely used in databases, such as key constraints, functional dependencies, or order dependencies. Therefore, they can serve as a unified reasoning framework for all of these ICs and express business rules that cannot be expressed by the more restrictive IC types. The process of formulating DCs by hand is difficult, because it requires not only domain expertise but also database knowledge, and due to DCs' inherent complexity, this process is tedious and error-prone. Hence, an automatic DC discovery is highly desirable: we search for all valid denial constraints in a given database instance. However, due to the large search space, the problem of DC discovery is computationally expensive. We propose a new algorithm Hydra, which overcomes the quadratic runtime complexity in the number of tuples of state-of-the-art DC discovery methods. The new algorithm's experimentally determined runtime grows only linearly in the number of tuples. This results in a speedup by orders of magnitude, especially for datasets with a large number of tuples. Hydra can deliver results in a matter of seconds that to date took hours to compute.
-
Naumann, F., Krestel, R.: Das Fachgebiet „Informationssysteme“ am Hasso-Plattner-Institut. Datenbankspektrum. 17, 69-76 (2017).
-
Giesler, M.J., Keller, B., Repke, T., Leonhart, R., Weis, J., Muckelbauer, R., Rieckmann, N., Müller-Nordhorn, J., Lucius-Hoene, G., Holmberg, C.: Effect of a Website That Presents Patients' Experiences on Self-Efficacy and Patient Competence of Colorectal Cancer Patients: Web-Based Randomized Controlled Trial. J Med Internet Res. 19, e334 (2017).
Background: Patients often seek other patients' experiences with the disease. The Internet provides a wide range of opportunities to share and learn about other people's health and illness experiences via blogs or patient-initiated online discussion groups. There also exists a range of medical information devices that include experiential patient information. However, there are serious concerns about the use of such experiential information because narratives of others may be powerful and pervasive tools that may hinder informed decision making. The international research network DIPEx (Database of Individual Patients' Experiences) aims to provide scientifically based online information on people's experiences with health and illness to fulfill patients' needs for experiential information, while ensuring that the presented information includes a wide variety of possible experiences. Objective: The aim is to evaluate the colorectal cancer module of the German DIPEx website krankheitserfahrungen.de with regard to self-efficacy for coping with cancer and patient competence. Methods: In 2015, a Web-based randomized controlled trial was conducted using a two-group between-subjects design and repeated measures. The study sample consisted of individuals who had been diagnosed with colorectal cancer within the past 3 years or who had metastasis or recurrent disease. Outcome measures included self-efficacy for coping with cancer and patient competence. Participants were randomly assigned to either an intervention group that had immediate access to the colorectal cancer module for 2 weeks or to a waiting list control group. Outcome criteria were measured at baseline before randomization and at 2 weeks and 6 weeks. Results: The study randomized 212 persons. On average, participants were 54 (SD 11.1) years old, 58.8% (124/211) were female, and 73.6% (156/212) had read or heard stories of other patients online before entering the study, thus excluding any influence of the colorectal cancer module on krankheitserfahrungen.de. No intervention effects were found at 2 and 6 weeks after baseline. Conclusions: The results of this study do not support the hypothesis that the website studied may increase self-efficacy for coping with cancer or patient competencies such as self-regulation or managing emotional distress. Possible explanations may involve characteristics of the website itself, its use by participants, or methodological reasons. Future studies aimed at evaluating potential effects of websites providing patient experiences on the basis of methodological principles such as those of DIPEx might profit from extending the range of outcome measures, from including additional measures of website usage behavior and users' motivation, and from expanding concepts such as patient competency to include items that more directly reflect patients' perceived effects of using such a website. Trial Registration: Clinicaltrials.gov NCT02157454; https://clinicaltrials.gov/ct2/show/NCT02157454 (Archived by WebCite at http://www.webcitation.org/6syrvwXxi)
-
Krestel, R., Mottin, D., Müller, E. eds: Proceedings of the Conference "Lernen, Wissen, Daten, Analysen", Potsdam, Germany, September 12-14, 2016. CEUR-WS.org (2016).
-
Samiei, A., Koumarelas, I., Loster, M., Naumann, F.: Combination of Rule-based and Textual Similarity Approaches to Match Financial Entities. DSMM. ACM (2016).
Record linkage is a well-studied problem with many years of publication history. Nevertheless, many challenges remain to be addressed, such as the one posed by the FEIII Challenge 2016. Matching financial entities (FEs) is important for many private and governmental organizations. In this paper we describe the problem of matching such FEs across three datasets: FFIEC, LEI, and SEC.
-
Agrawal, D., Ba, L., Berti-Equille, L., Chawla, S., Elmagarmid, A., Hammady, H., Idris, Y., Kaoudi, Z., Khayyat, Z., Kruse, S., Ouzzani, M., Papotti, P., Quiané-Ruiz, J.-A., Tang, N., Zaki, M.J.: Rheem: Enabling Multi-Platform Task Execution (demo). Proceedings of the ACM SIGMOD conference (SIGMOD) (2016).
-
Ehrlich, J., Roick, M., Schulze, L., Zwiener, J., Papenbrock, T., Naumann, F.: Holistic Data Profiling: Simultaneous Discovery of Various Metadata. Proceedings of the International Conference on Extending Database Technology (EDBT). pp. 305-316. OpenProceedings.org (2016).
Data profiling is the discipline of examining an unknown dataset for its structure and statistical information. It is a preprocessing step in a wide range of applications, such as data integration, data cleansing, or query optimization. For this reason, many algorithms have been proposed for the discovery of different kinds of metadata. When analyzing a dataset, these profiling algorithms are often applied in sequence, but they do not support one another, for instance, by sharing I/O cost or pruning information. We present the holistic algorithm MUDS, which jointly discovers the three most important metadata: inclusion dependencies, unique column combinations, and functional dependencies. By sharing I/O cost and data structures across the different discovery tasks, MUDS can clearly increase the efficiency of traditional sequential data profiling. The algorithm also introduces novel inter-task pruning rules that build upon different types of metadata, e.g., unique column combinations to infer functional dependencies. We evaluate MUDS in detail and compare it against the sequential execution of state-of-the-art algorithms. A comprehensive evaluation shows that our holistic algorithm outperforms the baseline by a factor of up to 48 on datasets with favorable pruning conditions.
-
Kruse, S., Jentzsch, A., Papenbrock, T., Kaoudi, Z., Quiane-Ruiz, J.-A., Naumann, F.: RDFind: Scalable Conditional Inclusion Dependency Discovery in RDF Datasets. Proceedings of the International Conference on Management of Data (SIGMOD). pp. 953-967. ACM, New York, NY, USA (2016).
Inclusion dependencies (INDs) form an important integrity constraint on relational databases, supporting data management tasks, such as join path discovery and query optimization. Conditional inclusion dependencies (CINDs), which define including and included data in terms of conditions, allow these capabilities to be transferred to RDF data. However, CIND discovery is computationally much more complex than IND discovery, and even on small RDF datasets the number of CINDs is intractably large. To cope with both problems, we first introduce the notion of pertinent CINDs with an adjustable relevance criterion to filter and rank CINDs based on their extent and implications among each other. Second, we present RDFind, a distributed system to efficiently discover all pertinent CINDs in RDF data. RDFind employs a lazy pruning strategy to drastically reduce the CIND search space. Also, its exhaustive parallelization strategy and robust data structures make it highly scalable. In our experimental evaluation, we show that RDFind is up to 419 times faster than the state-of-the-art, while considering a more general class of CINDs. Furthermore, it is capable of processing a very large dataset of billions of triples, which was entirely infeasible before.
-
Bleifuß, T., Bülow, S., Frohnhofen, J., Risch, J., Wiese, G., Kruse, S., Papenbrock, T., Naumann, F.: Approximate Discovery of Functional Dependencies for Large Datasets. Proceedings of the International Conference on Information and Knowledge Management (CIKM). pp. 1803-1812. ACM, New York, NY, USA (2016).
Functional dependencies (FDs) are an important prerequisite for various data management tasks, such as schema normalization, query optimization, and data cleansing. However, automatic FD discovery entails an exponentially growing search and solution space, so that even today’s fastest FD discovery algorithms are limited to small datasets only, due to long runtimes and high memory consumption. To overcome this situation, we propose an approximate discovery strategy that sacrifices a possibly small amount of result correctness in return for large performance improvements. In particular, we introduce AID-FD, an algorithm that approximately discovers FDs within runtimes up to orders of magnitude faster than state-of-the-art FD discovery algorithms. We evaluate and compare our performance results with a focus on scalability in runtime and memory, and with measures for completeness, correctness, and minimality.
-
Park, J., Blume-Kohout, M., Krestel, R., Nalisnick, E., Smyth, P.: Analyzing NIH Funding Patterns over Time with Statistical Text Analysis. Scholarly Big Data: AI Perspectives, Challenges, and Ideas (SBD 2016) Workshop at AAAI 2016. AAAI (2016).
In the past few years, various government funding organizations, such as the U.S. National Institutes of Health and the U.S. National Science Foundation, have provided access to large publicly available online databases documenting the grants that they have funded over the past few decades. These databases provide an excellent opportunity for the application of statistical text analysis techniques to infer useful quantitative information about how funding patterns have changed over time. In this paper we analyze data from the National Cancer Institute (part of the National Institutes of Health) and show how text classification techniques provide a useful starting point for analyzing how funding for cancer research has evolved over the past 20 years in the United States.
-
Grundke, M., Jasper, J., Perchyk, M., Sachse, J.P., Krestel, R., Neves, M.: TextAI: Enhancing TextAE with Intelligent Annotation Support. Proceedings of the 7th International Symposium on Semantic Mining in Biomedicine (SMBM 2016). pp. 80-84. CEUR-WS.org (2016).
We present TextAI, an extension to the annotation tool TextAE that adds support for named-entity recognition and automated relation extraction based on machine learning techniques. Our learning approach is domain-independent and increases the quality of the detected relations with each added training document. We further aim at accelerating and facilitating the manual curation process for natural language documents by supporting simultaneous annotation by multiple users.
-
Godde, C., Lazaridou, K., Krestel, R.: Classification of German Newspaper Comments. Proceedings of the Conference Lernen, Wissen, Daten, Analysen. pp. 299-310. CEUR-WS.org (2016).
Online news has gradually become an inherent part of many people’s everyday lives, with the media enabling a social and interactive consumption of news as well. Readers openly express their perspectives and emotions about current events by commenting on news articles. They also form online communities and interact with each other by replying to other users’ comments. Due to their active and significant role in the diffusion of information, automatically gaining insights into the content of these comments is an interesting task. We are especially interested in finding systematic differences among the user comments from different newspapers. To this end, we propose the following classification task: Given a news comment thread of a particular article, identify the newspaper it comes from. Our corpus consists of six well-known German newspapers and their comments. We propose two experimental settings using SVM classifiers built on comment- and article-based features. We achieve a precision of up to 90% for individual newspapers.
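A minimal sketch of such a thread-level classification setting, assuming bag-of-words features and a linear SVM in scikit-learn; the toy threads, labels, and parameters are placeholders rather than the paper's actual features:

    # Illustrative newspaper classification of comment threads with an SVM.
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.pipeline import make_pipeline
    from sklearn.svm import LinearSVC

    comment_threads = [
        "Ich stimme dem Artikel voll zu ...",        # concatenated comments of one thread
        "Typisch, wieder nur halbe Wahrheiten ...",  # another thread
    ]
    newspapers = ["zeit", "spiegel"]                 # label: the newspaper the thread belongs to

    model = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LinearSVC())
    model.fit(comment_threads, newspapers)
    print(model.predict(["Der Kommentarbereich ist mal wieder voll ..."]))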
-
Jenders, M., Krestel, R., Naumann, F.: Which Answer is Best? Predicting Accepted Answers in MOOC Forums. Proceedings of the 25th International Conference Companion on World Wide Web. pp. 679-684. International World Wide Web Conferences Steering Committee (2016).
Massive Open Online Courses (MOOCs) have grown in reach and importance over the last few years, enabling a vast userbase to enroll in online courses. Besides watching videos, users participate in discussion forums to further their understanding of the course material. As in other community-based question-answering platforms, in many MOOC forums a user posting a question can mark the answer they are most satisfied with. In this paper, we present a machine learning model that predicts this accepted answer to a forum question using historical forum data.
-
Abedjan, Z., Golab, L., Naumann, F.: Data Profiling (tutorial). International Conference on Data Engineering (ICDE) (2016).
One of the crucial requirements before consuming datasets for any application is to understand the dataset at hand and its metadata. The process of metadata discovery is known as data profiling. Profiling activities range from ad-hoc approaches, such as eye-balling random subsets of the data or formulating aggregation queries, to systematic inference of structural information and statistics of a dataset using dedicated profiling tools. In this tutorial, we highlight the importance of data profiling as part of any data-related use-case, and discuss the area of data profiling by classifying data profiling tasks and reviewing the state-of-the-art data profiling systems and techniques. In particular, we discuss hard problems in data profiling, such as algorithms for dependency discovery and profiling algorithms for dynamic data and streams. We conclude with directions for future research in the area of data profiling. This tutorial is based on our survey on profiling relational data [1].
-
Samiei, A., Naumann, F.: Cluster-based Sorted Neighborhood for Efficient Duplicate Detection. International Conference on Data Mining Workshops (ICDMW) (2016).
Duplicate detection aims to find multiple and syntactically different representations of the same real-world entities in a dataset. The naive way of duplicate detection entails a quadratic number of pair-wise record comparisons to identify the duplicates, which might take hours even for an average-sized dataset. As today's databases grow very fast, different candidate-selection methods, such as sorted neighborhood, blocking, canopy clustering, and their variations, address this problem by shrinking the comparison space. The volume and velocity of data change require ever faster and more flexible methods of duplicate detection. In particular, they need dynamic indices that can be updated efficiently as new data arrives. We present a novel approach, which combines the idea of cluster-based methods with the well-known sorted neighborhood method. It carefully filters out irrelevant candidate pairs, which are less likely to yield duplicates, by pre-clustering records based not only on their proximity after sorting, but also on their similarity in selected attributes. An empirical evaluation on synthetic and real-world datasets shows that our approach improves the overall runtime over existing approaches while maintaining comparable result quality. Moreover, it uses dynamic indices, which in turn makes it useful for deduplicating streaming data.
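For reference, the classic sorted neighborhood method that this approach builds on can be sketched in a few lines: sort records by a blocking key and compare only records inside a sliding window. The key function, window size, and similarity measure below are illustrative choices, not those of the paper.

    # Classic sorted neighborhood: sort by a key, compare within a sliding window.
    from difflib import SequenceMatcher

    def sorted_neighborhood(records, key, window=3, threshold=0.85):
        ordered = sorted(records, key=key)
        candidates = []
        for i, record in enumerate(ordered):
            for j in range(i + 1, min(i + window, len(ordered))):
                a, b = str(record), str(ordered[j])
                if SequenceMatcher(None, a, b).ratio() >= threshold:
                    candidates.append((record, ordered[j]))
        return candidates

    people = [
        {"name": "Jane Doe", "city": "Berlin"},
        {"name": "Jane  Doe", "city": "Berlin"},   # near-duplicate with an extra space
        {"name": "John Smith", "city": "Potsdam"},
    ]
    print(sorted_neighborhood(people, key=lambda r: r["name"].replace(" ", "").lower()))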
-
Gruetze, T., Krestel, R., Naumann, F.: Topic Shifts in StackOverflow: Ask it like Socrates. Lecture Notes in Computer Science. p. 213--221. Springer (2016).
Community-based question-and-answer (Q&A) sites rely on well-posed and appropriately tagged questions. However, most platforms have only limited capabilities to support their users in finding the right tags. In this paper, we propose a temporal recommendation model to support users in tagging new questions and thus improve their acceptance in the community. To underline the necessity of temporal awareness of such a model, we first investigate the changes in tag usage and show different types of collective attention in StackOverflow, a community-driven Q&A website for computer programming topics. Furthermore, we examine the changes over time in the correlation between question terms and topics. Our results show that temporal awareness is indeed important for recommending tags in Q&A communities.
-
Papenbrock, T., Naumann, F.: A Hybrid Approach to Functional Dependency Discovery. Proceedings of the International Conference on Management of Data (SIGMOD). pp. 821-833. ACM, New York, NY, USA (2016).
Functional dependencies are structural metadata that can be used for schema normalization, data integration, data cleansing, and many other data management tasks. Despite their importance, the functional dependencies of a specific dataset are usually unknown and almost impossible to discover manually. For this reason, database research has proposed various algorithms for functional dependency discovery. None, however, are able to process datasets of typical real-world size, e.g., datasets with more than 50 attributes and a million records. We present a hybrid discovery algorithm called HyFD, which combines fast approximation techniques with efficient validation techniques in order to find all minimal functional dependencies in a given dataset. While operating on compact data structures, HyFD not only outperforms all existing approaches, it also scales to much larger datasets.
-
Lazaridou, K., Krestel, R.: Identifying Political Bias in News Articles. International Conference on Theory and Practice of Digital Libraries. IEEE Technical Committee on Digital Libraries (2016).
TPDL Doctoral Consortium
The political leaning of individuals, such as journalists and politicians, often shapes public opinion on several issues. In the case of online journalism, due to the numerous ongoing events, newspapers have to choose which stories to cover, emphasize, and possibly express their opinion about. These choices depict their profile and could reveal a potential bias towards a certain perspective or political position. Likewise, politicians' choice of language and the issues they broach are an indication of their beliefs and political orientation. Given the amount of user-generated text content online, such as news articles, blog posts, and politician statements, automatically analyzing this information becomes increasingly interesting in order to understand what people stand for and how they influence the general public. In this PhD thesis, we analyze UK news corpora along with parliament speeches in order to identify potential political media bias. We currently examine mentions of politicians and their quotes in news articles and how this referencing pattern evolves over time.
-
Langer, P., Naumann, F.: Efficient Order Dependency Discovery. VLDB Journal. 25, 223-241 (2016).
Order dependencies (ODs) describe a relationship of order between lists of attributes in a relational table. ODs can help to understand the semantics of datasets and the applications producing them. They have applications in the field of query optimization by suggesting query rewrites. Also, the existence of an OD in a table can provide hints on which integrity constraints are valid for the domain of the data at hand. This work is the first to describe the discovery problem for order dependencies in a principled manner by characterizing the search space, developing and proving pruning rules, and presenting the algorithm Order, which finds all order dependencies in a given table. Order traverses the lattice of permutations of attributes in a level-wise bottom-up manner. In a comprehensive evaluation we show that it is efficient even for various large datasets. Note: Szlichta et al. propose a more efficient algorithm to discover order dependencies and, in their paper, also point out flaws of our proposal: Jaroslaw Szlichta, Parke Godfrey, Lukasz Golab, Mehdi Kargar, Divesh Srivastava: Effective and Complete Discovery of Order Dependencies via Set-based Axiomatization, PVLDB 10(7), pp. 721-732, 2017, http://www.vldb.org/pvldb/vol10/p721-szlichta.pdf
-
Naumann, F., Krestel, R.: The Information Systems Group at HPI. SIGMOD Record. (2016).
-
Gruetze, T., Kasneci, G., Zuo, Z., Naumann, F.: CohEEL: Coherent and Efficient Named Entity Linking through Random Walks. Web Semantics: Science, Services and Agents on the World Wide Web. 37, 75--89 (2016).
In recent years, the ever-growing number of documents on the Web as well as in digital libraries has led to a considerable increase in valuable textual information about entities. Harvesting entity knowledge from these large text collections is a major challenge. It requires the linkage of textual mentions within the documents with their real-world entities. This process is called entity linking. Solutions to this entity linking problem have typically aimed at balancing the rate of linking correctness (precision) and the linking coverage rate (recall). While entity links in texts could be used to improve various Information Retrieval tasks, such as text summarization, document classification, or topic-based clustering, the linking precision is the decisive factor. For example, for topic-based clustering a method that produces mostly correct links would be more desirable than a high-coverage method that leads to more but also more uncertain clusters. We propose an efficient linking method that uses a random walk strategy to combine a precision-oriented and a recall-oriented classifier in such a way that a high precision is maintained, while recall is elevated to the maximum possible level without affecting precision. An evaluation on three datasets with distinct characteristics demonstrates that our approach outperforms seminal work in the area and shows higher precision and time performance than the most closely related state-of-the-art methods.
-
Kruse, S., Papenbrock, T., Harmouch, H., Naumann, F.: Data Anamnesis: Admitting Raw Data into an Organization. IEEE Data Engineering Bulletin. 39, 8-20 (2016).
Today’s internet offers a plethora of openly available datasets, bearing great potential for novel applications and research. Likewise, rich datasets slumber within organizations. However, all too often those datasets are available only as raw dumps and lack proper documentation or even a schema. Data anamnesis is the first step of any effort to work with such datasets: It determines fundamental properties regarding the datasets’ content, structure, and quality to assess their utility and to put them to use appropriately. Detecting such properties is a key concern of the research area of data profiling, which has developed several viable instruments, such as data type recognition and foreign key discovery. In this article, we perform an anamnesis of the MusicBrainz dataset, an openly available and complex discographic database. In particular, we employ data profiling methods to create data summaries and then further analyze those summaries to reverse-engineer the database schema, to understand the data semantics, and to point out tangible schema quality issues. We propose two bottom-up schema quality dimensions, namely conciseness and normality, that measure the fit of the schema with its data, in contrast to a top-down approach that compares a schema with its application requirements.
-
Kruse, S., Papenbrock, T., Naumann, F.: Scaling Out the Discovery of Inclusion Dependencies. Proceedings of the conference on Database Systems for Business, Technology, and Web (BTW). pp. 445-454 (2015).
Inclusion dependencies are among the most important database dependencies. In addition to their most prominent application – foreign key discovery – inclusion dependencies are an important input to data integration, query optimization, and schema redesign. With their discovery being a recurring data profiling task, previous research has proposed different algorithms to discover all inclusion dependencies within a given dataset. However, none of the proposed algorithms is designed to scale out, i.e., none can be distributed across multiple nodes in a computer cluster to increase performance. So on large datasets with many inclusion dependencies, these algorithms can take days to complete, even on high-performance computers. We introduce SINDY, an algorithm that efficiently discovers all unary inclusion dependencies of a given relational dataset in a distributed fashion and that is not tied to main memory requirements. We give a practical implementation of SINDY that builds upon the map-reduce-style framework Stratosphere and conduct several experiments showing that SINDY can process huge datasets several times faster than its competitors while scaling with the number of cluster nodes.
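The inversion idea underlying such unary IND discovery can be pictured on a single machine: map every value to the set of attributes it occurs in, then intersect these attribute sets per attribute. SINDY runs this as a distributed data flow on Stratosphere; the in-memory Python sketch below only illustrates the principle on toy data.

    # Unary IND candidates via value-to-attribute-set inversion (single-machine sketch).
    from collections import defaultdict

    def unary_inds(columns):
        """columns: dict mapping attribute name -> list of values."""
        value_to_attrs = defaultdict(set)
        for attr, values in columns.items():
            for v in values:
                value_to_attrs[v].add(attr)
        # For each attribute, intersect the attribute sets of all its values.
        candidates = {attr: None for attr in columns}
        for attrs in value_to_attrs.values():
            for attr in attrs:
                others = attrs - {attr}
                candidates[attr] = others if candidates[attr] is None else candidates[attr] & others
        return {(a, b) for a, deps in candidates.items() if deps for b in deps}

    columns = {"orders.customer_id": [1, 2, 3], "customers.id": [1, 2, 3, 4]}
    print(unary_inds(columns))  # {('orders.customer_id', 'customers.id')}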
-
Jentzsch, A., Mühleisen, H., Naumann, F.: Uniqueness, Density, and Keyness: Exploring Class Hierarchies. In Proceedings of the 6th International Workshop on Consuming Linked Data (COLD 2015), ISWC 2015, Bethlehem, PA, USA (2015).
-
Schmidt, D., Frohnhofen, J., Knebel, S., Meinel, F., Perchyk, M., Risch, J., Striebel, J., Wachtel, J., Baudisch, P.: Ergonomic Interaction for Touch Floors. Proceedings of the 33rd Annual ACM Conference on Human Factors in Computing Systems. pp. 3879-3888. ACM, Seoul, Republic of Korea (2015).
The main appeal of touch floors is that they are the only direct touch form factor that scales to arbitrary size, therefore allowing direct touch to scale to very large numbers of display objects. In this paper, however, we argue that the price for this benefit is bad physical ergonomics: prolonged standing, especially in combination with looking down, quickly causes fatigue and repetitive strain. We propose addressing this issue by allowing users to operate touch floors in any pose they like, including sitting and lying. To allow users to transition between poses seamlessly, we present a simple pose-aware view manager that supports users by adjusting the entire view to the new pose. We support the main assumption behind the work with a simple study that shows that several poses are indeed more ergonomic for touch floor interaction than standing. We ground the design of our view manager by analyzing which screen regions users can see and touch in each of the respective poses.
-
Jentzsch, A., Dullweber, C., Troiano, P., Naumann, F.: Exploring Linked Data Graph Structures. In Proceedings of the Posters and Demos Session, ISWC 2015, Bethlehem, PA, USA (2015).
-
Roick, M., Jenders, M., Krestel, R.: How to Stay Up-to-date on Twitter with General Keywords. Proceedings of the LWA 2015 Workshops: KDML, FGWM, IR, and FGDB. CEUR-WS.org (2015).
Microblogging platforms make it easy for users to share information through the publication of short personal messages. However, users are not only interested in sharing, but even more so in consuming information. As a result, they are confronted with new challenges when it comes to retrieving information on microblogging platforms. In this paper we present a query expansion method based on latent topics to support users interested in topical information. Similar to news aggregator sites, our approach identifies subtopics for a given query and provides the user with a quick overview of discussed topics within the microblogging platform. Using a document collection of microblog posts from Twitter as an exemplary microblogging platform, we compare the quality of search results returned by our algorithm with a baseline approach and a state-of-the-art microblog-specific query expansion method. To this end, we introduce a novel semi-supervised evaluation strategy based on expert Twitter users. In contrast to existing query expansion methods, our approach can be used to aggregate and visualize topical query results based on the calculated topic models, while achieving competitive results for traditional keyword-based search with regard to mean average precision.
-
Jenders, M., Lindhauer, T., Kasneci, G., Krestel, R., Naumann, F.: A Serendipity Model For News Recommendation. KI 2015: Advances in Artificial Intelligence - 38th Annual German Conference on AI, Dresden, Germany, September 21-25, 2015, Proceedings. pp. 111-123. Springer (2015).
Recommendation algorithms typically work by suggesting items that are similar to the ones that a user likes, or items that similar users like. We propose a content-based recommendation technique with the focus on serendipity of news recommendations. Serendipitous recommendations have the characteristic of being unexpected yet fortunate and interesting to the user, and thus might yield higher user satisfaction. In our work, we explore the concept of serendipity in the area of news articles and propose a general framework that incorporates the benefits of serendipity- and similarity-based recommendation techniques. An evaluation against other baseline recommendation models is carried out in a user study.
-
Kruse, S., Papotti, P., Naumann, F.: Estimating Data Integration and Cleaning Effort. Proceedings of the International Conference on Extending Database Technology (EDBT) (2015).
Data cleaning and data integration have been the topic of intensive research for at least the past thirty years, resulting in a multitude of specialized methods and integrated tool suites. All of them require at least some and in most cases significant human input in their configuration, during processing, and for evaluation. For managers (and for developers and scientists) it would therefore be of great value to be able to estimate the effort of cleaning and integrating some given data sets and to know the pitfalls of such an integration project in advance. This helps in deciding about an integration project using cost/benefit analysis, budgeting a team with funds and manpower, and monitoring its progress. Further, knowledge of how well a data source fits into a given data ecosystem improves source selection. We present an extensible framework for the automatic effort estimation for mapping and cleaning activities in data integration projects with multiple sources. It comprises a set of measures and methods for estimating integration complexity and ultimately effort, taking into account heterogeneities of both schemas and instances and regarding both integration and cleaning operations. Experiments on two real-world scenarios show that our proposal is two to four times more accurate than a current approach in estimating the time duration of an integration process, and provides a meaningful breakdown of the integration problems as well as the required integration activities.
-
Krestel, R., Werkmeister, T., Wiradarma, T.P., Kasneci, G.: Tweet-Recommender: Finding Relevant Tweets for News Articles. Proceedings of the 24th International World Wide Web Conference (WWW). ACM (2015).
Twitter has become a prime source for disseminating news and opinions. However, the length of tweets prohibits detailed descriptions; instead, tweets sometimes contain URLs that link to detailed news articles. In this paper, we devise generic techniques for recommending tweets for any given news article. To evaluate and compare the different techniques, we collected tens of thousands of tweets and news articles and conducted a user study on the relevance of recommendations.
-
Schubotz, T., Krestel, R.: Online Temporal Summarization of News Events. Proceedings of the IEEE/WIC/ACM International Conference on Web Intelligence and Intelligent Agent Technology (WI-IAT). pp. 679-684. IEEE Computer Society (2015).
Nowadays, an ever-increasing number of news articles is published on a daily basis. Especially after notable national and international events or disasters, news coverage rises tremendously. Temporal summarization is an approach to automatically summarize such information in a timely manner. Summaries are created incrementally with progressing time, as soon as new information is available. Given a user-defined query, we designed a temporal summarizer based on probabilistic language models and entity recognition. First, all relevant documents and sentences are extracted from a stream of news documents using BM25 scoring. Second, a general query language model is created, which is used to detect sentences typical of the query via Kullback-Leibler divergence. Based on the retrieval result, this query model is extended over time by terms appearing frequently during the particular event. Our system is evaluated on a document corpus including test data provided by the Text Retrieval Conference (TREC).
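The sentence-scoring step can be illustrated with smoothed unigram language models compared via Kullback-Leibler divergence; the toy query, sentences, and smoothing below are stand-ins, and the full system additionally relies on BM25 retrieval and entity recognition as described above.

    # Score sentences by KL divergence between a query language model and each sentence model.
    import math
    from collections import Counter

    def unigram_lm(text, vocabulary, alpha=0.1):
        counts = Counter(text.lower().split())
        total = sum(counts.values())
        return {w: (counts[w] + alpha) / (total + alpha * len(vocabulary)) for w in vocabulary}

    def kl_divergence(p, q):
        return sum(p[w] * math.log(p[w] / q[w]) for w in p)

    query = "earthquake damage rescue"
    sentences = [
        "Rescue teams assess the earthquake damage in the region.",
        "The stock market closed slightly higher on Monday.",
    ]
    vocab = set(query.lower().split()) | {w for s in sentences for w in s.lower().split()}
    query_model = unigram_lm(query, vocab)
    for s in sentences:
        print(round(kl_divergence(query_model, unigram_lm(s, vocab)), 3), s)  # lower = more typical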
-
Gruetze, T., Yao, G., Krestel, R.: Learning Temporal Tagging Behaviour. Proceedings of the 24th International Conference on World Wide Web Companion (WWW). p. 1333--1338. ACM (2015).
Social networking services, such as Facebook, Google+, and Twitter, are commonly used to share relevant Web documents with a peer group. By sharing a document with her peers, a user recommends the content for others and annotates it with a short description text. This short description yields many opportunities for text summarization and categorization. Because today’s social networking platforms are real-time media, the sharing behaviour is subject to many temporal effects, such as current events, breaking news, and trending topics. In this paper, we focus on the time-dependent hashtag usage of the Twitter community to annotate shared Web-text documents. We introduce a framework for time-dependent hashtag recommendation models and present two content-based models. Finally, we evaluate the introduced models with respect to recommendation quality on a Twitter dataset consisting of links to Web documents that were aligned with hashtags.
-
Hennig, P., Berger, P., Dullweber, C., Finke, M., Maschler, F., Risch, J., Meinel, C.: Social Media Story Telling. Proceedings of the 8th IEEE International Conference on Social Computing and Networking (SocialCom 2015). pp. 279-284, Chengdu, China (2015).
The number of documents on the web increases rapidly, and often there is an enormous information overlap between different sources covering the same topic. Since it is impractical to read through all posts regarding a subject, there is a need for summaries combining the most relevant facts. In this context, combining information from different sources in the form of stories is an important method to provide perspective, while presenting and enriching the existing content in an interesting, natural, and narrative way. Today, stories are often not available, or they have to be elaborately written and selected by journalists. Thus, we present an automated approach to create stories from multiple input documents. Furthermore, the developed framework implements strategies to visualize stories and link content to related sources of information, such as images, tweets, and encyclopedia records, ready to be explored by the reader. Our approach combines deriving a story line from a graph of interlinked sources with a story-centric multi-document summarization.
-
Rheinländer, A., Heise, A., Hueske, F., Leser, U., Naumann, F.: SOFA: An Extensible Logical Optimizer for UDF-heavy Data Flows. Information Systems. 52, 96-125 (2015).
Recent years have seen an increased interest in large-scale analytical data flows on non-relational data. These data flows are compiled into execution graphs scheduled on large compute clusters. In many novel application areas the predominant building blocks of such data flows are user-defined predicates or functions (UDFs). However, the heavy use of UDFs is not well taken into account for data flow optimization in current systems. SOFA is a novel and extensible optimizer for UDF-heavy data flows. It builds on a concise set of properties for describing the semantics of Map/Reduce-style UDFs and a small set of rewrite rules, which use these properties to find a much larger number of semantically equivalent plan rewrites than possible with traditional techniques. A salient feature of our approach is extensibility: We arrange user-defined operators and their properties into a subsumption hierarchy, which considerably eases integration and optimization of new operators. We evaluate SOFA on a selection of UDF-heavy data flows from different domains and compare its performance to three other algorithms for data flow optimization. Our experiments reveal that SOFA finds efficient plans, outperforming the best plans found by its competitors by a factor of up to six.
-
Krestel, R., Dokoohaki, N.: Diversifying Customer Review Rankings. Neural Networks. 66, 36-45 (2015).
E-commerce Web sites owe much of their popularity to consumer reviews accompanying product descriptions. On-line customers spend hours and hours going through heaps of textual reviews to decide which products to buy. At the same time, each popular product has thousands of user-generated reviews, making it impossible for a buyer to read everything. Current approaches to display reviews to users or recommend an individual review for a product are based on the recency or helpfulness of each review. In this paper, we present a framework to rank product reviews by optimizing the coverage of the ranking with respect to sentiment or aspects, or by summarizing all reviews with the top-K reviews in the ranking. To accomplish this, we make use of the assigned star rating for a product as an indicator for a review’s sentiment polarity and compare bag-of-words (language model) with topic models (latent Dirichlet allocation) as a means to represent aspects. Our evaluation on manually annotated review data from a commercial review Web site demonstrates the effectiveness of our approach, outperforming plain recency ranking by 30% and obtaining best results by combining language and topic model representations.
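The coverage-oriented ranking can be pictured as a greedy selection that always picks the review adding the most not-yet-covered aspects; the hand-assigned aspect sets below are stand-ins for the language-model or topic-model representations used in the paper.

    # Greedy top-k review selection maximizing marginal aspect coverage.
    def greedy_coverage_ranking(reviews, k):
        """reviews: list of (review_text, set_of_aspects)."""
        covered, ranking = set(), []
        remaining = list(reviews)
        for _ in range(min(k, len(remaining))):
            best = max(remaining, key=lambda r: len(r[1] - covered))
            ranking.append(best)
            covered |= best[1]
            remaining.remove(best)
        return ranking

    reviews = [
        ("Great battery, decent screen", {"battery", "screen"}),
        ("Battery lasts forever", {"battery"}),
        ("Camera is superb, screen too dim", {"camera", "screen"}),
    ]
    print(greedy_coverage_ranking(reviews, k=2))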
-
Papenbrock, T., Heise, A., Naumann, F.: Progressive Duplicate Detection. IEEE Transactions on Knowledge and Data Engineering (TKDE). 27, 1316-1329 (2015).
Duplicate detection is the process of identifying multiple representations of the same real-world entities. Today, duplicate detection methods need to process ever larger datasets in ever shorter time: maintaining the quality of a dataset becomes increasingly difficult. We present two novel, progressive duplicate detection algorithms that significantly increase the efficiency of finding duplicates if the execution time is limited: They maximize the gain of the overall process within the time available by reporting most results much earlier than traditional approaches. Comprehensive experiments show that our progressive algorithms can double the efficiency over time of traditional duplicate detection and significantly improve upon related work.
-
Papenbrock, T., Kruse, S., Quiane-Ruiz, J.-A., Naumann, F.: Divide & Conquer-based Inclusion Dependency Discovery. Proceedings of the VLDB Endowment. 8, 774-785 (2015).
The discovery of all inclusion dependencies (INDs) in a dataset is an important part of any data profiling effort. Apart from the detection of foreign key relationships, INDs can help to perform data integration, query optimization, integrity checking, or schema (re-)design. However, the detection of INDs gets harder as datasets become larger in terms of number of tuples as well as attributes. To this end, we propose BINDER, an IND detection system that is capable of detecting both unary and n-ary INDs. It is based on a divide & conquer approach, which allows it to handle very large datasets – an important property in the face of the ever-increasing size of today’s data. In contrast to most related works, we do not rely on existing database functionality nor assume that inspected datasets fit into main memory. This renders BINDER an efficient and scalable competitor. Our exhaustive experimental evaluation shows the clear superiority of BINDER over the state of the art in both unary (SPIDER) and n-ary (MIND) IND discovery. BINDER is up to 26x faster than SPIDER and more than 2500x faster than MIND.
-
Abedjan, Z., Golab, L., Naumann, F.: Profiling relational data: a survey. VLDB Journal. 24, 557-581 (2015).
Profiling data to determine metadata about a given dataset is an important and frequent activity of any IT professional and researcher and is necessary for various use-cases. It encompasses a vast array of methods to examine datasets and produce metadata. Among the simpler results are statistics, such as the number of null values and distinct values in a column, its data type, or the most frequent patterns of its data values. Metadata that are more difficult to compute involve multiple columns, namely correlations, unique column combinations, functional dependencies, and inclusion dependencies. Further techniques detect conditional properties of the dataset at hand. This survey provides a classification of data profiling tasks and comprehensively reviews the state of the art for each class. In addition, we review data profiling tools and systems from research and industry. We conclude with an outlook on the future of data profiling beyond traditional profiling tasks and beyond relational databases.
-
Papenbrock, T., Bergmann, T., Finke, M., Zwiener, J., Naumann, F.: Data Profiling with Metanome. Proceedings of the VLDB Endowment. 8, 1860-1871 (2015).
Data profiling is the discipline of discovering metadata about given datasets. The metadata itself serve a variety of use cases, such as data integration, data cleansing, or query optimization. Due to the importance of data profiling in practice, many tools have emerged that support data scientists and IT professionals in this task. These tools provide good support for profiling statistics that are easy to compute, but they are usually lacking automatic and efficient discovery of complex statistics, such as inclusion dependencies, unique column combinations, or functional dependencies. We present Metanome, an extensible profiling platform that incorporates many state-of-the-art profiling algorithms. While Metanome is able to calculate simple profiling statistics in relational data, its focus lies on the automatic discovery of complex metadata. Metanome’s goal is to provide novel profiling algorithms from research, perform comparative evaluations, and to support developers in building and testing new algorithms. In addition, Metanome is able to rank profiling results according to various metrics and to visualize the at times large metadata sets.
-
Papenbrock, T., Ehrlich, J., Marten, J., Neubert, T., Rudolph, J.-P., Schönberg, M., Zwiener, J., Naumann, F.: Functional Dependency Discovery: An Experimental Evaluation of Seven Algorithms. Proceedings of the VLDB Endowment. 8, 1082-1093 (2015).
Functional dependencies are important metadata used for schema normalization, data cleansing and many other tasks. The efficient discovery of functional dependencies in tables is a well-known challenge in database research and has seen several approaches. Because no comprehensive comparison between these algorithms exists to date, it is hard to choose the best algorithm for a given dataset. In this experimental paper, we describe, evaluate, and compare the seven most cited and most important algorithms, all solving this same problem. First, we classify the algorithms into three different categories, explaining their commonalities. We then describe all algorithms with their main ideas. The descriptions provide additional details where the original papers were ambiguous or incomplete. Our evaluation of careful re-implementations of all algorithms spans a broad test space including synthetic and real-world data. We show that all functional dependency algorithms optimize for certain data characteristics and provide hints on when to choose which algorithm. In summary, however, all current approaches scale surprisingly poorly, showing potential for future research.
-
Langer, P., Schulze, P., George, S., Kohnen, M., Metzke, T., Abedjan, Z., Kasneci, G.: Assigning Global Relevance Scores to DBpedia Facts. International Workshop on Data Engineering meets the Semantic Web (DESWeb), Chicago, IL (2014).
-
Abedjan, Z., Schulze, P., Naumann, F.: DFD: Efficient Discovery of Functional Dependencies. In Proceedings of the International Conference on Information and Knowledge Management (CIKM), Shanghai, China. pp. 949-958 (2014).
The discovery of unknown functional dependencies in a dataset is of great importance for database redesign, anomaly detection, and data cleansing applications. However, as the nature of the problem is exponential in the number of attributes, none of the existing approaches can be applied to large datasets. We present a new algorithm, DFD, for discovering all functional dependencies in a dataset following a depth-first traversal strategy of the attribute lattice that combines aggressive pruning and efficient result verification. Our approach is able to scale far beyond existing algorithms for up to 7.5 million tuples, and is up to three orders of magnitude faster than existing approaches on smaller datasets. Winner of the CIKM 2014 Best Student Paper Award.
-
Meyer, A., Pufahl, L., Batoulis, K., Kruse, S., Lindhauer, T., Stoff, T., Fahland, D., Weske, M.: Data Perspective in Process Choreographies: Modeling and Execution. 26th International Conference on Advanced Information Systems Engineering, Thessaloniki, Greece (2014).
-
Abedjan, Z., Naumann, F.: Amending RDF Entities with New Facts. Know@LOD Workshop in conjunction with ESWC, Crete, Greece (2014).
Selected for Best Workshop Paper Award.
-
Abedjan, Z., Gruetze, T., Jentzsch, A., Naumann, F.: Profiling and Mining RDF Data with ProLOD++. Proceedings of the IEEE International Conference on Data Engineering (ICDE), Demo, Chicago, IL (2014).
-
Forchhammer, B., Jentzsch, A., Naumann, F.: LODOP - Multi-Query Optimization for Linked Data Profiling Queries. In Proceedings of the International Workshop on Dataset PROFIling & fEderated Search for Linked Data (PROFILES) in conjunction with ESWC, Heraklion, Greece (2014).
Selected for Best Workshop Paper Award.
-
Rheinländer, A., Beckmann, M., Kunkel, A., Heise, A., Stoltmann, T., Leser, U.: Versatile optimization of UDF-heavy data flows with SOFA (demo). Proceedings of the SIGMOD conference. pp. 685-688 (2014).
-
Abedjan, Z., Quiané-Ruiz, J.-A., Naumann, F.: Detecting Unique Column Combinations on Dynamic Data. Proceedings of the IEEE International Conference on Data Engineering (ICDE), Chicago, IL (2014).
-
Heise, A., Kasneci, G., Naumann, F.: Estimating the Number and Sizes of Fuzzy-Duplicate Clusters. Proceedings of the Conference on Information and Knowledge Management (CIKM). pp. 959-968 (2014).
-
Zuo, Z., Kasneci, G., Gruetze, T., Naumann, F.: BEL: Bagging for Entity Linking. 25th International Conference on Computational Linguistics (COLING), Dublin, Ireland (2014).
-
Vogel, T., Naumann, F.: Semi-Supervised Consensus Clustering: Reducing Human Effort. Proceedings of the International Workshop on Data Integration and Applications (2014).
Machine-based clustering yields fuzzy results. For example, when detecting duplicates in a dataset, different tools might end up with different clusterings. Eventually, a decision needs to be made, defining which records are in the same cluster, i. e., are duplicates. Such a definitive result is called a Consensus Clustering and can be created by evaluating the clustering attempts against each other and only resolving the disagreements by human experts. Yet, there can be different consensus clusterings, depending on the choice of disagreements presented to the human expert. In particular, they may require a different number of manual inspections. We present a set of strategies to select the smallest set of manual inspections to arrive at a consensus clustering and evaluate their efficiency on a set of real-world and synthetic datasets.
-
Gruetze, T., Kasneci, G., Zuo, Z., Naumann, F.: Bootstrapping Wikipedia to Answer Ambiguous Person Name Queries. 10th International Workshop on Information Integration on the Web (IIWeb), Chicago, IL (2014).
-
Krestel, R., Bergler, S., Witte, R.: Modeling human newspaper readers: The Fuzzy Believer approach. Natural Language Engineering. 20, 261--288 (2014).
The growing number of publicly available information sources makes it impossible for individuals to keep track of all the various opinions on one topic. The goal of our Fuzzy Believer system presented in this paper is to extract and analyze statements of opinion from newspaper articles. Beliefs are modeled using fuzzy set theory, applied after Natural Language Processing-based information extraction. The Fuzzy Believer models a human agent, deciding which statements to believe or reject based on a range of configurable strategies.
-
Lehmann, J., Isele, R., Jakob, M., Jentzsch, A., Kontokostas, D., Mendes, P.N., Hellmann, S., Morsey, M., van Kleef, P., Auer, S., Bizer, C.: DBpedia – A Large-scale, Multilingual Knowledge Base Extracted from Wikipedia. Semantic Web Journal. (2014).
Selected for 2014 Semantic Web journal outstanding paper award.
-
Alexandrov, A., Bergmann, R., Ewen, S., Freytag, J.-C., Hueske, F., Heise, A., Kao, O., Leich, M., Leser, U., Markl, V., Naumann, F., Peters, M., Rheinländer, A., Sax, M.J., Schelter, S., Höger, M., Tzoumas, K., Warneke, D.: The Stratosphere Platform for Big Data Analytics. The VLDB Journal. 23, 939-964 (2014).
We present Stratosphere, an open-source software stack for parallel data analysis. Stratosphere brings together a unique set of features that allow the expressive, easy, and efficient programming of analytical applications at very large scale. Stratosphere’s features include “in situ” data processing, a declarative query language, treatment of user-defined functions as first-class citizens, automatic program parallelization and optimization, support for iterative programs, and a scalable and efficient execution engine. Stratosphere covers a variety of “Big Data” use cases, such as data warehousing, information extraction and integration, data cleansing, graph analysis, and statistical analysis applications. In this paper, we present the overall system architecture design decisions, introduce Stratosphere through example queries, and then dive into the internal workings of the system’s components that relate to extensibility, programming model, optimization, and query execution. We experimentally compare Stratosphere against popular open-source alternatives, and we conclude with a research outlook for the next years.
-
Vogel, T., Heise, A., Draisbach, U., Lange, D., Naumann, F.: Reach for Gold: An Annealing Standard to Evaluate Duplicate Detection Results. JDIQ. 5 (2014).
-
Lorey, J.: Identifying and Determining SPARQL Endpoint Characteristics. International Journal of Web Information Systems. 10 (2014).
Publicly accessible SPARQL endpoints contain vast amounts of knowledge from a large variety of domains. Utilizing the structured query language, users can consume, integrate, and present data from such Linked Data sources for different application scenarios. However, oftentimes these endpoints are not configured to process specific workloads as efficiently as possible. Implemented restrictions further impede data consumption, e.g., by limiting the number of results returned per request. Assisting users in leveraging SPARQL endpoints requires insight into functional and non-functional properties of these knowledge bases. In this work, we introduce several metrics that enable universal and fine-grained characterization of arbitrary Linked Data repositories. We present comprehensive approaches for deriving these metrics and validate them through extensive evaluation on real-world SPARQL endpoints. Finally, we discuss possible implications of our findings for data consumers.
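One such non-functional property, response latency together with an apparent result-size cap, can be probed with a few lines using the SPARQLWrapper library; the probe below is a simplified stand-in for the metrics defined in the paper, and the endpoint and limit are example choices.

    # Probe a public SPARQL endpoint for latency and an apparent result-size cap.
    import time
    from SPARQLWrapper import SPARQLWrapper, JSON

    def probe_endpoint(endpoint_url, limit=20000):
        sparql = SPARQLWrapper(endpoint_url)
        sparql.setReturnFormat(JSON)
        sparql.setQuery(f"SELECT ?s WHERE {{ ?s ?p ?o }} LIMIT {limit}")
        start = time.time()
        results = sparql.query().convert()
        latency = time.time() - start
        returned = len(results["results"]["bindings"])
        return {"latency_s": round(latency, 2),
                "returned": returned,
                "capped_below_limit": returned < limit}

    print(probe_endpoint("https://dbpedia.org/sparql"))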
-
Rheinländer, A., Heise, A., Hueske, F., Leser, U., Naumann, F.: SOFA: An Extensible Logical Optimizer for UDF-heavy Dataflows. (2013).
-
Albrecht, A., Naumann, F.: Systematic ETL Management – Experiences with High-Level Operators. Proceedings of the 18th International Conference on Information Quality (ICIQ), Little Rock, AR (2013).
-
Abedjan, Z., Naumann, F.: Synonym Analysis for Predicate Expansion. Proceedings of the Extended Semantic Web Conference (ESWC), Montpellier, France (2013).
-
Lorey, J., Naumann, F.: Caching and Prefetching Strategies for SPARQL Queries. ESWC 2013 Satellite Events -- Revised Selected Papers, Montpellier, France (2013).
Linked Data repositories offer a wealth of structured facts, useful for a wide array of application scenarios. However, retrieving this data using SPARQL queries yields a number of challenges, such as limited endpoint capabilities and availability, or high latency for connecting to it. To cope with these challenges, we argue that it is advantageous to cache data that is relevant for future information needs. However, instead of only retaining results of previously issued queries, we aim at retrieving data that is potentially interesting for subsequent requests in advance. To this end, we present different methods to modify the structure of a query so that the altered query can be used to retrieve additional related information. We evaluate these approaches by applying them to requests found in real-world SPARQL query logs.
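A toy sketch of one conceivable query modification, assuming a simple string-level rewrite: lifting a concrete resource into a variable broadens the query so that its result can also serve related follow-up requests. The paper's modification methods are more systematic than this.

```python
# Illustrative only: broaden a query by replacing a fixed resource with a variable,
# so the prefetched result covers related requests as well. Query and IRI are made up.
original = """
SELECT ?abstract WHERE {
  <http://dbpedia.org/resource/Berlin> dbo:abstract ?abstract .
}
"""

def broaden(query, resource_iri, new_var="?subject"):
    """Replace a fixed resource with a variable and also select it."""
    broadened = query.replace(resource_iri, new_var)
    return broadened.replace("SELECT ?abstract", f"SELECT {new_var} ?abstract")

print(broaden(original, "<http://dbpedia.org/resource/Berlin>"))
```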
-
Jenders, M., Kasneci, G., Naumann, F.: Analyzing and Predicting Viral Tweets. Proceedings of the WWW '13 Companion: 22nd International World Wide Web Conference, Rio de Janeiro, Brazil (2013).
Twitter and other microblogging services have become indispensable sources of information in today's web. Understanding the main factors that make certain pieces of information spread quickly in these platforms can be decisive for the analysis of opinion formation and many other opinion mining tasks. This paper addresses important questions concerning the spread of information on Twitter. What makes Twitter users retweet a tweet? Is it possible to predict whether a tweet will become "viral", i.e., will be frequently retweeted? To answer these questions we provide an extensive analysis of a wide range of tweet and user features regarding their influence on the spread of tweets. The most impactful features are chosen to build a learning model that predicts viral tweets with high accuracy. All experiments are performed on a real-world dataset, extracted through a public Twitter API based on user IDs from the TREC 2011 microblog corpus.
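For intuition only, a tiny sketch of the prediction step: a classifier over a handful of toy tweet features. The feature set, the toy data, and the model choice are assumptions; the paper analyzes a much wider range of features on real Twitter data.

```python
# Illustrative sketch: predict "viral" (frequently retweeted) from a few toy features.
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical features per tweet: follower count, #hashtags, #mentions, has_url.
X = np.array([[120, 0, 1, 0], [50000, 2, 0, 1], [300, 1, 0, 0], [80000, 3, 1, 1]])
y = np.array([0, 1, 0, 1])  # 1 = retweeted more often than some chosen threshold

model = LogisticRegression().fit(X, y)
print(model.predict([[60000, 2, 0, 1]]))  # predicted virality for an unseen tweet
```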
-
Lorey, J., Naumann, F.: Caching and Prefetching Strategies for SPARQL Queries. Proceedings of the 3rd International Workshop on Usage Analysis and the Web of Data (USEWOD), Montpellier, France (2013).
Selected as Best Workshop Paper for publication in the ESWC post-proceedings.
-
Lorey, J.: Storing and Provisioning Linked Data as a Service. Proceedings of the 10th Extended Semantic Web Conference (ESWC), Montpellier, France (2013).
Linked Data offers novel opportunities for aggregating information about a wide range of topics and for a multitude of applications. While the technical specifications of Linked Data have been a major research undertaking for the last decade, there is still a lack of real-world data and applications exploiting this data. Partly, this is due to the fact that datasets remain isolated from one another and their integration is a non-trivial task. In this work, we argue for a Data-as-a-Service approach combining both warehousing and query federation to discover and consume Linked Data. We compare our work to state-of-the-art approaches for discovering, integrating, and consuming Linked Data. Moreover, we illustrate a number of challenges when combining warehousing with federation features, and highlight key aspects of our research.
-
Lorey, J.: SPARQL Endpoint Metrics for Quality-Aware Linked Data Consumption. Proceedings of the 15th International Conference on Information Integration and Web-based Applications & Services (iiWAS '13), Vienna, Austria (2013).
In recent years, dozens of publicly accessible Linked Data repositories containing vast amounts of knowledge presented in the Resource Description Framework (RDF) format have been set up worldwide. By utilizing the SPARQL query language, users can consume, integrate, and present data from a federation of sources for different application scenarios. However, several challenges arise for distributed query processing across multiple SPARQL endpoints, such as devising suitable query optimization or result caching strategies. For implementing these techniques, one crucial aspect is determining appropriate endpoint features. In this work, we introduce several metrics that enable universal and fine-grained characterization of arbitrary Linked Data repositories. We present comprehensive approaches for deriving these metrics and validate them through extensive evaluation on real-world SPARQL endpoints. Finally, we discuss possible implications of our findings for data consumers.
-
Leich, M., Adamek, J., Schubotz, M., Heise, A., Rheinländer, A., Markl, V.: Applying Stratosphere for Big Data Analytics. Database Systems for Business, Technology, and Web (BTW) (2013).
-
Lange, D., Naumann, F.: Bulk Sorted Access for Efficient Top-k Retrieval. Proceedings of the International Conference on Scientific and Statistical Database Management (SSDBM), Baltimore, Maryland (2013).
-
Draisbach, U., Naumann, F.: On Choosing Thresholds for Duplicate Detection. Proceedings of the 18th International Conference on Information Quality (ICIQ), Little Rock, AR, USA (2013).
-
Lacoste-Julien, S., Palla, K., Davies, A., Kasneci, G., Graepel, T., Ghahramani, Z.: SiGMa: Simple Greedy Matching for Aligning Large Knowledge Bases. Proceedings of the 19th ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD) (2013).
-
Forchhammer, B., Papenbrock, T., Stening, T., Viehmeier, S., Draisbach, U., Naumann, F.: Duplicate Detection on GPUs. Proceedings of the Conference on Database Systems for Business, Technology, and Web (BTW). pp. 165-184 (2013).
Runner-up for the Best Paper Award.
With the ever-increasing volume of data and the ability to integrate different data sources, data quality problems abound. Duplicate detection, as an integral part of data cleansing, is essential in modern information systems. We present a complete duplicate detection workflow that utilizes the capabilities of modern graphics processing units (GPUs) to increase the efficiency of finding duplicates in very large datasets. Our solution covers several well-known algorithms for pair selection, attribute-wise similarity comparison, record-wise similarity aggregation, and clustering. We redesigned these algorithms to run memory-efficiently and in parallel on the GPU. Our experiments demonstrate that the GPU-based workflow is able to outperform a CPU-based implementation on large, real-world datasets. For instance, the GPU-based algorithm deduplicates a dataset with 1.8m entities 10 times faster than a common CPU-based algorithm using comparably priced hardware.
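The following CPU-only sketch walks through the four workflow stages on a toy dataset; the character-level Jaccard similarity, the 0.8 threshold, and the blocking key are illustrative assumptions, and the paper's contribution lies in running the comparison stages as GPU kernels.

```python
# CPU-side sketch of the four stages (pair selection, attribute similarity,
# aggregation, clustering); not the paper's GPU implementation.
from itertools import combinations

records = {
    1: {"name": "Jon Smith",  "city": "Berlin"},
    2: {"name": "John Smith", "city": "Berlin"},
    3: {"name": "Mary Jones", "city": "Potsdam"},
}

def jaccard(a, b):
    sa, sb = set(a.lower()), set(b.lower())
    return len(sa & sb) / len(sa | sb) if sa | sb else 1.0

# 1) Pair selection: block on the city attribute instead of comparing all pairs.
blocks = {}
for rid, rec in records.items():
    blocks.setdefault(rec["city"], []).append(rid)
pairs = [p for ids in blocks.values() for p in combinations(ids, 2)]

# 2) + 3) Attribute-wise similarity and record-wise aggregation (simple average).
def record_sim(r1, r2):
    return sum(jaccard(r1[a], r2[a]) for a in r1) / len(r1)

matches = [(i, j) for i, j in pairs if record_sim(records[i], records[j]) > 0.8]

# 4) Clustering via transitive closure of the match pairs.
clusters = {rid: {rid} for rid in records}
for i, j in matches:
    merged = clusters[i] | clusters[j]
    for rid in merged:
        clusters[rid] = merged
print(matches, clusters)
```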
-
Lorey, J., Naumann, F.: Detecting SPARQL Query Templates for Data Prefetching. Proceedings of the 10th Extended Semantic Web Conference (ESWC), Montpellier, France (2013).
Publicly available Linked Data repositories provide a multitude of information. By utilizing SPARQL, Web sites and services can consume this data and present it in a user-friendly form, e.g., in mash-ups. To gather RDF triples for this task, machine agents typically issue similarly structured queries with recurring patterns against the SPARQL endpoint. These queries usually differ only in a small number of individual triple pattern parts, such as resource labels or literals in objects. We present an approach to detect such recurring patterns in queries and introduce the notion of query templates, which represent clusters of similar queries exhibiting these recurrences. We describe a matching algorithm to extract query templates and illustrate the benefits of prefetching data by utilizing these templates. Finally, we comment on the applicability of our approach using results from real-world SPARQL query logs.
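A hypothetical sketch of the template idea, assuming a crude regex-based masking of IRIs and literals; the paper's matching algorithm aligns query structures more carefully than this.

```python
# Illustrative only: reduce queries to templates by masking literals and IRIs,
# then group similar queries; the example queries are made up.
import re
from collections import defaultdict

def template_of(query):
    masked = re.sub(r'"[^"]*"(@\w+)?', "?LIT", query)  # mask string literals
    masked = re.sub(r"<[^>]+>", "?IRI", masked)         # mask full IRIs
    return re.sub(r"\s+", " ", masked).strip()

queries = [
    'SELECT ?p WHERE { <http://dbpedia.org/resource/Berlin> rdfs:label "Berlin"@en . ?s ?p ?o }',
    'SELECT ?p WHERE { <http://dbpedia.org/resource/Paris> rdfs:label "Paris"@fr . ?s ?p ?o }',
]

groups = defaultdict(list)
for q in queries:
    groups[template_of(q)].append(q)

for tpl, qs in groups.items():
    print(len(qs), tpl)
```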
-
Heise, A., Quiané-Ruiz, J.-A., Abedjan, Z., Jentzsch, A., Naumann, F.: Scalable Discovery of Unique Column Combinations. Proceedings of the VLDB Endowment (PVLDB), Hangzhou, China (2013).
Jorge's presentation at VLDB 2014 was awarded the "Excellent Presentation Award".
The discovery of all unique (and non-unique) column combinations in a given dataset is at the core of any data profiling effort. The results are useful for a large number of areas of data management, such as anomaly detection, data integration, data modeling, duplicate detection, indexing, and query optimization. However, discovering all unique and non-unique column combinations is an NP-hard problem, which in principle requires verifying an exponential number of column combinations for uniqueness on all data values. Thus, achieving efficiency and scalability in this context is a tremendous challenge by itself. In this paper, we devise DUCC, a scalable and efficient approach to the problem of finding all unique and non-unique column combinations in big datasets. We first model the problem as a graph coloring problem and analyze the pruning effect of individual combinations. We then present our hybrid column-based pruning technique, which traverses the lattice with a combination of depth-first and random walk strategies. This strategy allows DUCC to typically depend on the solution set size and hence to prune large swaths of the lattice. DUCC also incorporates row-based pruning to run uniqueness checks in just a few milliseconds. To achieve even higher scalability, DUCC runs on several CPU cores (scale-up) and compute nodes (scale-out) with a very low overhead. We exhaustively evaluate DUCC using three datasets (two real and one synthetic) with several million rows and hundreds of attributes. We compare DUCC with related work: Gordian and HCA. The results show that DUCC is up to more than two orders of magnitude faster than Gordian and HCA (631x faster than Gordian and 398x faster than HCA). Finally, a series of scalability experiments shows the efficiency of DUCC to scale up and out.
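For intuition, the per-combination uniqueness test can be sketched as below; DUCC's contribution is avoiding the exponential enumeration of such tests through lattice pruning, not the test itself, and the toy table is made up.

```python
# Naive uniqueness check for a single column combination (not DUCC's pruned search).
rows = [
    {"first": "Ada",  "last": "Lovelace", "city": "London"},
    {"first": "Ada",  "last": "Byron",    "city": "London"},
    {"first": "Alan", "last": "Turing",   "city": "London"},
]

def is_unique(rows, columns):
    seen = set()
    for row in rows:
        key = tuple(row[c] for c in columns)
        if key in seen:
            return False  # a duplicate value combination exists
        seen.add(key)
    return True

print(is_unique(rows, ["first"]))          # False: "Ada" appears twice
print(is_unique(rows, ["first", "last"]))  # True: the combination is a key candidate
```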
-
Lange, D., Naumann, F.: Cost-Aware Query Planning for Similarity Search. Information Systems (IS). 38, 455--469 (2013).
-
Naumann, F., Jenders, M., Papenbrock, T.: Ein Datenbankkurs mit 6000 Teilnehmern - Erfahrungen auf der openHPI MOOC Plattform. Informatik-Spektrum. 37, 333-340 (2013).
-
Naumann, F.: Data Profiling Revisited. SIGMOD Record. 42, 40-49 (2013).
Data profiling comprises a broad range of methods to efficiently analyze a given data set. In a typical scenario, which mirrors the capabilities of commercial data profiling tools, tables of a relational database are scanned to derive metadata, such as data types and value patterns, completeness and uniqueness of columns, keys and foreign keys, and occasionally functional dependencies and association rules. Individual research projects have proposed several additional profiling tasks, such as the discovery of inclusion dependencies or conditional functional dependencies. Data profiling deserves a fresh look for two reasons: First, the area itself is neither established nor defined in any principled way, despite significant research activity on individual parts in the past. Second, more and more data beyond the traditional relational databases are being created and beg to be profiled. The article proposes new research directions and challenges, including interactive and incremental profiling and profiling heterogeneous and non-relational data.
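A minimal sketch of single-column profiling in the spirit of the metadata listed above (types, completeness, uniqueness), assuming pandas; commercial tools and the dependency-discovery tasks discussed in the article go far beyond this.

```python
# Toy single-column profiler: data type, completeness, distinctness, key candidacy.
import pandas as pd

def profile(df: pd.DataFrame) -> pd.DataFrame:
    stats = []
    for col in df.columns:
        s = df[col]
        stats.append({
            "column": col,
            "dtype": str(s.dtype),
            "completeness": 1.0 - s.isna().mean(),  # share of non-null values
            "distinct": s.nunique(dropna=True),
            "is_key_candidate": s.nunique(dropna=True) == len(s) and not s.isna().any(),
        })
    return pd.DataFrame(stats)

df = pd.DataFrame({"id": [1, 2, 3], "name": ["a", "b", None]})
print(profile(df))
```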
-
Momtazi, S., Naumann, F.: Topic modeling for expert finding using latent Dirichlet allocation. WIREs Data Mining and Knowledge Discovery. 3, 346–353 (2013).
The task of expert finding is to rank the experts in the search space given a field of expertise as an input query. In this paper, we propose a topic modeling approach for this task. The proposed model uses latent Dirichlet allocation (LDA) to induce probabilistic topics. In the first step of our algorithm, the main topics of a document collection are extracted using LDA. The extracted topics present the connection between expert candidates and user queries. In the second step, the topics are used as a bridge to find the probability of selecting each candidate for a given query. The candidates are then ranked based on these probabilities. The experimental results on the Text REtrieval Conference (TREC) Enterprise track for 2005 and 2006 show that the proposed topic-based approach outperforms the state-of-the-art profile- and document-based models, which use information retrieval methods to rank experts. Moreover, we present the superiority of the proposed topic-based approach to the improved document-based expert finding systems, which consider additional information such as local context, candidate prior, and query expansion.
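A rough sketch of the two-step idea using gensim's LDA implementation; the toy documents, the dot-product ranking, and all parameter choices are assumptions rather than the paper's exact probabilistic model.

```python
# Sketch: topics act as a bridge between a query and each candidate's documents,
# and candidates are ranked by how well their topic mixture matches the query's.
from gensim import corpora, models

candidate_docs = {
    "alice": ["database", "schema", "integration", "cleaning"],
    "bob":   ["retrieval", "ranking", "query", "search"],
}
texts = list(candidate_docs.values())
dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(t) for t in texts]
lda = models.LdaModel(corpus, num_topics=2, id2word=dictionary, random_state=0, passes=10)

def topic_vector(tokens):
    bow = dictionary.doc2bow(tokens)
    return dict(lda.get_document_topics(bow, minimum_probability=0.0))

query_vec = topic_vector(["data", "integration"])
scores = {
    name: sum(query_vec.get(t, 0.0) * p for t, p in topic_vector(doc).items())
    for name, doc in candidate_docs.items()
}
print(sorted(scores.items(), key=lambda kv: -kv[1]))
```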
-
Rinser, D., Lange, D., Naumann, F.: Cross-lingual Entity Matching and Infobox Alignment in Wikipedia. Information Systems (IS). 38, 887–907 (2013).
-
Abedjan, Z., Naumann, F.: Improving RDF Data through Association Rule Mining. Datenbank-Spektrum (Special Issue on RDF Data Management). 13, 111--120 (2013).
-
Bauckmann, J., Abedjan, Z., Leser, U., Müller, H., Naumann, F.: Covering or Complete? Discovering Conditional Inclusion Dependencies. Hasso-Plattner-Institut für Softwaresystemtechnik an der Universität Potsdam (2012).
ISBN 978-3-86956-212-4, ISSN 1613-5652
-
Draisbach, U., Naumann, F.: Adaptive Windows for Duplicate Detection. Hasso-Plattner-Institut für Softwaresystemtechnik an der Universität Potsdam (2012).
ISBN 978-3-86956-143-1, ISSN 1613-5652
-
Albrecht, A., Naumann, F.: Understanding Cryptic Schemata in Large Extract-Transform-Load Systems. Hasso-Plattner-Institut für Softwaresystemtechnik an der Universität Potsdam (2012).
ISBN 978-3-86956-201-8, ISSN 1613-5652
-
Böhm, C., de Melo, G., Naumann, F., Weikum, G.: LINDA: Distributed Web-of-Data-Scale Entity Matching. Proceedings of the International Conference on Information and Knowledge Management (CIKM), Maui, Hawaii (2012).
-
Fenz, D., Lange, D., Rheinländer, A., Naumann, F., Leser, U.: Efficient Similarity Search in Very Large String Sets. Proceedings of the International Conference on Scientific and Statistical Database Management (SSDBM), Chania, Crete, Greece (2012).
-
Köppelmann, M., Lange, D., Lehmann, C., Marszalkowski, M., Naumann, F., Retzlaff, P., Stange, S., Voget, L.: Scalable Similarity Search with Dynamic Similarity Measures. Proceedings of the 6th International Workshop on Ranking in Databases (DBRank) in conjunction with VLDB, Istanbul, Turkey (2012).
-
Abedjan, Z., Lorey, J., Naumann, F.: Reconciling Ontologies and the Web of Data. Proceedings of the 21st International Conference on Information and Knowledge Management (CIKM). pp. 1532-1536, Maui, Hawaii, USA (2012).
-
Heise, A., Rheinländer, A., Leich, M., Leser, U., Naumann, F.: Meteor/Sopremo: An Extensible Query Language and Operator Model. Proceedings of the International Workshop on End-to-end Management of Big Data (BigData) in conjunction with VLDB 2012, Istanbul, Turkey (2012).
-
Böhm, C., Freitag, M., Heise, A., Lehmann, C., Mascher, A., Naumann, F., Hernandez, M., Ercegovac, V., Haase, P.: GovWILD: Integrating Open Government Data for Transparency (demo). Proceedings of the International World Wide Web Conference (WWW), Lyon, France (2012).
-
Tafaj, E., Kasneci, G., Rosenstiel, W., Bogdan, M.: Bayesian online clustering of eye movement data. Proceedings of the 2012 Symposium on Eye-Tracking Research and Applications. pp. 285-288. ACM (2012).
-
Vogel, T., Naumann, F.: Automatic Blocking Key Selection for Duplicate Detection based on Unigram Combinations. Proceedings of the 10th International Workshop on Quality in Databases (QDB) in conjunction with VLDB (2012).
Duplicate detection is the process of identifying multiple but different representations of the same real-world objects, which typically involves a large number of comparisons. Partitioning is a well-known technique to avoid many unnecessary comparisons. However, partitioning keys are usually handcrafted, which is tedious, and the keys are often poorly chosen. We propose a technique to find suitable blocking keys automatically for a dataset equipped with a gold standard. We then show how to re-use those blocking keys for datasets from similar domains lacking a gold standard. Blocking keys are created based on unigrams, which we extend with length-hints for further improvement. Blocking key creation is accompanied by several comprehensive experiments on large artificial and real-world datasets.
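A small sketch of scoring one candidate blocking key against a gold standard via pair completeness and reduction ratio; the records, the gold pairs, and the specific key function are made up, and the paper additionally searches over unigram combinations and length hints.

```python
# Illustrative scoring of a single candidate blocking key against a gold standard.
from itertools import combinations

records = {1: "john smith", 2: "jon smith", 3: "mary jones", 4: "maria jones"}
gold_pairs = {(1, 2), (3, 4)}  # known duplicates

def candidate_pairs(blocking_key):
    """Group records by a key and emit all pairs within each block."""
    blocks = {}
    for rid, value in records.items():
        blocks.setdefault(blocking_key(value), []).append(rid)
    return {p for ids in blocks.values() for p in combinations(sorted(ids), 2)}

key = lambda v: v.split()[-1][0]  # toy key: first character of the last token
pairs = candidate_pairs(key)
all_pairs = len(records) * (len(records) - 1) / 2
print("pair completeness:", len(pairs & gold_pairs) / len(gold_pairs))
print("reduction ratio:", 1 - len(pairs) / all_pairs)
```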
-
Albrecht, A., Naumann, F.: Schema Decryption for Large Extract-Transform-Load Systems. Proceedings of the 31st International Conference on Conceptual Modeling (ER 2012), Florence, Italy (2012).
-
Böhm, C., Kasneci, G., Naumann, F.: Latent Topics in Graph-Structured Data. Proceedings of the Conference on Information and Knowledge Management (CIKM) (2012).
Large amounts of graph-structured data are emerging from various avenues, ranging from natural and life sciences to social and semantic web communities. We address the problem of discovering subgraphs of entities that reflect latent topics in graph-structured data. These topics are structured meta-information providing further insights into the data. The presented approach effectively detects such topics by exploiting only the structure of the underlying graph, thus avoiding the dependency on textual labels, which are a scarce asset in prevalent graph datasets. The viability of our approach is demonstrated in experiments on real-world datasets.
-
Gruetze, T., Böhm, C., Naumann, F.: Holistic and Scalable Ontology Alignment for Linked Open Data. Proceedings of the 5th Linked Data on the Web (LDOW) Workshop at the 21st International World Wide Web Conference (WWW), Lyon, France (2012).
-
Momtazi, S.: Fine-grained German Sentiment Analysis on Social Media. Proceedings of the 8th International Conference on Language Resources and Evaluation (LREC), Istanbul, Turkey (2012).
-
Böhm, C., Hefenbrock, D., Naumann, F.: Scalable Peer-to-Peer-based RDF Management. Proceedings of the 8th International Conference on Semantic Systems, Graz, Austria (2012).
Handling web-scale RDF data requires sophisticated data management that scales easily and integrates seamlessly into existing analysis workflows. We present HDRS – a scalable storage infrastructure that enables online analysis of very large RDF data sets. HDRS combines state-of-the-art data management techniques to organize triples in indexes that are sharded and stored in a peer-to-peer system. The store is open source at http://code.google.com/p/hdrs and integrates well with Hadoop MapReduce or any other client application.
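A hypothetical sketch of distributing triples over peers by hashing an index key; HDRS's actual index organization and shard assignment may differ, so treat this purely as intuition about sharding triple indexes in a peer-to-peer setting.

```python
# Toy hash-based sharding of triples over peers; peer names and orderings are assumptions.
import hashlib

PEERS = ["peer-0", "peer-1", "peer-2"]

def peer_for(triple, order="spo"):
    # Build the index key in the requested ordering, then hash it onto a peer.
    s, p, o = triple
    key = {"spo": (s, p, o), "pos": (p, o, s), "osp": (o, s, p)}[order]
    digest = hashlib.sha1("\x00".join(key).encode("utf-8")).hexdigest()
    return PEERS[int(digest, 16) % len(PEERS)]

print(peer_for(("dbr:Berlin", "dbo:country", "dbr:Germany")))
```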
-
Kasneci, G.: Reasoning about Knowledge from the Web - (Extended Abstract). ICWE Workshops. pp. 186-188. Springer (2012).
-
Bauckmann, J., Abedjan, Z., Müller, H., Leser, U., Naumann, F.: Discovering Conditional Inclusion Dependencies. Proceedings of the International Conference on Information and Knowledge Management (CIKM), Maui, Hawaii. pp. 2094-2098 (2012).
-
Draisbach, U., Naumann, F., Szott, S., Wonneberg, O.: Adaptive Windows for Duplicate Detection. Proceedings of the 28th International Conference on Data Engineering (ICDE), Washington, D.C., USA (2012).
-
Draisbach, U.: Partitionierung zur effizienten Duplikaterkennung in relationalen Daten. Springer Vieweg (2012).
-
Beskales, G., Das, G., Elmagarmid, A.K., Ilyas, I.F., Naumann, F., Ouzzani, M., Papotti, P., Quiané-Ruiz, J., Tang, N.: The Data Analytics Group at the Qatar Computing Research Institute. SIGMOD Record. 41, (2012).
-
Abelló, A., Darmont, J., Etcheverry, L., Golfarelli, M., Mazón, J.-N., Naumann, F., Pedersen, T.B., Rizzi, S., Trujillo, J., Vassiliadis, P., Vossen, G.: Fusion Cubes: Towards Self-Service Business Intelligence. International Journal of Data Warehousing and Mining (IJDWM). 9, 66-88 (2012).
Self-service business intelligence is about enabling non-expert users to make well-informed decisions by enriching the decision process with situational data, i.e., data that have a narrow focus on a specific business problem and, typically, a short lifespan for a small group of users. Often, these data are not owned and controlled by the decision maker; their search, extraction, integration, and storage for reuse or sharing should be accomplished by decision makers without any intervention by designers or programmers. The goal of this paper is to present the framework we envision to support self-service business intelligence and the related research challenges. The underlying core idea is the notion of fusion cubes, i.e., multidimensional cubes that can be dynamically extended both in their schema and their instances, and in which situational data and metadata are associated with quality and provenance annotations.
-
Heise, A., Naumann, F.: Integrating Open Government Data with Stratosphere for more Transparency. Web Semantics: Science, Services and Agents on the World Wide Web. 14, 45-56 (2012).
Governments are increasingly publishing their data to enable organizations and citizens to browse and analyze the data. However, the heterogeneity of this Open Government Data hinders meaningful search, analysis, and integration and thus limits the desired transparency. In this article, we present the newly developed data integration operators of the Stratosphere parallel data analysis framework to overcome the heterogeneity. With declaratively specified queries, we demonstrate the integration of well-known government data sources and other large open data sets at technical, structural, and semantic levels. Furthermore, we publish the integrated data on the Web in a form that enables users to discover relationships between persons, government agencies, funds, and companies. The evaluation shows that linking person entities of different data sets results in a good precision of 98.3% and a recall of 95.2%. Moreover, the integration of large data sets scales well on up to eight machines.
-
Herschel, M., Naumann, F., Szott, S., Taubert, M.: Scalable Iterative Graph Duplicate Detection. Transactions on Knowledge and Data Engineering (TKDE). 24, 2094-2108 (2012).