For bachelor students we offer German lectures on database systems in addition to paper- or project-oriented seminars. Within a one-year bachelor project, students finalize their studies in cooperation with external partners. For master students we offer courses on information integration, data profiling, and information retrieval, complemented by specialized seminars and master projects, and we advise master theses.
Most of our research is conducted in the context of larger research projects, in collaboration across students, across groups, and across universities. We strive to make available most of our datasets and source code.
Deep neural networks can be used to create representations for words, sentences, and documents, as well as for entities, relations, and more. They provide a dense vector to represent high-dimensional, sparse data in a compact way. Such embedding models have been shown to improve the results of many text mining tasks. Further, combining these representations can reveal new insights. We investigate how these models can be used for text mining and develop new models for specific text mining tasks, such as splitting e-mail threads.
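As a minimal illustration of what a dense representation offers, the following sketch compares words by cosine similarity and combines word vectors into a text vector by averaging. The three-dimensional vectors and the names `EMBEDDINGS`, `cosine`, and `embed_text` are made up for this example; they are assumptions and not taken from any trained model.

```python
import math

# Hypothetical toy embeddings; trained models typically use hundreds of dimensions.
EMBEDDINGS = {
    "patent": [0.9, 0.1, 0.0],
    "invention": [0.8, 0.2, 0.1],
    "email": [0.0, 0.1, 0.9],
}


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def embed_text(tokens, embeddings):
    """Represent a text as the average of the vectors of its known tokens."""
    vectors = [embeddings[t] for t in tokens if t in embeddings]
    if not vectors:
        raise ValueError("no token of the text has an embedding")
    dim = len(vectors[0])
    return [sum(v[d] for v in vectors) / len(vectors) for d in range(dim)]


# Related words end up closer to each other than unrelated ones.
related = cosine(EMBEDDINGS["patent"], EMBEDDINGS["invention"])
unrelated = cosine(EMBEDDINGS["patent"], EMBEDDINGS["email"])
print(related > unrelated)
```

Averaging is only one simple way to combine word vectors; it ignores word order and is used here purely for brevity.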
Knowledge graphs (KGs) form the basis of modern intelligent search systems: their network structure helps with the semantic reasoning and interpretation of complex tasks. A KG is a highly dynamic structure in which facts are continuously updated, added, and removed. A typical approach to ensure data quality in the presence of continuous changes is to apply logic rules. These rules are automatically mined from the data using frequency-based approaches. As a result, these approaches depend on the data quality of the KG and are susceptible to errors and incompleteness. To address these issues, we propose COLT, a few-shot rule-based knowledge validation framework that enables the interactive quality assessment of logic rules. It evaluates the quality of any rule by asking a user to validate only a few facts entailed by that rule on the KG. We formalize the problem as learning a validation function over the rule's outcomes and study the theoretical connections to the generalized maximum coverage problem. Our model obtains (i) an accurate estimate of the quality of a rule with fewer than 20 user interactions and (ii) 75% quality (F1) with 5% annotations in the task of validating facts entailed by any rule.
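To make the interactive setting concrete, here is a naive baseline, not COLT itself: it draws a small random sample of the facts entailed by a rule, asks the user about each one, and reports the fraction confirmed. COLT instead learns a validation function over the rule's outcomes. The function name `estimate_rule_quality`, the `ask_user` callback, and the default budget of 15 are assumptions made for this sketch.

```python
import random


def estimate_rule_quality(entailed_facts, ask_user, budget=15, seed=0):
    """Estimate the precision of a rule from at most `budget` user answers.

    entailed_facts: list of facts the rule entails on the KG.
    ask_user: hypothetical callback returning True if the user confirms a fact.
    Returns (estimated precision, number of user interactions).
    """
    rng = random.Random(seed)
    sample = rng.sample(entailed_facts, min(budget, len(entailed_facts)))
    if not sample:
        return 0.0, 0
    confirmed = sum(1 for fact in sample if ask_user(fact))
    return confirmed / len(sample), len(sample)


# Usage with a simulated user who only accepts facts about even-numbered cities.
facts = [("person%d" % i, "bornIn", "city%d" % i) for i in range(100)]
quality, interactions = estimate_rule_quality(
    facts, ask_user=lambda fact: int(fact[2][4:]) % 2 == 0
)
print(quality, interactions)
```

Uniform sampling treats every entailed fact as equally informative, which is the limitation a learned validation function is meant to overcome.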
Risch, J., Krestel, R.: Domain-specific word embeddings for patent classification. Data Technologies and Applications. 53, 108–122 (2019).
Patent offices and other stakeholders in the patent domain need to classify patent applications according to a standardized classification scheme. To examine the novelty of an application, it can then be compared to previously granted patents in the same class. Automatic classification would be highly beneficial because of the large volume of patents and the domain-specific knowledge needed to accomplish this costly manual task. However, a challenge for the automation is patent-specific language use, such as special vocabulary and phrases. To account for this language use, we present domain-specific pre-trained word embeddings for the patent domain. We train our model on a very large dataset of more than 5 million patents and evaluate it on the task of patent classification. To this end, we propose a deep learning approach based on gated recurrent units for automatic patent classification built on the trained word embeddings. Experiments on a standardized evaluation dataset show that our approach increases average precision for patent classification by 17 percent compared to state-of-the-art approaches. In this paper, we further investigate the model’s strengths and weaknesses. An extensive error analysis reveals that the learned embeddings indeed mirror patent-specific language use. Imbalanced training data and underrepresented classes remain the most difficult challenges.
Risch, J., Krestel, R.: Learning Patent Speak: Investigating Domain-Specific Word Embeddings. Proceedings of the Thirteenth International Conference on Digital Information Management (ICDIM). pp. 63–68 (2018).
A patent examiner needs domain-specific knowledge to classify a patent application according to its field of invention. Standardized classification schemes help to compare a patent application to previously granted patents and thereby check its novelty. Due to the large volume of patents, automatic patent classification would be highly beneficial to patent offices and other stakeholders in the patent domain. However, a challenge for the automation of this costly manual task is the patent-specific language use. To facilitate this task, we present domain-specific pre-trained word embeddings for the patent domain. We trained our model on a very large dataset of more than 5 million patents to learn the language use in this domain. We evaluated the quality of the resulting embeddings in the context of patent classification. To this end, we propose a deep learning approach based on gated recurrent units for automatic patent classification built on the trained word embeddings. Experiments on a standardized evaluation dataset show that our approach increases average precision for patent classification by 17 percent compared to state-of-the-art approaches.
Bunk, S., Krestel, R.: WELDA: Enhancing Topic Models by Incorporating Local Word Contexts. Joint Conference on Digital Libraries (JCDL 2018). ACM, Fort Worth, Texas, USA (2018).
The distributional hypothesis states that similar words tend to have similar contexts in which they occur. Word embedding models exploit this hypothesis by learning word vectors based on the local context of words. Probabilistic topic models, on the other hand, utilize word co-occurrences across documents to identify topically related words. Due to their complementary nature, these models define different notions of word similarity, which, when combined, can produce better topical representations. In this paper, we propose WELDA, a new type of topic model, which combines word embeddings (WE) with latent Dirichlet allocation (LDA) to improve topic quality. We achieve this by estimating topic distributions in the word embedding space and exchanging selected topic words via Gibbs sampling from this space. We present an extensive evaluation showing that WELDA cuts runtime by at least 30% while outperforming other combined approaches with respect to topic coherence and on word intrusion tasks.
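The word exchange step can be pictured roughly as follows. This is a simplified sketch under our own assumptions, not the published WELDA procedure: each topic is modelled as a diagonal Gaussian fitted to the embeddings of its top words, and with some probability a word is replaced by the vocabulary word nearest to a sample from that Gaussian. The names `fit_diagonal_gaussian`, `nearest_word`, `maybe_exchange_word`, and the parameter `replace_prob` are hypothetical.

```python
import math
import random


def fit_diagonal_gaussian(vectors):
    """Per-dimension mean and variance of a list of equal-length vectors."""
    n, dim = len(vectors), len(vectors[0])
    mean = [sum(v[d] for v in vectors) / n for d in range(dim)]
    var = [sum((v[d] - mean[d]) ** 2 for v in vectors) / n for d in range(dim)]
    return mean, var


def nearest_word(point, embeddings):
    """Vocabulary word whose embedding has the smallest Euclidean distance to point."""
    return min(embeddings, key=lambda w: math.dist(point, embeddings[w]))


def maybe_exchange_word(word, topic_top_words, embeddings, replace_prob, rng):
    """With probability replace_prob, swap word for one sampled from the topic's
    distribution in embedding space; otherwise keep it unchanged."""
    if rng.random() >= replace_prob:
        return word
    mean, var = fit_diagonal_gaussian([embeddings[w] for w in topic_top_words])
    sample = [rng.gauss(m, math.sqrt(s)) for m, s in zip(mean, var)]
    return nearest_word(sample, embeddings)


# Toy two-dimensional embeddings, invented for illustration only.
toy = {"goal": [1.0, 0.1], "match": [0.9, 0.2], "league": [0.95, 0.0], "tax": [0.0, 1.0]}
rng = random.Random(42)
print(maybe_exchange_word("tax", ["goal", "match", "league"], toy, 1.0, rng))
```

In a full sampler this step would run inside each Gibbs iteration, after the topic assignment of a token has been drawn.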
Repke, T., Krestel, R.: Topic-aware Network Visualisation to Explore Large Email Corpora. International Workshop on Big Data Visual Exploration and Analytics (BigVis). (2018).
Nowadays, more and more large datasets exhibit an intrinsic graph structure. While there exist special graph databases to handle ever-increasing amounts of nodes and edges, visualising this data quickly becomes infeasible as it grows. In addition, looking at its structure is not sufficient to get an overview of a graph dataset. Indeed, visualising additional information about nodes or edges without cluttering the screen is essential. In this paper, we propose an interactive visualisation for social networks that positions individuals (nodes) on a two-dimensional canvas such that communities defined by social links (edges) are easily recognisable. Furthermore, we visualise topical relatedness between individuals by analysing information about social links, in our case email communication. To this end, we utilise document embeddings, which project the content of an email message into a high-dimensional semantic space, and graph embeddings, which project nodes in a network graph into a latent space reflecting their relatedness.
Repke, T., Krestel, R.: Bringing Back Structure to Free Text Email Conversations with Recurrent Neural Networks. 40th European Conference on Information Retrieval (ECIR 2018). Springer, Grenoble, France (2018).
Email communication plays an integral part in everybody's life nowadays. Especially for business emails, extracting and analysing these communication networks can reveal interesting patterns of processes and decision making within a company. Fraud detection is another application area where precise detection of communication networks is essential. In this paper, we present an approach based on recurrent neural networks to untangle email threads originating from forward and reply behaviour. We further classify parts of emails into two or five zones to capture not only header and body information but also greetings and signatures. We show that our deep learning approach outperforms state-of-the-art systems based on traditional machine learning and hand-crafted rules. Besides using the well-known Enron email corpus for our experiments, we additionally created a new annotated email benchmark corpus from Apache mailing lists.
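For contrast with the neural approach, the following sketch shows the kind of hand-crafted rule that such systems are compared against: it splits a thread wherever a line looks like a typical reply or forward separator. It is not the method of the paper, and the patterns in `SEPARATOR` as well as the function name `split_thread` are assumptions chosen for illustration; real mail clients produce many more variants.

```python
import re

# Hypothetical separator patterns; they cover only a few common client formats.
SEPARATOR = re.compile(
    r"^\s*(-{2,}\s*Original Message\s*-{2,}"
    r"|-{2,}\s*Forwarded by .*"
    r"|On .+ wrote:)\s*$",
    re.IGNORECASE,
)


def split_thread(text):
    """Split a raw email body into its quoted parts, newest first.

    Separator lines themselves are dropped; empty parts are skipped.
    """
    parts = [[]]
    for line in text.splitlines():
        if SEPARATOR.match(line):
            parts.append([])
        else:
            parts[-1].append(line)
    joined = ("\n".join(lines).strip() for lines in parts)
    return [part for part in joined if part]


thread = (
    "Hi Bob,\nsee below.\n\n"
    "-----Original Message-----\n"
    "From: Alice\nCan you check?\n\n"
    "On Mon, Jan 1, 2001, Carol wrote:\n"
    "Please review."
)
for part in split_thread(thread):
    print(repr(part))
```

Rules like these break as soon as a client uses a different separator or users edit the quoted text, which is one motivation for learning the split points instead.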