Risch, J., Alder, N., Hewel, C., Krestel, R.: PatentMatch: A Dataset for Matching Patent Claims with Prior Art. ArXiv e-prints 2012.13919. (2020).
Patent examiners need to solve a complex information retrieval task when they assess the novelty and inventive step of claims made in a patent application. Given a claim, they search for prior art, which comprises all relevant publicly available information. This time-consuming task requires a deep understanding of the respective technical domain and the patent-domain-specific language. For these reasons, we address the computer-assisted search for prior art by creating a training dataset for supervised machine learning called PatentMatch. It contains pairs of claims from patent applications and semantically corresponding text passages of different degrees from cited patent documents. Each pair has been labeled by technically-skilled patent examiners from the European Patent Office. Accordingly, the label indicates the degree of semantic correspondence (matching), i.e., whether the text passage is prejudicial to the novelty of the claimed invention or not. Preliminary experiments using a baseline system show that PatentMatch can indeed be used for training a binary text pair classifier on this challenging information retrieval task. The dataset is available online: https://hpi.de/naumann/s/patentmatch
Risch, J., Ruff, R., Krestel, R.: Offensive Language Detection Explained. Proceedings of the Workshop on Trolling, Aggression and Cyberbullying (TRAC@LREC). pp. 137-143. European Language Resources Association (ELRA) (2020).
Many online discussion platforms use a content moderation process, where human moderators check user comments for offensive language and other rule violations. It is the moderator's decision which comments to remove from the platform because of violations and which ones to keep. Research so far focused on automating this decision process in the form of supervised machine learning for a classification task. However, even with machine-learned models achieving better classification accuracy than human experts in some scenarios, there is still a reason why human moderators are preferred. In contrast to black-box models, such as neural networks, humans can give explanations for their decision to remove a comment. For example, they can point out which phrase in the comment is offensive or what subtype of offensiveness applies. In this paper, we analyze and compare four attribution-based explanation methods for different offensive language classifiers: an interpretable machine learning model (naive Bayes), a model-agnostic explanation method (LIME), a model-based explanation method (LRP), and a self-explanatory model (LSTM with an attention mechanism). We evaluate these approaches with regard to their explanatory power and their ability to point out which words are most relevant for a classifier's decision. We find that the more complex models achieve better classification accuracy while also providing better explanations than the simpler models.
Ehmüller, J., Kohlmeyer, L., McKee, H., Paeschke, D., Repke, T., Krestel, R., Naumann, F.: Sense Tree: Discovery of New Word Senses with Graph-based Scoring. Proceedings of the Conference on "Lernen, Wissen, Daten, Analysen" (LWDA). pp. 1-12 (2020).
Language is dynamic and constantly evolving: both the usage context and the meaning of words change over time. Identifying words that acquired new meanings and the point in time at which new word senses emerged is elementary for word sense disambiguation and entity linking in historical texts. For example, cloud once stood mostly for the weather phenomenon and only recently gained the new sense of cloud computing. We propose a clustering-based approach that computes sense trees, showing how meanings of words change over time. The produced results are easy to interpret and explain using a drill-down mechanism. We evaluate our approach qualitatively on the Corpus of Historic American English (COHA), which spans two hundred years.
Bejnordi, A.E., Krestel, R.: Dynamic Channel and Layer Gating in Convolutional Neural Networks. Proceedings of the 43rd German Conference on Artificial Intelligence (KI 2020) (2020).
Convolutional neural networks (CNN) are getting more and more complex, needing enormous computing resources and energy. In this paper, we propose methods for conditional computation in the context of image classification that allow a CNN to dynamically use its channels and layers conditioned on the input. To this end, we combine light-weight gating modules that can make binary decisions without causing much computational overhead. We argue that combining the recently proposed channel gating mechanism with layer gating can significantly reduce the computational cost of large CNNs. Using discrete optimization algorithms, the gating modules are made aware of the context in which they are used and decide whether a particular channel and/or a particular layer will be executed. This results in neural networks that adapt their own topology conditioned on the input image. Experiments using the CIFAR10 and MNIST datasets show how competitive results in image classification with respect to accuracy can be achieved while saving up to 50% computational resources.
Jain, N., Bartz, C., Krestel, R.: Automatic Matching of Paintings and Descriptions in Art-Historic Archives using Multimodal Analysis. 1st International Workshop on Artificial Intelligence for Historical Image Enrichment and Access (AI4HI-2020), co-located with LREC 2020 conference (2020).
Cultural heritage data plays a pivotal role in the understanding of human history and culture. A wealth of information is buried in art-historic archives which can be extracted via their digitization and analysis. This information can facilitate search and browsing, help art historians to track the provenance of artworks and enable wider semantic text exploration for digital cultural resources. However, this information is contained in images of artworks as well as in textual descriptions or annotations accompanying the images. During the digitization of such resources, the valuable associations between the images and texts are frequently lost. In this project description, we propose an approach to retrieve the associations between images and texts for artworks from art-historic archives. To this end, we use machine learning to generate text descriptions for the extracted images on the one hand, and to detect descriptive phrases and titles of images from the text on the other hand. Finally, we use embeddings to align the descriptions and the images.
Jain, N., Krestel, R.: Learning Fine-Grained Semantics for Multi-Relational Data. International Semantic Web Conference, 2020 Posters and Demos (2020).
The semantics of relations play a central role in the understanding and analysis of multi-relational data. Real-world relational datasets represented by knowledge graphs often contain polysemous relations between different types of entities that represent multiple semantics. In this work, we present a data-driven method that can automatically discover the distinct semantics associated with high-level relations and derive an optimal number of sub-relations having fine-grained meaning. To this end, we perform clustering over vector representations of entities and relations obtained from knowledge graph embedding models.
Lazaridou, K., Löser, A., Mestre, M., Naumann, F.: Discovering Biased News Articles Leveraging Multiple Human Annotations. Proceedings of the Conference on Language Resources and Evaluation (LREC). pp. 1268-1277 (2020).
Risch, J., Künstler, V., Krestel, R.: HyCoNN: Hybrid Cooperative Neural Networks for Personalized News Discussion Recommendation. Proceedings of the International Joint Conferences on Web Intelligence and Intelligent Agent Technologies (WI-IAT) (2020).
Many online news platforms provide comment sections for reader discussions below articles. While users of these platforms often read comments, only a minority of them regularly write comments. To encourage and foster more frequent engagement, we study the task of personalized recommendation of reader discussions to users. We present a neural network model that jointly learns embeddings for users and comments encoding general properties. Based on explicit and implicit user feedback, we sample relevant and irrelevant reader discussions to build a representative training dataset. We compare to several baselines and state-of-the-art approaches in an evaluation on two datasets from The Guardian and Daily Mail. Experimental results show that the recommendations of our approach are superior in terms of precision and recall. Further, the learned user embeddings are of general applicability because they preserve the similarity of users who share interests in similar topics.
Risch, J., Krestel, R.: A Dataset of Journalists' Interactions with Their Readership: When Should Article Authors Reply to Reader Comments? Proceedings of the International Conference on Information and Knowledge Management (CIKM). pp. 3117-3124. ACM (2020).
The comment sections of online news platforms are an important space to indulge in political conversations and to discuss opinions. Although primarily meant as forums where readers discuss amongst each other, they can also spark a dialog with the journalists who authored the article. A small but important fraction of comments address the journalists directly, e.g., with questions, recommendations for future topics, thanks and appreciation, or article corrections. However, the sheer number of comments makes it infeasible for journalists to follow discussions around their articles in extenso. A better understanding of this data could support journalists in gaining insights into their audience and fostering engaging and respectful discussions. To this end, we present a dataset of dialogs in which journalists of The Guardian replied to reader comments and identify the reasons why. Based on this data, we formulate the novel task of recommending reader comments to journalists that are worth reading or replying to, i.e., ranking comments in such a way that the top comments are most likely to require the journalists' reaction. As a baseline, we trained a neural network model with the help of a pair-wise comment ranking task. Our experiment reveals the challenges of this task and we outline promising paths for future work. The data and our code are available for research purposes from: hpi.de/naumann/projects/repeatability/text-mining.html.
Repke, T., Krestel, R.: Visualising Large Document Collections by Jointly Modeling Text and Network Structure. Proceedings of the Joint Conference on Digital Libraries (JCDL). (2020).
Many large text collections exhibit graph structures, either inherent to the content itself or encoded in the metadata of the individual documents. Example graphs extracted from document collections are co-author networks, citation networks, or named-entity-co-occurrence networks. Furthermore, social networks can be extracted from email corpora, tweets, or social media. When it comes to visualising these large corpora, either the textual content or the network graph is used. In this paper, we propose to incorporate both text and graph to visualise not only the semantic information encoded in the documents' content but also the relationships expressed by the inherent network structure. To this end, we introduce a novel algorithm based on multi-objective optimisation to jointly position embedded documents and graph nodes in a two-dimensional landscape. We illustrate the effectiveness of our approach with real-world datasets and show that we can capture the semantics of large document collections better than other visualisations based on either the content or the network information.
Risch, J., Garda, S., Krestel, R.: Hierarchical Document Classification as a Sequence Generation Task. Proceedings of the Joint Conference on Digital Libraries (JCDL). pp. 147-155 (2020).
Hierarchical classification schemes are an effective and natural way to organize large document collections. However, complex schemes make the manual classification time-consuming and require domain experts. Current machine learning approaches for hierarchical classification do not exploit all the information contained in the hierarchical schemes. During training, they do not make full use of the inherent parent-child relation of classes. For example, they neglect to tailor document representations, such as embeddings, to each individual hierarchy level. Our model overcomes these problems by addressing hierarchical classification as a sequence generation task. To this end, our neural network transforms a sequence of input words into a sequence of labels, which represents a path through a tree-structured hierarchy scheme. The evaluation uses a patent corpus, which exhibits a complex class hierarchy scheme and high-quality annotations from domain experts and comprises millions of documents. We re-implemented five models from related work and show that our basic model achieves competitive results in comparison with the best approach. A variation of our model that uses the recent Transformer architecture outperforms the other approaches. The error analysis reveals that the encoder of our model has the strongest influence on its classification performance.
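The idea of a label sequence that "represents a path through a tree-structured hierarchy scheme" can be illustrated with a minimal sketch. It assumes class codes laid out like patent classification symbols (one character for the section, three for the class, four for the subclass); the function names, the `ROOT` node, and the toy hierarchy are hypothetical and are not taken from the paper.

```python
def label_path(code):
    """Expand a class code into its path of labels, top level first.

    Assumes a section/class/subclass layout such as 'H' -> 'H04' -> 'H04L'.
    """
    return [code[:1], code[:3], code[:4]]


def is_valid_path(path, children):
    """Check that every label in the path is a child of the label before it.

    `children` maps each node to the set of its direct children; the
    hypothetical 'ROOT' node holds the top-level labels.
    """
    parent = "ROOT"
    for node in path:
        if node not in children.get(parent, set()):
            return False
        parent = node
    return True


# Toy hierarchy used only for illustration.
toy_children = {
    "ROOT": {"G", "H"},
    "H": {"H04"},
    "H04": {"H04L", "H04W"},
    "G": {"G06"},
    "G06": {"G06F"},
}

target_sequence = label_path("H04L")
```

A sequence generation model would be trained to emit `target_sequence` one label at a time; a check such as `is_valid_path` can reject generated sequences that do not follow the parent-child relation.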
Risch, J., Krestel, R.: Bagging BERT Models for Robust Aggression Identification. Proceedings of the Workshop on Trolling, Aggression and Cyberbullying (TRAC@LREC). pp. 55-61. European Language Resources Association (ELRA) (2020).
Modern transformer-based models with hundreds of millions of parameters, such as BERT, achieve impressive results at text classification tasks. This also holds for aggression identification and offensive language detection, where deep learning approaches consistently outperform less complex models, such as decision trees. While the complex models fit training data well (low bias), they also come with an unwanted high variance. Especially when fine-tuning them on small datasets, the classification performance varies significantly for slightly different training data. To overcome the high variance and provide more robust predictions, we propose an ensemble of multiple fine-tuned BERT models based on bootstrap aggregating (bagging). In this paper, we describe such an ensemble system and present our submission to the shared tasks on aggression identification 2020 (team name: Julian). Our submission is the best-performing system for five out of six subtasks. For example, we achieve a weighted F1-score of 80.3% for task A on the test dataset of English social media posts. In our experiments, we compare different model configurations and vary the number of models used in the ensemble. We find that the F1-score drastically increases when ensembling up to 15 models, but the returns diminish for more models.
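Bootstrap aggregating as described above can be sketched in a few lines. This is a minimal illustration, not the submitted system: `train_stub` is a hypothetical word-count classifier standing in for fine-tuning one BERT model, and averaging the predicted probabilities is an assumed aggregation rule.

```python
import random
from collections import Counter


def bootstrap_sample(data, rng):
    """Draw len(data) training examples with replacement."""
    return [rng.choice(data) for _ in data]


def train_stub(sample):
    """Hypothetical stand-in for fine-tuning one model on one bootstrap sample.

    It only counts how often each word occurs with each label.
    """
    counts = {}
    for text, label in sample:
        for word in text.split():
            counts.setdefault(word, Counter())[label] += 1

    def predict_proba(text):
        votes = Counter()
        for word in text.split():
            votes.update(counts.get(word, {}))
        total = sum(votes.values())
        # Words never seen in this sample give an uninformative 0.5.
        return votes[1] / total if total else 0.5

    return predict_proba


def bagging_predict(models, text):
    """Average the positive-class probabilities of all ensemble members."""
    probs = [model(text) for model in models]
    return sum(probs) / len(probs)


# Toy data: label 1 = aggressive, label 0 = not aggressive.
train = [("you idiot", 1), ("nice day", 0), ("stupid idiot", 1), ("lovely day", 0)]
rng = random.Random(0)
ensemble = [train_stub(bootstrap_sample(train, rng)) for _ in range(15)]
score = bagging_predict(ensemble, "idiot")
```

Because each member sees a slightly different resampled training set, the averaged prediction varies less than that of any single member, which is the variance reduction the abstract refers to.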
Jain, N.: Domain-Specific Knowledge Graph Construction for Semantic Analysis. Extended Semantic Web Conference (ESWC 2020) Ph.D. Symposium (2020).
Knowledge graphs are widely used for systematic representation of real world data. They serve as a backbone for a number of applications such as search, question answering and recommendations. Large scale, general purpose knowledge graphs, having millions of facts, have been constructed through automated techniques from publicly available datasets such as Wikipedia. However, these knowledge graphs are typically incomplete and often fail to correctly capture the semantics of the data. This holds true particularly for domain-specific data, where the generic techniques for automated knowledge graph creation often fail due to novel challenges, such as lack of training data, semantic ambiguities and absence of representative ontologies. In this thesis, we focus on automated knowledge graph construction for the cultural heritage domain. We investigate the research challenges encountered during the creation of an ontology and a knowledge graph from digitized collections of cultural heritage data based on machine learning approaches. We identify the specific research problems for this task and present our methodology and approach for a solution along with preliminary results.
Risch, J., Krestel, R.: Top Comment or Flop Comment? Predicting and Explaining User Engagement in Online News Discussions. Proceedings of the International Conference on Web and Social Media (ICWSM). pp. 579-589. AAAI (2020).
Comment sections below online news articles enjoy growing popularity among readers. However, the overwhelming number of comments makes it infeasible for the average news consumer to read all of them and hinders engaging discussions. Most platforms display comments in chronological order, which neglects that some of them are more relevant to users and are better conversation starters. In this paper, we systematically analyze user engagement in the form of the upvotes and replies that a comment receives. Based on comment texts, we train a model to distinguish comments that have either a high or low chance of receiving many upvotes and replies. Our evaluation on user comments from TheGuardian.com compares recurrent and convolutional neural network models, and a traditional feature-based classifier. Further, we investigate what makes some comments more engaging than others. To this end, we identify engagement triggers and arrange them in a taxonomy. Explanation methods for neural networks reveal which input words have the strongest influence on our model's predictions. In addition, we evaluate on a dataset of product reviews, which exhibit similar properties as user comments, such as featuring upvotes for helpfulness.
Jain, N., Bartz, C., Bredow, T., Metzenthin, E., Otholt, J., Krestel, R.: Semantic Analysis of Cultural Heritage Data: Aligning Paintings and Descriptions in Art-Historic Collections. International Workshop on Fine Art Pattern Extraction and Recognition in conjunction with the 25th International Conference on Pattern Recognition (ICPR 2020) (2020).
Art-historic documents often contain multimodal data in terms of images of artworks and metadata, descriptions, or interpretations thereof. Most research efforts have focused either on image analysis or text analysis independently since the associations between the two modes are usually lost during digitization. In this work, we focus on the task of alignment of images and textual descriptions in art-historic digital collections. To this end, we reproduce an existing approach that learns alignments in a semi-supervised fashion. We identify several challenges while automatically aligning images and texts, specifically for the cultural heritage domain, which limit the scalability of previous works. To improve the performance of alignment, we introduce various enhancements to extend the existing approach that show promising results.
Repke, T., Krestel, R.: Exploration Interface for Jointly Visualised Text and Graph Data. International Conference on Intelligent User Interfaces Companion (IUI '20). (2020).
Many large text collections exhibit graph structures, either inherent to the content itself or encoded in the metadata of the individual documents. Example graphs extracted from document collections are co-author networks, citation networks, or named-entity-co-occurrence networks. Furthermore, social networks can be extracted from email corpora, tweets, or social media. When it comes to visualising these large corpora, traditionally either the textual content or the network graph is used. We propose to incorporate both text and graph to visualise, in a two-dimensional landscape, not only the semantic information encoded in the documents’ content but also the relationships expressed by the inherent network structure. We illustrate the effectiveness of our approach with an exploration interface for different real world datasets.
Hacker, P., Krestel, R., Grundmann, S., Naumann, F.: Explainable AI under Contract and Tort Law: Legal Incentives and Technical Challenges. Artificial Intelligence and Law. (2020).
Risch, J., Krestel, R.: Toxic Comment Detection in Online Discussions. In: Agarwal, B., Nayak, R., Mittal, N., and Patnaik, S. (eds.) Deep Learning-Based Approaches for Sentiment Analysis. pp. 85-109. Springer (2020).
With the exponential growth in the use of social media networks such as Twitter, Facebook, and many others, an astronomical amount of data has been generated. A substantial amount of this user-generated data is in the form of text such as reviews, tweets, and blogs, which provides numerous challenges as well as opportunities to NLP (Natural Language Processing) researchers for discovering meaningful information used in various applications. Sentiment analysis is the study of people’s opinions and sentiments towards entities such as products, services, persons, and organisations mentioned in the text. Sentiment analysis and opinion mining are popular and interesting research problems. In recent years, deep learning approaches have emerged as powerful computational models and have shown significant success in dealing with massive amounts of data in unsupervised settings. Deep learning is revolutionary because it offers an effective way of learning representations and allows the system to learn features automatically from data without the need to design them explicitly. Deep learning algorithms such as deep autoencoders, convolutional and recurrent neural networks (CNN, RNN), Long Short-Term Memory (LSTM) and Generative Adversarial Networks (GAN) have been reported to provide significantly improved results in various natural language processing tasks including sentiment analysis.
Risch, J., Ruff, R., Krestel, R.: Explaining Offensive Language Detection. Journal for Language Technology and Computational Linguistics (JLCL). 34, 29-47 (2020).
Machine learning approaches have proven to be on or even above human-level accuracy for the task of offensive language detection. In contrast to human experts, however, they often lack the capability of giving explanations for their decisions. This article compares four different approaches to make offensive language detection explainable: an interpretable machine learning model (naive Bayes), a model-agnostic explainability method (LIME), a model-based explainability method (LRP), and a self-explanatory model (LSTM with an attention mechanism). Three different classification methods: SVM, naive Bayes, and LSTM are paired with appropriate explanation methods. To this end, we investigate the trade-off between classification performance and explainability of the respective classifiers. We conclude that, with the appropriate explanation methods, the superior classification performance of more complex models is worth the initial lack of explainability.
Risch, J., Stoll, A., Ziegele, M., Krestel, R.: hpiDEDIS at GermEval 2019: Offensive Language Identification using a German BERT model. Proceedings of the 15th Conference on Natural Language Processing (KONVENS). pp. 403-408. German Society for Computational Linguistics & Language Technology, Erlangen, Germany (2019).
Pre-training language representations on large text corpora, for example, with BERT, has recently been shown to achieve impressive performance at a variety of downstream NLP tasks. So far, applying BERT to offensive language identification for German-language texts failed due to the lack of pre-trained, German-language models. In this paper, we fine-tune a BERT model that was pre-trained on 12 GB of German texts to the task of offensive language identification. This model significantly outperforms our baselines and achieves a macro F1 score of 76% on coarse-grained, 51% on fine-grained, and 73% on implicit/explicit classification. We analyze the strengths and weaknesses of the model and derive promising directions for future work.
Jain, N., Krestel, R.: Who is Mona L.? Identifying Mentions of Artworks in Historical Archives. International Conference on Theory and Practice of Digital Libraries (TPDL 2019). pp. 115-122. Springer (2019).
Named entity recognition (NER) plays an important role in many information retrieval tasks, including automatic knowledge graph construction. Most NER systems are typically limited to a few common named entity types, such as person, location, and organization. However, for cultural heritage resources, such as art historical archives, the recognition of titles of artworks as named entities is of high importance. In this work, we focus on identifying mentions of artworks, e.g. paintings and sculptures, from historical archives. Current state-of-the-art NER tools are unable to adequately identify artwork titles due to the particular difficulties presented by this domain. The scarcity of training data for NER for cultural heritage poses further hindrances. To mitigate this, we propose a semi-supervised approach to create high-quality training data by leveraging existing cultural heritage resources. Our experimental evaluation shows significant improvement in NER performance for artwork titles as compared to a baseline approach.
Risch, J., Krestel, R.: Measuring and Facilitating Data Repeatability in Web Science. Datenbank-Spektrum. 19, 117-126 (2019).
Accessible and reusable datasets are a necessity to accomplish repeatable research. This requirement poses a problem particularly for web science, since scraped data comes in various formats and can change due to the dynamic character of the web. Further, usage of web data is typically restricted by copyright-protection or privacy regulations, which hinder publication of datasets. To alleviate these problems and reach what we define as “partial data repeatability”, we present a process that consists of multiple components. Researchers need to distribute only a scraper and not the data itself to comply with legal limitations. If a dataset is re-scraped for repeatability after some time, the integrity of different versions can be checked based on fingerprints. Moreover, fingerprints are sufficient to identify what parts of the data have changed and how much. We evaluate an implementation of this process with a dataset of 250 million online comments collected from five different news discussion platforms. We re-scraped the dataset after pausing for one year and show that less than ten percent of the data has actually changed. These experiments demonstrate that providing a scraper and fingerprints enables recreating a dataset and supports the repeatability of web science experiments.
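The fingerprint comparison between two scrapes can be sketched as follows. This is a minimal illustration under stated assumptions: SHA-256 over the comment text is an assumed fingerprint function, and the comment identifiers and the `compare_versions` helper are hypothetical rather than taken from the paper.

```python
import hashlib


def fingerprint(text):
    """Return a hex digest that changes whenever the text changes."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def fingerprint_dataset(comments):
    """Map each comment id to the fingerprint of its text."""
    return {cid: fingerprint(text) for cid, text in comments.items()}


def compare_versions(old_fp, new_fp):
    """Report which comments were removed, added, or edited between scrapes."""
    removed = sorted(old_fp.keys() - new_fp.keys())
    added = sorted(new_fp.keys() - old_fp.keys())
    changed = sorted(
        cid for cid in old_fp.keys() & new_fp.keys() if old_fp[cid] != new_fp[cid]
    )
    # Share of the original dataset that no longer matches.
    share = (len(removed) + len(changed)) / len(old_fp) if old_fp else 0.0
    return {"removed": removed, "added": added, "changed": changed, "changed_share": share}


# Toy scrapes used only for illustration.
first_scrape = {"c1": "Great article", "c2": "I disagree", "c3": "Thanks"}
second_scrape = {"c1": "Great article", "c2": "I strongly disagree", "c4": "New here"}

report = compare_versions(
    fingerprint_dataset(first_scrape), fingerprint_dataset(second_scrape)
)
```

Only the fingerprints need to be published alongside the scraper, so the comparison works without redistributing the comment texts themselves.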
Risch, J., Krestel, R.: Domain-specific word embeddings for patent classification. Data Technologies and Applications. 53, 108-122 (2019).
Patent offices and other stakeholders in the patent domain need to classify patent applications according to a standardized classification scheme. To examine the novelty of an application it can then be compared to previously granted patents in the same class. Automatic classification would be highly beneficial, because of the large volume of patents and the domain-specific knowledge needed to accomplish this costly manual task. However, a challenge for the automation is patent-specific language use, such as special vocabulary and phrases. To account for this language use, we present domain-specific pre-trained word embeddings for the patent domain. We train our model on a very large dataset of more than 5 million patents and evaluate it at the task of patent classification. To this end, we propose a deep learning approach based on gated recurrent units for automatic patent classification built on the trained word embeddings. Experiments on a standardized evaluation dataset show that our approach increases average precision for patent classification by 17 percent compared to state-of-the-art approaches. In this paper, we further investigate the model’s strengths and weaknesses. An extensive error analysis reveals that the learned embeddings indeed mirror patent-specific language use. The imbalanced training data and underrepresented classes are the most difficult remaining challenge.
Kellermeier, T., Repke, T., Krestel, R.: Mining Business Relationships from Stocks and News. MIDAS@ECML-PKDD. (2019).
In today’s modern society and global economy, decision making processes are increasingly supported by data. Especially in financial businesses it is essential to know about how the players in our global or national market are connected. In this work we compare different approaches for creating company relationship graphs. In our evaluation we see similarities in relationships extracted from Bloomberg and Reuters business news and correlations in historic stock market data.
Bunk, S., Krestel, R.: WELDA: Enhancing Topic Models by Incorporating Local Word Contexts. Joint Conference on Digital Libraries (JCDL 2018). ACM, Fort Worth, Texas, USA (2018).
The distributional hypothesis states that similar words tend to have similar contexts in which they occur. Word embedding models exploit this hypothesis by learning word vectors based on the local context of words. Probabilistic topic models on the other hand utilize word co-occurrences across documents to identify topically related words. Due to their complementary nature, these models define different notions of word similarity, which, when combined, can produce better topical representations. In this paper we propose WELDA, a new type of topic model, which combines word embeddings (WE) with latent Dirichlet allocation (LDA) to improve topic quality. We achieve this by estimating topic distributions in the word embedding space and exchanging selected topic words via Gibbs sampling from this space. We present an extensive evaluation showing that WELDA cuts runtime by at least 30% while outperforming other combined approaches with respect to topic coherence and for solving word intrusion tasks.
Repke, T., Krestel, R.: Bringing Back Structure to Free Text Email Conversations with Recurrent Neural Networks. 40th European Conference on Information Retrieval (ECIR 2018). Springer, Grenoble, France (2018).
Email communication is an integral part of everybody's life nowadays. Especially for business emails, extracting and analysing these communication networks can reveal interesting patterns of processes and decision making within a company. Fraud detection is another application area where precise detection of communication networks is essential. In this paper we present an approach based on recurrent neural networks to untangle email threads originating from forward and reply behaviour. We further classify parts of emails into 2 or 5 zones to capture not only header and body information but also greetings and signatures. We show that our deep learning approach outperforms state-of-the-art systems based on traditional machine learning and hand-crafted rules. Besides using the well-known Enron email corpus for our experiments, we additionally created a new annotated email benchmark corpus from Apache mailing lists.
Repke, T., Krestel, R.: Topic-aware Network Visualisation to Explore Large Email Corpora. International Workshop on Big Data Visual Exploration and Analytics (BigVis). (2018).
Nowadays, more and more large datasets exhibit an intrinsic graph structure. While there exist special graph databases to handle ever increasing amounts of nodes and edges, visualising this data becomes infeasible quickly with growing data. In addition, looking at its structure is not sufficient to get an overview of a graph dataset. Indeed, visualising additional information about nodes or edges without cluttering the screen is essential. In this paper, we propose an interactive visualisation for social networks that positions individuals (nodes) on a two-dimensional canvas such that communities defined by social links (edges) are easily recognisable. Furthermore, we visualise topical relatedness between individuals by analysing information about social links, in our case email communication. To this end, we utilise document embeddings, which project the content of an email message into a high dimensional semantic space and graph embeddings, which project nodes in a network graph into a latent space reflecting their relatedness.
Risch, J., Krestel, R.: My Approach = Your Apparatus? Entropy-Based Topic Modeling on Multiple Domain-Specific Text Collections.Proceedings of the 18th ACM/IEEE Joint Conference on Digital Libraries (JCDL). pp. 283-292 (2018).
Comparative text mining extends from genre analysis and political bias detection to the revelation of cultural and geographic differences, through to the search for prior art across patents and scientific papers. These applications use cross-collection topic modeling for the exploration, clustering, and comparison of large sets of documents, such as digital libraries. However, topic modeling on documents from different collections is challenging because of domain-specific vocabulary. We present a cross-collection topic model combined with automatic domain term extraction and phrase segmentation. This model distinguishes collection-specific and collection-independent words based on information entropy and reveals commonalities and differences of multiple text collections. We evaluate our model on patents, scientific papers, newspaper articles, forum posts, and Wikipedia articles. In comparison to state-of-the-art cross-collection topic modeling, our model achieves up to 13% higher topic coherence, up to 4% lower perplexity, and up to 31% higher document classification accuracy. More importantly, our approach is the first topic model that ensures disjunct general and specific word distributions, resulting in clear-cut topic representations.
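The entropy criterion mentioned in this abstract can be illustrated with a small sketch. It is not the model from the paper: the function name `normalized_entropy` and the toy word counts are assumptions made for illustration only. The idea shown is that a word spread evenly across collections has high entropy (collection-independent), while a word concentrated in one collection has low entropy (collection-specific).

```python
import math
from collections import Counter

def normalized_entropy(word, collections):
    """Entropy of a word's relative frequency across collections, scaled to [0, 1]."""
    rel_freqs = []
    for counts in collections:
        total = sum(counts.values())
        rel_freqs.append(counts[word] / total if total else 0.0)
    mass = sum(rel_freqs)
    if mass == 0 or len(collections) < 2:
        return 0.0
    # Distribution of the word over collections (zero entries contribute nothing).
    probs = [f / mass for f in rel_freqs if f > 0]
    entropy = -sum(p * math.log(p) for p in probs)
    return max(0.0, entropy / math.log(len(collections)))

# Hypothetical toy collections, not data from the paper.
patents = Counter("apparatus method apparatus claim".split())
papers = Counter("approach method approach result".split())

for w in ("method", "apparatus", "approach"):
    print(w, round(normalized_entropy(w, [patents, papers]), 3))
```

In this toy setting, a threshold on the score would separate shared vocabulary from domain-specific vocabulary; how the paper integrates entropy into the topic model is not specified in the abstract.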
Lazaridou, K., Gruetze, T., Naumann, F.: Where in the World Is Carmen Sandiego? Detecting Person Locations via Social Media Discussions.Proceedings of the ACM Conference on Web Science. ACM (2018).
In today's social media, news often spreads faster than in mainstream media, along with additional context and aspects about the current affairs. Consequently, users in social networks are up-to-date with the details of real-world events and the involved individuals. Examples include crime scenes and potential perpetrator descriptions, public gatherings with rumors about celebrities among the guests, rallies by prominent politicians, concerts by musicians, etc. We are interested in the problem of tracking persons mentioned in social media, namely detecting the locations of individuals by leveraging the online discussions about them. Existing literature focuses on the well-known and more convenient problem of user location detection in social media, mainly as the location discovery of the user profiles and their messages. In contrast, we track individuals with text mining techniques, regardless of whether they hold a social network account or not. We observe what the community shares about them and estimate their locations. Our approach consists of two steps: firstly, we introduce a noise filter that prunes irrelevant posts using a recursive partitioning technique. Secondly, we build a model that reasons over the set of messages about an individual and determines his/her locations. In our experiments, we successfully trace the last U.S. presidential candidates through millions of tweets published from November 2015 until January 2017. Our results outperform previously introduced techniques and various baselines.
Ambroselli, C., Risch, J., Krestel, R., Loos, A.: Prediction for the Newsroom: Which Articles Will Get the Most Comments?Proceedings of the 16th Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL). pp. 193-199. ACL, New Orleans, Louisiana, USA (2018).
The overwhelming success of the Web and mobile technologies has enabled millions to share their opinions publicly at any time. But the same success also endangers this freedom of speech due to closing down of participatory sites misused by individuals or interest groups. We propose to support manual moderation by proactively drawing the attention of our moderators to article discussions that most likely need their intervention. To this end, we predict which articles will receive a high number of comments. In contrast to existing work, we enrich the article with metadata, extract semantic and linguistic features, and exploit annotated data from a foreign language corpus. Our logistic regression model improves F1-scores by over 80% in comparison to state-of-the-art approaches.
Repke, T., Krestel, R., Edding, J., Hartmann, M., Hering, J., Kipping, D., Schmidt, H., Scordialo, N., Zenner, A.: Beacon in the Dark: A System for Interactive Exploration of Large Email Corpora.Proceedings of the International Conference on Information and Knowledge Management (CIKM). p. 1--4. ACM (2018).
Emails play a major role in today's business communication, documenting not only work but also decision making processes. The large amount of heterogeneous data in these email corpora renders manual investigations by experts infeasible. Auditors or journalists, e.g., who are looking for irregular or inappropriate content or suspicious patterns, are in desperate need of computer-aided exploration tools to support their investigations. We present our Beacon system for the exploration of such corpora at different levels of detail. A distributed processing pipeline combines text mining methods and social network analysis to augment the already semi-structured nature of emails. The user interface ties into the resulting cleaned and enriched dataset. For the interface design we identify three objectives expert users have: gain an initial overview of the data to identify leads to investigate, understand the context of the information at hand, and have meaningful filters to iteratively focus on a subset of emails. To this end we make use of interactive visualisations for rearranging and aggregating the extracted information to reveal salient patterns.
Risch, J., Krestel, R.: Delete or not Delete? Semi-Automatic Comment Moderation for the Newsroom.Proceedings of the First Workshop on Trolling, Aggression and Cyberbullying (co-located with COLING). pp. 166-176 (2018).
Comment sections of online news providers have enabled millions to share and discuss their opinions on news topics. Today, moderators ensure respectful and informative discussions by deleting not only insults, defamation, and hate speech, but also unverifiable facts. This process has to be transparent and comprehensive in order to keep the community engaged. Further, news providers have to make sure to not give the impression of censorship or dissemination of fake news. Yet manual moderation is very expensive and becomes more and more unfeasible with the increasing amount of comments. Hence, we propose a semi-automatic, holistic approach, which includes comment features but also their context, such as information about users and articles. For evaluation, we present experiments on a novel corpus of 3 million news comments annotated by a team of professional moderators.
van Aken, B., Risch, J., Krestel, R., Löser, A.: Challenges for Toxic Comment Classification: An In-Depth Error Analysis.Proceedings of the 2nd Workshop on Abusive Language Online (co-located with EMNLP). pp. 33-42 (2018).
Toxic comment classification has become an active research field with many recently proposed approaches. However, while these approaches address some of the task’s challenges, others still remain unsolved and directions for further research are needed. To this end, we compare different approaches on a new, large comment dataset and propose an ensemble that outperforms all individual models. Further, we validate our findings on a second dataset. The results of the ensemble enable us to perform an extensive error analysis, which reveals open challenges for state-of-the-art methods and directions towards pending future research. These challenges include missing paradigmatic context and inconsistent dataset labels.
Risch, J., Krestel, R.: Learning Patent Speak: Investigating Domain-Specific Word Embeddings.Proceedings of the Thirteenth International Conference on Digital Information Management (ICDIM). pp. 63-68 (2018).
A patent examiner needs domain-specific knowledge to classify a patent application according to its field of invention. Standardized classification schemes help to compare a patent application to previously granted patents and thereby check its novelty. Due to the large volume of patents, automatic patent classification would be highly beneficial to patent offices and other stakeholders in the patent domain. However, a challenge for the automation of this costly manual task is the patent-specific language use. To facilitate this task, we present domain-specific pre-trained word embeddings for the patent domain. We trained our model on a very large dataset of more than 5 million patents to learn the language use in this domain. We evaluated the quality of the resulting embeddings in the context of patent classification. To this end, we propose a deep learning approach based on gated recurrent units for automatic patent classification built on the trained word embeddings. Experiments on a standardized evaluation dataset show that our approach increases average precision for patent classification by 17 percent compared to state-of-the-art approaches.
Risch, J., Krebs, E., Löser, A., Riese, A., Krestel, R.: Fine-Grained Classification of Offensive Language.Proceedings of GermEval (co-located with KONVENS). pp. 38-44 (2018).
Social media platforms receive massive amounts of user-generated content that may include offensive text messages. In the context of the GermEval task 2018, we propose an approach for fine-grained classification of offensive language. Our approach comprises a Naive Bayes classifier, a neural network, and a rule-based approach that categorize tweets. In addition, we combine the approaches in an ensemble to overcome weaknesses of the single models. We cross-validate our approaches with regard to macro-average F1-score on the provided training dataset.
Risch, J., Garda, S., Krestel, R.: Book Recommendation Beyond the Usual Suspects: Embedding Book Plots Together with Place and Time Information.Proceedings of the 20th International Conference On Asia-Pacific Digital Libraries (ICADL). pp. 227-239 (2018).
Content-based recommendation of books and other media is usually based on semantic similarity measures. While metadata can be compared easily, measuring the semantic similarity of narrative literature is challenging. Keyword-based approaches are biased to retrieve books of the same series or do not retrieve any results at all in sparser libraries. We propose to represent plots with dense vectors to foster semantic search for similar plots even if they do not have any words in common. Further, we propose to embed plots, places, and times in the same embedding space. Thereby, we allow arithmetics on these aspects. For example, a book with a similar plot but set in a different, user-specified place can be retrieved. We evaluate our findings on a set of 16,000 book synopses that spans literature from 500 years and 200 genres and compare our approach to a keyword-based baseline.
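The "arithmetics" on plot and place embeddings can be sketched with toy vectors. Everything below is hypothetical: the four-dimensional vectors, the book and place names, and the helper `relocate` are made up for illustration, and the paper's actual embedding method is not described in the abstract. The sketch only shows the general idea of subtracting one place vector, adding another, and retrieving the nearest book by cosine similarity.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

# Hypothetical dimensions: [adventure, romance, london, paris].
BOOKS = {
    "adventure_in_london": [1.0, 0.0, 1.0, 0.0],
    "adventure_in_paris": [1.0, 0.0, 0.0, 1.0],
    "romance_in_paris": [0.0, 1.0, 0.0, 1.0],
    "romance_in_london": [0.0, 1.0, 1.0, 0.0],
}
PLACES = {
    "london": [0.0, 0.0, 1.0, 0.0],
    "paris": [0.0, 0.0, 0.0, 1.0],
}

def relocate(book, old_place, new_place):
    """Find the book closest to: book - old_place + new_place."""
    query = [b - o + n for b, o, n in
             zip(BOOKS[book], PLACES[old_place], PLACES[new_place])]
    candidates = [title for title in BOOKS if title != book]
    return max(candidates, key=lambda title: cosine(query, BOOKS[title]))

print(relocate("adventure_in_london", "london", "paris"))
```

With learned embeddings the dimensions would not be interpretable like this; the toy axes are chosen only so that the arithmetic can be followed by hand.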
Loster, M., Repke, T., Krestel, R., Naumann, F., Ehmueller, J., Feldmann, B., Maspfuhl, O.: The Challenges of Creating, Maintaining and Exploring Graphs of Financial Entities.Proceedings of the Fourth International Workshop on Data Science for Macro-Modeling (DSMM 2018). ACM (2018).
The integration of a wide range of structured and unstructured information sources into a uniformly integrated knowledge base is an important task in the financial sector. As an example, modern risk analysis methods can benefit greatly from an integrated knowledge base, building in particular a dedicated, domain-specific knowledge graph. Knowledge graphs can be used to gain a holistic view of the current economic situation so that systemic risks can be identified early enough to react appropriately. The use of this graphical structure thus allows the investigation of many financial scenarios, such as the impact of corporate bankruptcy on other market participants within the network. In this particular scenario, the links between the individual market participants can be used to determine which companies are affected by a bankruptcy and to what extent. We took these considerations as a motivation to start the development of a system capable of constructing and maintaining a knowledge graph of financial entities and their relationships. The envisioned system generates this particular graph by extracting and combining information from both structured data sources such as Wikidata and DBpedia, as well as from unstructured data sources such as newspaper articles and financial filings. In addition, the system should incorporate proprietary data sources, such as financial transactions (structured) and credit reports (unstructured). The ultimate goal is to create a system that recognizes financial entities in structured and unstructured sources, links them with the information of a knowledge base, and then extracts the relations expressed in the text between the identified entities. The constructed knowledge base can be used to construct the desired knowledge graph. Our system design consists of several components, each of which addresses a specific subproblem. To this end, Figure 1 gives a general overview of our system and its subcomponents.
Risch, J., Krestel, R.: Aggression Identification Using Deep Learning and Data Augmentation.Proceedings of the First Workshop on Trolling, Aggression and Cyberbullying (co-located with COLING). pp. 150-158 (2018).
Social media platforms allow users to share and discuss their opinions online. However, a minority of user posts is aggressive, thereby hinders respectful discussion, and — at an extreme level — is liable to prosecution. The automatic identification of such harmful posts is important, because it can support the costly manual moderation of online discussions. Further, the automation allows unprecedented analyses of discussion datasets that contain millions of posts. This system description paper presents our submission to the First Shared Task on Aggression Identification. We propose to augment the provided dataset to increase the number of labeled comments from 15,000 to 60,000. Thereby, we introduce linguistic variety into the dataset. As a consequence of the larger amount of training data, we are able to train a special deep neural net, which generalizes especially well to unseen data. To further boost the performance, we combine this neural net with three logistic regression classifiers trained on character and word n-grams, and hand-picked syntactic features. This ensemble is more robust than the individual single models. Our team named “Julian” achieves an F1-score of 60% on both English datasets, 63% on the Hindi Facebook dataset, and 38% on the Hindi Twitter dataset.
Repke, T., Loster, M., Krestel, R.: Comparing Features for Ranking Relationships Between Financial Entities Based on Text.Proceedings of the 3rd International Workshop on Data Science for Macro--Modeling with Financial and Economic Datasets. p. 12:1--12:2. ACM, New York, NY, USA (2017).
Evaluating the credibility of a company is an important and complex task for financial experts. When estimating the risk associated with a potential asset, analysts rely on large amounts of data from a variety of different sources, such as newspapers, stock market trends, and bank statements. Finding relevant information, such as relationships between financial entities, in mostly unstructured data is a tedious task and examining all sources by hand quickly becomes infeasible. In this paper, we propose an approach to rank extracted relationships based on text snippets, such that important information can be displayed more prominently. Our experiments with different numerical representations of text have shown that an ensemble of methods performs best on labelled data provided for the FEIII Challenge 2017.
Risch, J., Krestel, R.: What Should I Cite? Cross-Collection Reference Recommendation of Patents and Papers.Proceedings of the International Conference on Theory and Practice of Digital Libraries (TPDL). pp. 40-46 (2017).
Research results manifest in large corpora of patents and scientific papers. However, both corpora lack a consistent taxonomy and references across different document types are sparse. Therefore, and because of contrastive, domain-specific language, recommending similar papers for a given patent (or vice versa) is challenging. We propose a hybrid recommender system that leverages topic distributions and key terms to recommend related work despite these challenges. As a case study, we evaluate our approach on patents and papers of two fields: medical and computer science. We find that topic-based recommenders complement term-based recommenders for documents with collection-specific language and increase mean average precision by up to 23%. As a result of our work, publications from both corpora form a joint digital library, which connects academia and industry.
Zuo, Z., Loster, M., Krestel, R., Naumann, F.: Uncovering Business Relationships: Context-sensitive Relationship Extraction for Difficult Relationship Types.Proceedings of the Conference "Lernen, Wissen, Daten, Analysen" (LWDA) (2017).
This paper establishes a semi-supervised strategy for extracting various types of complex business relationships from textual data by using only a few manually provided company seed pairs that exemplify the target relationship. Additionally, we offer a solution for determining the direction of asymmetric relationships, such as “ownership of”. We improve the reliability of the extraction process by using a holistic pattern identification method that classifies the generated extraction patterns. Our experiments show that we can accurately and reliably extract new entity pairs occurring in the target relationship by using as few as five labeled seed pairs.
Krestel, R., Risch, J.: How Do Search Engines Work? A Massive Open Online Course with 4000 Participants.Proceedings of the Conference Lernen, Wissen, Daten, Analysen. pp. 259-271 (2017).
Massive Open Online Courses (MOOCs) have introduced a new form of education. With thousands of participants per course, lecturers are confronted with new challenges in the teaching process. In this paper, we describe how we conducted an introductory information retrieval course for participants of all ages and educational backgrounds. We analyze different course phases and compare our experiences with regular on-site information retrieval courses at university.
Gruetze, T., Krestel, R., Lazaridou, K., Naumann, F.: What was Hillary Clinton doing in Katy, Texas?Proceedings of the 26th International Conference on World Wide Web, WWW 2017, Perth, Australia, 3-7 April, 2017. ACM (2017).
During the last presidential election in the United States of America, Twitter drew a lot of attention. This is because many leading persons and organizations, such as U.S. president Donald J. Trump, showed a strong affection for this medium. In this work we neglect the political contents and opinions shared on Twitter and focus on the question: Can we determine and track the physical location of the presidential candidates based on posts in the Twittersphere?
Lazaridou, K., Krestel, R., Naumann, F.: Identifying Media Bias by Analyzing Reported Speech.International Conference on Data Mining. IEEE (2017).
Media analysis can reveal interesting patterns in the way newspapers report the news and how these patterns evolve over time. One example pattern is the quoting choices that media make, which could be used as bias indicators. Media slant can be expressed both with the choice of reporting an event, e.g. a person's statement, and with the words used to describe the event. Thus, automatic discovery of systematic quoting patterns in the news could illustrate to the readers the media's beliefs, such as political preferences. In this paper, we aim to discover political media bias by demonstrating systematic patterns of reporting speech in two major British newspapers. To this end, we analyze news articles from 2000 to 2015. By taking into account different kinds of bias, such as selection, coverage and framing bias, we show that the quoting patterns of newspapers are predictable.
Heller, D., Krestel, R., Ohler, U., Vingron, M., Marsico, A.: ssHMM: Extracting Intuitive Sequence-Structure Motifs from High-Throughput RNA-Binding Protein Data.Nucleic Acids Research.45,11004--11018 (2017).
RNA-binding proteins (RBPs) play an important role in RNA post-transcriptional regulation and recognize target RNAs via sequence-structure motifs. The extent to which RNA structure influences protein binding in the presence or absence of a sequence motif is still poorly understood. Existing RNA motif finders either take the structure of the RNA only partially into account, or employ models which are not directly interpretable as sequence-structure motifs. We developed ssHMM, an RNA motif finder based on a hidden Markov model (HMM) and Gibbs sampling which fully captures the relationship between RNA sequence and secondary structure preference of a given RBP. Compared to previous methods which output separate logos for sequence and structure, it directly produces a combined sequence-structure motif when trained on a large set of sequences. ssHMM’s model is visualized intuitively as a graph and facilitates biological interpretation. ssHMM can be used to find novel bona fide sequence-structure motifs of uncharacterized RBPs, such as the one presented here for the YY1 protein. ssHMM reaches a high motif recovery rate on synthetic data, recovers known RBP motifs from CLIP-Seq data, and scales linearly with the input size, being considerably faster than MEMERIS and RNAcontext on large datasets while being on par with GraphProt. It is freely available on GitHub and as a Docker image.
Naumann, F., Krestel, R.: Das Fachgebiet „Informationssysteme“ am Hasso-Plattner-Institut.Datenbankspektrum.17,69-76 (2017).
Krestel, R., Mottin, D., Müller, E. eds: Proceedings of the Conference "Lernen, Wissen, Daten, Analysen", Potsdam, Germany, September 12-14, 2016.CEUR-WS.org (2016).
Jenders, M., Krestel, R., Naumann, F.: Which Answer is Best? Predicting Accepted Answers in MOOC Forums.Proceedings of the 25th International Conference Companion on World Wide Web. pp. 679-684. International World Wide Web Conferences Steering Committee (2016).
Massive Open Online Courses (MOOCs) have grown in reach and importance over the last few years, enabling a vast userbase to enroll in online courses. Besides watching videos, users participate in discussion forums to further their understanding of the course material. As in other community-based question-answering communities, in many MOOC forums a user posting a question can mark the answer they are most satisfied with. In this paper, we present a machine learning model that predicts this accepted answer to a forum question using historical forum data.
Gruetze, T., Krestel, R., Naumann, F.: Topic Shifts in StackOverflow: Ask it like Socrates.Lecture Notes in Computer Science. p. 213--221. Springer (2016).
Community based question-and-answer (Q&A) sites rely on well posed and appropriately tagged questions. However, most platforms have only limited capabilities to support their users in finding the right tags. In this paper, we propose a temporal recommendation model to support users in tagging new questions and thus improve their acceptance in the community. To underline the necessity of temporal awareness of such a model, we first investigate the changes in tag usage and show different types of collective attention in StackOverflow, a community-driven Q&A website for computer programming topics. Furthermore, we examine the changes over time in the correlation between question terms and topics. Our results show that temporal awareness is indeed important for recommending tags in Q&A communities.
Lazaridou, K., Krestel, R.: Identifying Political Bias in News Articles.International Conference on Theory and Practice of Digital Libraries. IEEE Technical Committee on Digital Libraries (2016).
The political leaning of individuals, such as journalists and politicians, often shapes public opinion on many issues. In the case of online journalism, due to the numerous ongoing events, newspapers have to choose which stories to cover, emphasize, and possibly express their opinion about. These choices depict their profile and could reveal a potential bias towards a certain perspective or political position. Likewise, politicians' choice of language and the issues they broach are an indication of their beliefs and political orientation. Given the amount of user-generated text content online, such as news articles, blog posts, politician statements etc., automatically analyzing this information becomes increasingly interesting, in order to understand what people stand for and how they influence the general public. In this PhD thesis, we analyze UK news corpora along with parliament speeches in order to identify potential political media bias. We currently examine the politicians' mentions and their quotes in news articles and how this referencing pattern evolves in time.
Grundke, M., Jasper, J., Perchyk, M., Sachse, J.P., Krestel, R., Neves, M.: TextAI: Enhancing TextAE with Intelligent Annotation Support.Proceedings of the 7th International Symposium on Semantic Mining in Biomedicine (SMBM 2016). pp. 80-84. CEUR-WS.org (2016).
We present TextAI, an extension to the annotation tool TextAE, that adds support for named-entity recognition and automated relation extraction based on machine learning techniques. Our learning approach is domain-independent and increases the quality of the detected relations with each added training document. We further aim at accelerating and facilitating the manual curation process for natural language documents by supporting simultaneous annotation by multiple users.
Park, J., Blume-Kohout, M., Krestel, R., Nalisnick, E., Smyth, P.: Analyzing NIH Funding Patterns over Time with Statistical Text Analysis.Scholarly Big Data: AI Perspectives, Challenges, and Ideas (SBD 2016) Workshop at AAAI 2016. AAAI (2016).
In the past few years various government funding organizations such as the U.S. National Institutes of Health and the U.S. National Science Foundation have provided access to large publicly-available on-line databases documenting the grants that they have funded over the past few decades. These databases provide an excellent opportunity for the application of statistical text analysis techniques to infer useful quantitative information about how funding patterns have changed over time. In this paper we analyze data from the National Cancer Institute (part of National Institutes of Health) and show how text classification techniques provide a useful starting point for analyzing how funding for cancer research has evolved over the past 20 years in the United States.
Godde, C., Lazaridou, K., Krestel, R.: Classification of German Newspaper Comments.Proceedings of the Conference Lernen, Wissen, Daten, Analysen. pp. 299-310. CEUR-WS.org (2016).
Online news has gradually become an inherent part of many people’s everyday life, with the media enabling a social and interactive consumption of news as well. Readers openly express their perspectives and emotions for a current event by commenting on news articles. They also form online communities and interact with each other by replying to other users’ comments. Due to their active and significant role in the diffusion of information, automatically gaining insights into these comments’ content is an interesting task. We are especially interested in finding systematic differences among the user comments from different newspapers. To this end, we propose the following classification task: Given a news comment thread of a particular article, identify the newspaper it comes from. Our corpus consists of six well-known German newspapers and their comments. We propose two experimental settings using SVM classifiers built on comment- and article-based features. We achieve precision of up to 90% for individual newspapers.
Naumann, F., Krestel, R.: The Information Systems Group at HPI.SIGMOD Record. (2016).
Schubotz, T., Krestel, R.: Online Temporal Summarization of News Events.Proceedings of the IEEE/WIC/ACM International Conference on Web Intelligence and Intelligent Agent Technology (WI-IAT). pp. 679-684. IEEE Computer Society (2015).
Nowadays, an ever increasing number of news articles is published on a daily basis. Especially after notable national and international events or disasters, news coverage rises tremendously. Temporal summarization is an approach to automatically summarize such information in a timely manner. Summaries are created incrementally with progressing time, as soon as new information is available. Given a user-defined query, we designed a temporal summarizer based on probabilistic language models and entity recognition. First, all relevant documents and sentences are extracted from a stream of news documents using BM25 scoring. Second, a general query language model is created which is used to detect typical sentences respective to the query with Kullback-Leibler divergence. Based on the retrieval result, this query model is extended over time by terms appearing frequently during the particular event. Our system is evaluated with a document corpus including test data provided by the Text Retrieval Conference (TREC).
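The two scoring steps named in this abstract, BM25 retrieval and Kullback-Leibler divergence against a query language model, can be sketched as follows. This is a minimal illustration under stated assumptions, not the system from the paper: the function names, the parameter defaults (`k1=1.2`, `b=0.75`), the non-negative idf variant, the add-alpha smoothing, and the toy sentences are all choices made here.

```python
import math
from collections import Counter

def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Score each tokenized document against the query terms with BM25."""
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    doc_freq = Counter(t for d in docs for t in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for t in query:
            if tf[t] == 0:
                continue
            # Non-negative idf variant (an assumption; other variants exist).
            idf = math.log(1 + (n_docs - doc_freq[t] + 0.5) / (doc_freq[t] + 0.5))
            length_norm = 1 - b + b * len(d) / avg_len
            score += idf * tf[t] * (k1 + 1) / (tf[t] + k1 * length_norm)
        scores.append(score)
    return scores

def unigram_model(tokens, vocab, alpha=1.0):
    """Add-alpha smoothed unigram language model over a fixed vocabulary."""
    counts = Counter(tokens)
    denom = len(tokens) + alpha * len(vocab)
    return {w: (counts[w] + alpha) / denom for w in vocab}

def kl_divergence(p, q):
    """KL(p || q) for two models defined over the same vocabulary."""
    return sum(p[w] * math.log(p[w] / q[w]) for w in p)

# Hypothetical toy sentences, not TREC data.
sentences = [s.split() for s in (
    "earthquake hits city",
    "football match result",
    "earthquake damage in city center",
)]
query = "earthquake city".split()
vocab = {t for s in sentences for t in s} | set(query)

scores = bm25_scores(query, sentences)
query_model = unigram_model(query, vocab)
divergences = [kl_divergence(query_model, unigram_model(s, vocab)) for s in sentences]
print(scores)
print(divergences)
```

A lower divergence means a sentence is closer to the query model. The abstract also describes extending the query model over time with frequent event terms, which this sketch leaves out.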
Krestel, R., Werkmeister, T., Wiradarma, T.P., Kasneci, G.: Tweet-Recommender: Finding Relevant Tweets for News Articles.Proceedings of the 24th International World Wide Web Conference (WWW). ACM (2015).
Twitter has become a prime source for disseminating news and opinions. However, the length of tweets prohibits detailed descriptions; instead, tweets sometimes contain URLs that link to detailed news articles. In this paper, we devise generic techniques for recommending tweets for any given news article. To evaluate and compare the different techniques, we collected tens of thousands of tweets and news articles and conducted a user study on the relevance of recommendations.
Jenders, M., Lindhauer, T., Kasneci, G., Krestel, R., Naumann, F.: A Serendipity Model For News Recommendation.KI 2015: Advances in Artificial Intelligence - 38th Annual German Conference on AI, Dresden, Germany, September 21-25, 2015, Proceedings. pp. 111-123. Springer (2015).
Recommendation algorithms typically work by suggesting items that are similar to the ones that a user likes, or items that similar users like. We propose a content-based recommendation technique with the focus on serendipity of news recommendations. Serendipitous recommendations have the characteristic of being unexpected yet fortunate and interesting to the user, and thus might yield higher user satisfaction. In our work, we explore the concept of serendipity in the area of news articles and propose a general framework that incorporates the benefits of serendipity- and similarity-based recommendation techniques. An evaluation against other baseline recommendation models is carried out in a user study.
Roick, M., Jenders, M., Krestel, R.: How to Stay Up-to-date on Twitter with General Keywords.Proceedings of the LWA 2015 Workshops: KDML, FGWM, IR, and FGDB. CEUR-WS.org (2015).
Microblogging platforms make it easy for users to share information through the publication of short personal messages. However, users are not only interested in sharing, but even more so in consuming information. As a result, they are confronted with new challenges when it comes to retrieving information on microblogging platforms. In this paper we present a query expansion method based on latent topics to support users interested in topical information. Similar to news aggregator sites, our approach identifies subtopics to a given query and provides the user with a quick overview of discussed topics within the microblogging platform. Using a document collection of microblog posts from Twitter as an exemplary microblogging platform, we compare the quality of search results returned by our algorithm with a baseline approach and a state-of-the-art microblog-specific query expansion method. To this end, we introduce a novel semi-supervised evaluation strategy based on expert Twitter users. In contrast to existing query expansion methods, our approach can be used to aggregate and visualize topical query results based on the calculated topic models, while achieving competitive results for traditional keyword-based search with regard to mean average precision.
Krestel, R., Dokoohaki, N.: Diversifying Customer Review Rankings. Neural Networks. 66, 36-45 (2015).
E-commerce Web sites owe much of their popularity to consumer reviews accompanying product descriptions. On-line customers spend hours and hours going through heaps of textual reviews to decide which products to buy. At the same time, each popular product has thousands of user-generated reviews, making it impossible for a buyer to read everything. Current approaches to display reviews to users or recommend an individual review for a product are based on the recency or helpfulness of each review. In this paper, we present a framework to rank product reviews by optimizing the coverage of the ranking with respect to sentiment or aspects, or by summarizing all reviews with the top-K reviews in the ranking. To accomplish this, we make use of the assigned star rating for a product as an indicator for a review’s sentiment polarity and compare bag-of-words (language model) with topic models (latent Dirichlet allocation) as a means to represent aspects. Our evaluation on manually annotated review data from a commercial review Web site demonstrates the effectiveness of our approach, outperforming plain recency ranking by 30% and obtaining best results by combining language and topic model representations.
Gruetze, T., Yao, G., Krestel, R.: Learning Temporal Tagging Behaviour. Proceedings of the 24th International Conference on World Wide Web Companion (WWW). pp. 1333-1338. ACM (2015).
Social networking services, such as Facebook, Google+, and Twitter, are commonly used to share relevant Web documents with a peer group. By sharing a document with her peers, a user recommends the content to others and annotates it with a short description text. These short descriptions offer many opportunities for text summarization and categorization. Because today’s social networking platforms are real-time media, the sharing behaviour is subject to many temporal effects, e.g., current events, breaking news, and trending topics. In this paper, we focus on time-dependent hashtag usage of the Twitter community to annotate shared Web-text documents. We introduce a framework for time-dependent hashtag recommendation models and present two content-based models. Finally, we evaluate the introduced models with respect to recommendation quality based on a Twitter dataset consisting of links to Web documents that were aligned with hashtags.