Hasso-Plattner-Institut25 Jahre HPI
Hasso-Plattner-Institut25 Jahre HPI

Toni Grütze

Adding Value to Text with User-generated Content

In recent years, the ever-growing amount of documents on the Web as well as in closed systems for private or business contexts led to a considerable increase of valuable textual information about topics, events, and entities. It is a truism that the majority of information (i.e., business-relevant data) is only available in unstructured textual form. The text mining research field comprises various practice areas that have the common goal of harvesting high-quaiity information from textual data. These information can help addressing users’ information needs.

In this thesis, we utilize the knowledge represented in user-generated content (UGC) originating from various social media services to improve text mining results. These social media platforms provide a plethora of information with varying focuses. In many cases, an essential feature of such services is to share relevant content with a peer group. Thus, the data exchanged in these communities tend to be focused on the interests of the user base. The popularity of sociai media services is growing continuously and the inherent knowledge is availabie to be utilized. We show that this knowledge can be used for three different tasks.

Initially, we demonstrate that when searching persons with ambiguous names, the information from Wikipedia can be bootstrapped to group web search results according to the individuals occurring in the documents. We introduce two models and different means to handle persons missing in the UGC source. We show that the proposed approaches outperform traditional algorithms for search result clustering, Secondly, we discuss how the categorization of texts according to continuously changing community-generated folksonomies helps users to identify new information related to their interests. We specificaily target temporal changes in the UGC and show how they influence the quality of different tag recommendation approaches. Finally, we introduce an algorithm to attempt the entity linking problem, a necessity for harvesting entity knowledge from large text collections. The goal is the linkage of mentions within the documents with their reaI-world entities. A major focus lies on the efficient derivation of coherent links.

For each of the contributions, we provide a wide range of experiments on various text corpora as well as different sources of UGC. The evaluation shows the added value that the usage of these sources provides and confirms the appropriateness of leveraging user-generated content to serve different information needs.