Hasso Plattner Institut

Ralf Krestel


PhD Thesis 12

On the Use of Language Models and Topic Models in the Web: New Algorithms for Filtering, Classification, Ranking, and Recommendation

The huge amount of information available on the Web poses various challenges to Web science researchers. Finding interesting and relevant pieces of information is like finding a needle in a haystack. Since most searchable information is either written in or described by natural language, efficient and effective methods to represent natural language are crucial. In this thesis we look into topic models and language models as a means to structure, compress, and represent textual data. In particular, we look into latent Dirichlet allocation, a generative topic model, and its use for various task scenarios on the Web. We discuss its advantages and disadvantages in comparison with language models in the context of common Web applications from areas such as information retrieval, recommender systems, and machine learning.

We investigate the use of topic models and language models for four tasks: filtering, classification, ranking, and recommendation. These are popular methods to cope with the information overload the average user has to deal with on the Web today. Many applications make use of these methods, such as Web search engines, Web 2.0 platforms, or recommender systems. We present different approaches based on topic model and language model representations for the individual tasks.

Regarding the filtering of information, we propose an approach based on support vector machines to predict the importance of news articles. We compare topic models and language models as the underlying representation of the news articles for automatically and accurately predicting important news.

Sentiment classification deals with automatically detecting the polarity of texts. This is often done by looking up a sentiment score in a lexicon for each term in a document. We propose a context-dependent sentiment lexicon based on latent topics identified by latent Dirichlet allocation.

The most common task on the Web is probably the ranking of information. The huge success of search engines from Google or Yahoo! is based on sophisticated ranking algorithms. We look into diversification of rankings to cover the information needs of as many users as possible. To this end, we investigate not only the diversification of Web search results, but also the diversification of product review rankings.

The recommendation of products, friends, news, or events is very popular on e-commerce sites like Amazon and in social Web applications like Facebook. We propose an approach for tag recommendation within folksonomies. We show that combining topic models, which overcome the cold start problem, with language models, which personalize the recommendations, is very successful.
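
The idea of a context-dependent sentiment lexicon mentioned in the abstract can be illustrated with a minimal Python sketch. It is not the method from the thesis: the lexicon entries, the scores, the topic labels, and the function name `score_document` are all hypothetical, and the topic of a document is assumed to be given, whereas the thesis derives latent topics with latent Dirichlet allocation. The sketch only shows why looking up a term's score per topic can differ from a single global lookup.

```python
# Hypothetical global lexicon: one fixed score per term (values are made up).
GLOBAL_LEXICON = {"unpredictable": -0.5, "great": 0.8, "boring": -0.7}

# Hypothetical topic-specific entries: the same term may change polarity
# depending on the topic it appears in.
TOPIC_LEXICON = {
    "movies": {"unpredictable": 0.6},
    "cars": {"unpredictable": -0.8},
}


def score_document(tokens, topic=None):
    """Average the sentiment scores of all known terms in a document.

    A topic-specific entry takes precedence over the global entry.
    Terms found in neither lexicon are ignored. Returns 0.0 if no
    term is known.
    """
    topic_scores = TOPIC_LEXICON.get(topic, {})
    scores = []
    for term in tokens:
        if term in topic_scores:
            scores.append(topic_scores[term])
        elif term in GLOBAL_LEXICON:
            scores.append(GLOBAL_LEXICON[term])
    return sum(scores) / len(scores) if scores else 0.0


if __name__ == "__main__":
    review = ["unpredictable", "plot"]
    print(score_document(review))            # global lookup only
    print(score_document(review, "movies"))  # topic-aware lookup
    print(score_document(review, "cars"))
```

In this toy setup the term "unpredictable" is scored negatively by the global lexicon, positively in the hypothetical "movies" topic, and more negatively in the hypothetical "cars" topic, which is the kind of context dependence a single global lexicon cannot express.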

