PhD Thesis 12
On the Use of Language Models and Topic Models in the Web: New Algorithms for Filtering, Classification, Ranking, and Recommendation
Abstract
The huge amount of information available on the Web poses various challenges to
Web science researchers. Finding interesting and relevant pieces of information is like
finding a needle in a haystack. Since most searchable information is either written in or
described by natural language, efficient and effective methods to represent natural language
are crucial. In this thesis we look into topic models and language models as a means to
structure, compress, and represent textual data. In particular, we look into latent Dirichlet
allocation, a generative topic model, and its use for various task scenarios on the Web. We
discuss advantages and disadvantages in comparison with language models in the context of
common Web applications from areas such as information retrieval, recommender systems,
and machine learning.
In particular, we investigate the use of topic models and language models for filtering,
classification, ranking, and recommendation. These are popular methods to cope with
the information overload the average user has to deal with on the Web today. Many
applications make use of these methods, such as Web search engines, Web 2.0 platforms, or
recommender systems. We present different approaches based on topic model and language
model representations for the individual tasks.
Regarding the filtering of information, we propose an approach based on support vector
machines to predict the importance of news articles. We compare the use of topic models
and language models as the underlying representation of the newspaper articles to accurately
predict important news automatically. Sentiment classification deals with automatically
detecting the polarity of texts. This is often done by looking up sentiment scores for each
term in a document in a lexicon. We propose a context-dependent sentiment lexicon based
on latent topics identified by latent Dirichlet allocation. The most common task on the Web
is probably the ranking of information. The huge success of search engines such as Google
or Yahoo! is based on sophisticated ranking algorithms. We look into diversification of
rankings to cover the information needs of as many users as possible. To this end, we
investigate not only diversification of Web search results, but also diversification of product
review rankings. The recommendation of products, friends, news, or events is very popular
on e-commerce sites like Amazon, or social Web applications like Facebook. We propose
an approach for tag recommendation within folksonomies. We show that combining
topic models, which overcome the cold start problem, with language models, which personalize
the recommendations, is very successful.
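
The lexicon look-up mentioned above for sentiment classification, and the idea of making the lexicon depend on a topic, can be illustrated with a minimal sketch. This is not the method developed in the thesis: the lexicon entries, the scores, the topic labels, and the function names `polarity` and `classify` are all hypothetical, and the topic is passed in by hand rather than inferred with latent Dirichlet allocation.

```python
# Hypothetical toy lexicons; every entry and score is made up for illustration.
GLOBAL_LEXICON = {"good": 1.0, "bad": -1.0, "unpredictable": -0.5}

# Topic-specific overrides: the same term can carry a different polarity
# depending on the topic of the text (assumed topic labels).
TOPIC_LEXICON = {
    "movies": {"unpredictable": 1.0},
    "cars": {"unpredictable": -1.0},
}


def polarity(tokens, topic=None):
    """Sum per-term scores, preferring the topic-specific score if one exists."""
    topic_scores = TOPIC_LEXICON.get(topic, {})
    total = 0.0
    for token in tokens:
        term = token.lower()
        # Fall back to the global lexicon, then to a neutral score of 0.0.
        total += topic_scores.get(term, GLOBAL_LEXICON.get(term, 0.0))
    return total


def classify(tokens, topic=None):
    """Map the summed score to a coarse polarity label."""
    score = polarity(tokens, topic)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


if __name__ == "__main__":
    print(classify("The plot was unpredictable".split(), topic="movies"))
    print(classify("The steering was unpredictable".split(), topic="cars"))
```

In this sketch the term "unpredictable" is scored differently for the two assumed topics, which is the kind of context dependence a topic-based lexicon is meant to capture.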