Topic modeling is a probabilistic approach to text mining. Large document collections are analyzed to identify latent (hidden) topics. The most popular topic model is latent Dirichlet allocation (LDA). This algorithm is based on a generative view of how documents are written, which it tries to reverse engineer: "An author decides to write a document about the Olympics in Rio. She picks the topics 'sports' and 'brazil'. Then, for each word she wants to write, she picks a word from one of the two topics." LDA only sees the final documents and has to estimate what these underlying topics are and what they look like. There are many extensions and adaptations of the basic LDA method. We will focus on two of the following four advanced topic models:
- Labeled LDA (paper)
- Author-topic model (paper)
- Dynamic topic model (paper)
- Hierarchical topic model (paper)
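To make the generative story behind LDA concrete, here is a minimal sketch in Python. It is an illustration under stated assumptions, not the implementation used in the seminar: the toy vocabulary, the two hand-made topics, the value of `alpha`, and the function names `sample_dirichlet` and `generate_document` are all made up for this example.

```python
import random


def sample_dirichlet(alpha, size, rng):
    """Draw from a symmetric Dirichlet by normalizing Gamma(alpha, 1) draws."""
    draws = [rng.gammavariate(alpha, 1.0) for _ in range(size)]
    total = sum(draws)
    return [d / total for d in draws]


def generate_document(topic_word, vocab, alpha, n_words, rng):
    """Generate one document following the LDA generative story."""
    n_topics = len(topic_word)
    # The "author" decides how much of each topic the document contains.
    theta = sample_dirichlet(alpha, n_topics, rng)
    assignments, words = [], []
    for _ in range(n_words):
        # For each word position: pick a topic, then a word from that topic.
        z = rng.choices(range(n_topics), weights=theta)[0]
        w = rng.choices(vocab, weights=topic_word[z])[0]
        assignments.append(z)
        words.append(w)
    return theta, assignments, words


# Toy vocabulary and two hand-made topics (illustrative values only).
vocab = ["medal", "athlete", "stadium", "rio", "samba", "amazon"]
topic_word = [
    [0.4, 0.3, 0.3, 0.0, 0.0, 0.0],  # hypothetical "sports" topic
    [0.0, 0.0, 0.0, 0.4, 0.3, 0.3],  # hypothetical "brazil" topic
]

rng = random.Random(0)
theta, assignments, words = generate_document(
    topic_word, vocab, alpha=0.5, n_words=10, rng=rng
)
print("topic proportions:", theta)
print("document:", " ".join(words))
```

Inference in LDA runs in the opposite direction: given only the generated words, it has to estimate `theta` and `topic_word`.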
These models rely heavily on methods from probability theory and statistics. Prior knowledge is not required but certainly helpful. If you start sweating when seeing a math formula, this is not the right seminar for you. If you are curious and motivated to understand these models, this seminar is a great fit. WARNING: Depending on your background you might have to do a lot of reading and thinking...
In this seminar, each student will present one topic modeling paper in a 30-minute talk followed by 30 minutes of discussion. We will start with two groups of students implementing latent Dirichlet allocation using
- Collapsed Gibbs sampling (paper) and
- Variational Bayes (paper).
Then we will split up the two groups into teams of two, and each team will work on one advanced topic model (presentation + implementation). At the end of the semester, each team's implemented topic model will be used to model a dataset of US patents. Finally, each team has to hand in a written summary report (5 pages; two-column style) on their topic. Active participation in all discussions is mandatory. If all goes well, we plan to compile a scientific paper from the reports and submit it to a conference or workshop. Depending on the success of this seminar, we will offer a Master's project or Master's thesis topic in the following semester based on this seminar...
This seminar is limited to 8 participants. If more apply, we will pick randomly.
The grade will consist of
- 25% Presentation of Paper
- 25% Active Participation
- 50% Project Report
The seminar usually takes place Tuesdays at 11:00 in D-E 9/10.