Hasso-Plattner-Institut
Prof. Dr. Felix Naumann
 

Advanced Topic Modeling

Lecturer: Dr. Ralf Krestel

Description

Topic modeling is a probabilistic approach to text mining. Large document collections are analyzed to identify latent (hidden) topics. The most popular topic model is latent Dirichlet allocation (LDA). This algorithm is based on a generative view of the document writing process, which it tries to reverse engineer: "An author decides to write a document about the Olympics in Rio. She picks the topics 'sports' and 'brazil'. Then, for each word she wants to write, she picks a word from one of the two topics." LDA only sees the final documents and has to estimate what these underlying topics are and what they look like. There are many extensions and adaptations of the basic LDA method. We will focus on two of the following four advanced topic models:

  • Labeled LDA (paper)
  • Author-topic model (paper)
  • Dynamic topic model (paper)
  • Hierarchical topic model (paper)
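
The generative view that LDA tries to reverse can be sketched in a few lines of Python. This is only an illustrative sketch, not part of the seminar material: the toy vocabularies, their probabilities, the Dirichlet parameters `alpha`, and the helper names `sample_dirichlet` and `generate_document` are all assumptions made up for this example.

```python
import random

def sample_dirichlet(alpha, rng):
    """Draw one sample from a Dirichlet distribution via normalized Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]

def generate_document(topics, alpha, length, rng):
    """Generate one document following the generative story behind LDA.

    topics: dict mapping a topic name to a dict {word: probability}
    alpha:  Dirichlet concentration parameters, one per topic
    """
    names = list(topics)
    # 1. The "author" decides how much of each topic the document contains.
    theta = sample_dirichlet(alpha, rng)
    words, assignments = [], []
    for _ in range(length):
        # 2. For each word position, pick a topic according to theta ...
        topic = rng.choices(names, weights=theta, k=1)[0]
        # 3. ... then pick a word from that topic's word distribution.
        vocab = list(topics[topic])
        probs = [topics[topic][w] for w in vocab]
        word = rng.choices(vocab, weights=probs, k=1)[0]
        words.append(word)
        assignments.append(topic)
    return theta, words, assignments

# Toy topics; the vocabularies and probabilities are invented for illustration.
TOPICS = {
    "sports": {"medal": 0.4, "athlete": 0.3, "stadium": 0.3},
    "brazil": {"rio": 0.5, "samba": 0.2, "amazon": 0.3},
}

rng = random.Random(0)
theta, words, assignments = generate_document(
    TOPICS, alpha=[0.5, 0.5], length=10, rng=rng
)
print("topic mixture:", theta)
print("document:", list(zip(words, assignments)))
```

Inference in LDA goes the other way: only `words` is observed, and both the topic mixture and the per-topic word distributions have to be estimated.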

These models rely heavily on methods from probability theory and statistics. Prior knowledge is not required but certainly helpful. If you start sweating when seeing a math formula, this is not the right seminar for you. If you are curious and motivated to understand these models, then this seminar is a great fit. WARNING: Depending on your background, you might have to do a lot of reading and thinking...

In this seminar, each student will present one topic modeling paper in a 30-minute talk followed by 30 minutes of discussion. We will start with two groups of students implementing latent Dirichlet allocation using

  • Collapsed Gibbs sampling (paper) and
  • Variational Bayes (paper).

Then we will split the two groups into teams of two, and each team will work on one advanced topic model (presentation + implementation). At the end of the semester, each team's implemented topic model will be used to model a dataset of US patents. Finally, each team has to hand in a written summary report (5 pages, two-column style) on their topic. Active participation in all discussions is mandatory. If all goes well, we plan to compile the reports into a scientific paper and submit it to a conference or workshop. Depending on the success of this seminar, we will offer a Master's project or Master's thesis topic based on it in the following semester...

This seminar is limited to 8 participants. If more students apply, we will select participants randomly.

The grade will consist of

  • 25% Presentation of Paper
  • 25% Active Participation
  • 50% Project Report

The seminar usually takes place on Tuesdays at 11:00 in D-E 9/10.

Schedule

Date | Topic | Presenter
12.4.16 | Introduction | Ralf Krestel
18.4.16 | 13:00 Deadline for Registration via E-mail (15:00 Notification) |
19.4.16 | Initial Meeting | Ralf Krestel
26.4.16 | Optional Meetings |
3.5.16 | Meeting Gibbs Group (45min) and Bayes Group (45min) |
10.5.16 | No Meeting |
17.5.16 | Two presentations Gibbs and Bayes | Teams
24.5.16 | Comparing Gibbs and Bayes Code, New Teams | Ralf Krestel
31.5.16 | Optional Meeting |
7.6.16 | No Meeting |
14.6.16 | Individual Meetings |
21.6.16 | Two presentations | Teams
28.6.16 | No Meeting |
5.7.16 | Comparing Code and Discussing Experiments | Ralf Krestel
12.7.16 | No Meeting |
21.7.16 | Discussing Results | Teams
31.8.16 | Report Deadline |