Hasso-Plattner-Institut
  
Prof. Dr. Felix Naumann
  
 

Advanced Topic Modeling

Lecturer: Dr. Ralf Krestel

Description

Topic modeling is a probabilistic approach to text mining: large document collections are analyzed to identify latent (hidden) topics. The most popular topic model is latent Dirichlet allocation (LDA). It is based on a generative view of how documents are written, which it tries to reverse engineer: "An author decides to write a document about the Olympics in Rio. She picks the topics 'sports' and 'brazil'. Then, for each word she wants to write, she picks a word from one of the two topics." LDA sees only the final documents and has to estimate what the underlying topics are and what they look like. There are many extensions and adaptations of the basic LDA method. We will focus on two of the following four advanced topic models:

  • Labeled LDA (paper)
  • Author-topic model (paper)
  • Dynamic topic model (paper)
  • Hierarchical topic model (paper)
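
To make the generative story from the description concrete, here is a minimal Python sketch of how a single document could be produced under LDA's assumptions. It is an illustration only, not the seminar's reference implementation: the toy topics, their words and probabilities, the function names, and the concentration value alpha=0.5 are all hypothetical. The document's topic mixture is drawn from a symmetric Dirichlet distribution, which is the standard choice in LDA.

```python
import random


def sample_dirichlet(alphas, rng):
    """Draw one Dirichlet sample by normalizing independent gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alphas]
    total = sum(draws)
    return [d / total for d in draws]


def generate_document(topic_word, alpha, length, rng):
    """Generate one document following LDA's generative story.

    topic_word: dict mapping topic name -> {word: probability}
    alpha: symmetric Dirichlet concentration for the topic mixture
    """
    topics = list(topic_word)
    # 1. The "author" decides how much of each topic the document contains.
    theta = sample_dirichlet([alpha] * len(topics), rng)
    words, assignments = [], []
    for _ in range(length):
        # 2. For each word position, pick a topic from the document's mixture.
        z = rng.choices(topics, weights=theta, k=1)[0]
        # 3. Pick a word from that topic's word distribution.
        vocab = list(topic_word[z])
        weights = [topic_word[z][v] for v in vocab]
        w = rng.choices(vocab, weights=weights, k=1)[0]
        assignments.append(z)
        words.append(w)
    return theta, assignments, words


# Hypothetical toy topics; words and probabilities are made up for illustration.
TOPICS = {
    "sports": {"medal": 0.4, "athlete": 0.4, "stadium": 0.2},
    "brazil": {"rio": 0.5, "samba": 0.3, "stadium": 0.2},
}

if __name__ == "__main__":
    rng = random.Random(0)
    theta, assignments, words = generate_document(
        TOPICS, alpha=0.5, length=10, rng=rng
    )
    print("topic mixture:", dict(zip(TOPICS, theta)))
    print("document:", " ".join(words))
```

Inference runs in the opposite direction: LDA is given only the generated words for many documents and has to estimate both the per-document mixtures (theta) and the per-topic word distributions (topic_word here).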

The advanced topic models listed above rely heavily on methods from probability theory and statistics. Prior knowledge is not required but certainly helpful. If you start sweating when you see a math formula, this is not the right seminar for you. If you are curious and motivated to understand these models, then this seminar is a great fit. WARNING: Depending on your background, you might have to do a lot of reading and thinking...

In this seminar, each student will present one topic modeling paper in a 30-minute talk followed by 30 minutes of discussion. We will start with two groups of students implementing latent Dirichlet allocation using

  • Collapsed Gibbs sampling (paper) and
  • Variational Bayes (paper).

Then we will split the two groups into teams of two, and each team will work on one advanced topic model (presentation + implementation). At the end of the semester, each team's implemented topic model will be used to model a dataset of US patents. Finally, each team has to hand in a written summary report (5 pages, two-column style) on their topic. Active participation in all discussions is mandatory. If all goes well, we plan to compile the reports into a scientific paper and submit it to a conference or workshop. Depending on the success of this seminar, we will offer a Master's project or Master's thesis topic based on it in the following semester...

This seminar is limited to 8 participants. If more apply, we will select participants at random.

The grade will consist of

  • 25% Presentation of Paper
  • 25% Active Participation
  • 50% Project Report

The seminar usually takes place Tuesdays at 11:00 in D-E 9/10.

Schedule

Date           Topic                                                      Presenter
12.4.16        Introduction                                               Ralf Krestel
18.4.16 13:00  Deadline for Registration via E-mail (15:00 Notification)
19.4.16        Initial Meeting                                            Ralf Krestel
26.4.16        Optional Meetings
3.5.16         Meeting Gibbs Group (45min) and Bayes Group (45min)
10.5.16        No Meeting
17.5.16        Two presentations                                          Gibbs and Bayes Teams
24.5.16        Comparing Gibbs and Bayes Code, New Teams                  Ralf Krestel
31.5.16        Optional Meeting
7.6.16         No Meeting
14.6.16        Individual Meetings
21.6.16        Two presentations                                          Teams
28.6.16        No Meeting
5.7.16         Comparing Code and Discussing Experiments                  Ralf Krestel
12.7.16        No Meeting
21.7.16        Discussing Results                                         Teams
31.8.16        Report Deadline