Prof. Dr. Felix Naumann

Tim's Ideas for Theses

Please note that everything listed below is only a rough idea, intended to give you a broader sense of what could be done. Treat each item as a seed for a thesis: it may need more, or it may be too complex. Feel free to read between the lines, extend ideas, go into depth, or combine them where applicable, and contact me with your questions or proposals.

This is part of my personal backlog of things I find interesting or am/was working on to some extent.

Email Classification

For managing an email inbox, and especially for an email exploration tool used in forensic investigations, classifying emails is an important step in processing a (large) corpus. Imagine that the inboxes of several employees are audited after a case of fraud. Fixed classifiers can act as filters that reduce the number of mails to look at and steer the search towards documents relevant to the case. Such classifiers could identify personal emails, irrelevant information from mailing lists, mails in which decisions are made, appointments are proposed, meetings are summarised, ...

This task could also be hierarchical, in the sense that mails with invoices (as determined by a fixed classifier) are then clustered in an unsupervised fashion. Features involve not only the content but also information flow patterns through the social network graph.
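As a rough illustration of the hierarchical idea, the following minimal sketch first filters mails with a fixed rule and then groups the remaining ones without labels. Everything in it is an assumption made for illustration: the keyword list INVOICE_TERMS stands in for a trained classifier, the greedy token-overlap grouping stands in for a proper clustering method, and the sample mails are invented. Social network features are not modelled here.

```python
import re

# Hypothetical keyword list standing in for a trained "invoice" classifier.
INVOICE_TERMS = {"invoice", "payment", "amount", "due"}


def tokens(text):
    """Return the set of lowercase words in a mail body."""
    return set(re.findall(r"[a-z]+", text.lower()))


def is_invoice(text, min_hits=2):
    """Stage 1: fixed classifier (here only a simple keyword rule)."""
    return len(tokens(text) & INVOICE_TERMS) >= min_hits


def jaccard(a, b):
    """Jaccard similarity of two token sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def cluster(texts, threshold=0.3):
    """Stage 2: greedy unsupervised grouping by token overlap."""
    clusters = []  # each cluster is a list of indices into texts
    for i, text in enumerate(texts):
        for members in clusters:
            # The first member serves as the cluster representative.
            if jaccard(tokens(text), tokens(texts[members[0]])) >= threshold:
                members.append(i)
                break
        else:
            clusters.append([i])
    return clusters


mails = [
    "Invoice 17: payment of the amount is due Friday.",
    "Lunch tomorrow?",
    "Invoice 18: payment of the amount is due Monday.",
    "Attached invoice for consulting, payment by wire transfer please.",
]
invoices = [m for m in mails if is_invoice(m)]
groups = cluster(invoices)
print(groups)
```

In a thesis, both stages would likely be replaced by learned models; the sketch only shows how the output of the first stage becomes the input of the second.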

Email deconstruction

Working with email corpora raises the issue of noisy data, no matter whether you consider social network analysis, topic modeling, text mining, language processing, or other tasks. Emails contain entire conversation threads, in which previous messages are included as copies along with their metadata (sender, recipients, date, forwards/answers, ...). Detecting these parts enables us to reconstruct the true social network, deduplicate content in the corpus, and obtain clean text.
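To make the deduplication and network reconstruction goal concrete, here is a minimal sketch that assumes the thread parts have already been extracted as (sender, recipients, body) tuples. The functions fingerprint and build_network, the tuple layout, and the sample messages are hypothetical and chosen only for illustration; real quoted copies differ in more ways than case and whitespace.

```python
import hashlib
from collections import Counter


def fingerprint(sender, body):
    """Hash of the sender plus the lowercased, whitespace-normalised body."""
    normalised = " ".join(body.lower().split())
    return hashlib.sha1(f"{sender}|{normalised}".encode("utf-8")).hexdigest()


def build_network(messages):
    """Count sender -> recipient edges, counting each distinct message once."""
    seen = set()
    edges = Counter()
    for sender, recipients, body in messages:
        key = fingerprint(sender, body)
        if key in seen:
            continue  # quoted copy of a message that was already counted
        seen.add(key)
        for recipient in recipients:
            edges[(sender, recipient)] += 1
    return edges


messages = [
    ("alice", ["bob"], "Can we meet at 10?"),
    ("bob", ["alice"], "Sounds good."),
    ("alice", ["bob"], "can we  meet at 10?"),  # quoted copy inside bob's reply
]
edges = build_network(messages)
print(dict(edges))
```

Without the deduplication step, the quoted copy would count as a second mail from alice to bob and distort the edge weights of the network.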

We have a working prototype that classifies lines as part of a header or an email body and splits conversations. This prototype can be improved, for example by training a deep belief recurrent artificial neural network (or similar) in an unsupervised fashion and then teaching it with labelled samples how to split mails. Further, we want to detect signatures and common phrases at the start and end of mails that introduce noise in downstream tasks.
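The line classification task can be pictured with a simple rule-based baseline. This sketch is not the prototype mentioned above; it only assumes that header lines start with a small set of field names and that quoted messages may be introduced by an "Original Message" style separator. The patterns HEADER_RE and SEPARATOR_RE, the function names, and the sample mail are all hypothetical.

```python
import re

# Assumed header field names; real corpora contain many more variants.
HEADER_RE = re.compile(r"^\s*>*\s*(from|to|cc|bcc|sent|date|subject):", re.IGNORECASE)
# Assumed separator format used by some mail clients.
SEPARATOR_RE = re.compile(
    r"^\s*-+\s*(original message|forwarded message)\s*-+\s*$", re.IGNORECASE
)


def label_lines(raw):
    """Label each line as 'separator', 'header', or 'body'."""
    labels = []
    for line in raw.splitlines():
        if SEPARATOR_RE.match(line):
            labels.append("separator")
        elif HEADER_RE.match(line):
            labels.append("header")
        else:
            labels.append("body")
    return labels


def split_thread(raw):
    """Split a mail into parts; a new part starts where a header block begins."""
    parts = []
    current = {"header": [], "body": []}
    prev = "body"
    for line, label in zip(raw.splitlines(), label_lines(raw)):
        if label == "separator":
            prev = label
            continue  # separators carry no content of their own
        starts_block = label == "header" and prev != "header"
        if starts_block and (current["header"] or current["body"]):
            parts.append(current)
            current = {"header": [], "body": []}
        current[label].append(line)
        prev = label
    parts.append(current)
    return parts


raw = """Sounds good, see you then.

-----Original Message-----
From: alice@example.com
Sent: Monday
Subject: Meeting

Can we meet at 10?"""
parts = split_thread(raw)
print(len(parts))
```

Such rules break on body lines that happen to start with a field name and on clients with other quoting styles, which is one motivation for replacing them with a learned sequence model.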