Hasso-Plattner-Institut
Prof. Dr. Felix Naumann

Tim's Ideas for Theses

Please note that everything listed below is only a rough idea. The intention is to give you a broader sense of what could be done. Treat them as seed ideas for a thesis: some may need more, some may be too complex. Feel free to read between the lines, extend ideas, go into depth, or combine them where applicable, and contact me with your questions or proposals.

This is part of my personal backlog of things I find interesting or am/was working on to some extent.

Email Classification

Classifying emails is an important step when processing a (large) corpus, both for managing an email inbox and especially for an email exploration tool used for forensic reasons. Imagine the inboxes of several employees being audited after a case of fraud. Fixed classifiers can act as filters that reduce the number of mails to look at and steer the search for relevant documents towards the case. Fixed classifiers could identify personal emails, irrelevant information from mailing lists, or mails in which decisions are made, appointments are proposed, meetings are summarised, ... [ANLP]

This task could also be hierarchical, in the sense that mails with invoices (as determined by the fixed classifier) are then clustered in an unsupervised fashion. Features involve not only the content but also information-flow patterns through the social network graph.
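
As a minimal sketch of such a two-stage setup (assuming scikit-learn and a small, purely made-up set of labelled mails), the fixed classifier below is a plain TF-IDF + logistic regression baseline, and the mails it marks as invoices are then clustered with k-means; the social-network features mentioned above are left out.

    from sklearn.cluster import KMeans
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import make_pipeline

    # Placeholder training data: (body, fixed class) pairs an auditor would label.
    train_texts = [
        "Please find the invoice for March attached.",
        "Are you free for lunch tomorrow?",
        "Attached is the signed contract draft.",
    ]
    train_labels = ["invoice", "personal", "contract"]

    # Stage 1: the fixed classifier, acting as a filter over the corpus.
    fixed_clf = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
    fixed_clf.fit(train_texts, train_labels)

    corpus = [
        "Invoice 4711 is overdue, please pay by Friday.",
        "Reminder: the team meeting moved to 3pm.",
    ]
    predicted = fixed_clf.predict(corpus)

    # Stage 2: cluster only the mails the fixed classifier marked as invoices.
    invoice_mails = [m for m, y in zip(corpus, predicted) if y == "invoice"]
    if len(invoice_mails) >= 2:
        vectors = TfidfVectorizer().fit_transform(invoice_mails)
        clusters = KMeans(n_clusters=2, n_init=10).fit_predict(vectors)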

Email Deconstruction

Working with email corpora, whether for social network analysis, topic modeling, text mining, language processing, or other tasks, raises the issue of noisy data. Emails contain entire threads of conversations, where previous parts are included as copies along with their metadata (sender, recipients, date, forwards/answers, ...). Detecting these parts enables us to reconstruct the true social network, deduplicate content in the corpus, and obtain clean text.

We have a working prototype that classifies lines as part of a header or the email body and splits conversations. This prototype can be improved, for example by training a deep belief or recurrent artificial neural network (or similar) in an unsupervised fashion and then teaching it with labelled samples how to split mails. Further, we want to detect signatures and common phrases at the start and end of mails that introduce noise in downstream tasks.
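
The prototype itself is not shown here; as a rough illustration of the line-level view of the problem, the sketch below uses a few hand-written regular expressions (a rule baseline, not the neural approach described above) to label lines and split a raw mail at quoted-conversation boundaries.

    import re

    # A few common patterns; real corpora contain many more variants, which is
    # exactly why a learned line classifier is attractive over hand-written rules.
    SEPARATOR = re.compile(
        r"^\s*(-{2,}\s*(Original Message|Forwarded message)\s*-{2,}|On .+ wrote:)",
        re.IGNORECASE,
    )
    HEADER_FIELD = re.compile(r"^\s*(From|Sent|To|Cc|Subject|Date):\s", re.IGNORECASE)

    def label_and_split(raw_mail: str):
        """Label each line as 'header' or 'body' and split at conversation boundaries."""
        parts, current = [], []
        for line in raw_mail.splitlines():
            if SEPARATOR.match(line) and current:
                parts.append(current)   # a quoted conversation part starts here
                current = []
            is_header = SEPARATOR.match(line) or HEADER_FIELD.match(line)
            current.append(("header" if is_header else "body", line))
        if current:
            parts.append(current)
        return parts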

Document Classification

Part of our research involves handling large collections of diverse files, ranging from contracts, memos, and photos to emails and more. To make such a collection more accessible, a first step would be to classify the documents so that more specialised downstream tasks can be applied with higher accuracy. Our project partner suggested about 400 classes and subclasses for internal documents. A master's student would have to work on other data, though, such as the documents attached in the Enron Corpus.

Downstream tasks could involve the extraction of contract partners, salient entities/phrases, or time-related information.
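
A hedged sketch of one such downstream step, assuming spaCy with its small English model installed: contract partners are crudely approximated by ORG/PERSON entities and time-related information by DATE entities. Real extraction would of course need more than off-the-shelf NER.

    import spacy

    # Assumes the small English model has been installed beforehand:
    #   python -m spacy download en_core_web_sm
    nlp = spacy.load("en_core_web_sm")

    def extract_metadata(text: str) -> dict:
        """Very crude extraction of possible contract partners and time expressions."""
        doc = nlp(text)
        return {
            "partners": [e.text for e in doc.ents if e.label_ in ("ORG", "PERSON")],
            "dates": [e.text for e in doc.ents if e.label_ == "DATE"],
        }

    print(extract_metadata(
        "This agreement is made on 14 March 2001 between Enron Corp. and John Arnold."
    ))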

Data Visualisation and Enrichment

The goal of our Business Communication Analysis project is to make large collections of files explorable in the context of a forensic investigation, for example by an auditor looking into a case of fraud or similar occasions. A diverse set of extraction and classification techniques is just the first step towards automatic exploration; lists of words and numbers are not very helpful on their own.

We need novel ideas to visualise previously extracted or generated information. That involves combining the people in a conversation within a time-based context, identifying events, temporally salient phrases/entities, the flow of information, or simply timelines, where abstract "topics" link to higher-resolution information and ultimately to the original documents and their connections to others. Everything can be enriched and linked with external knowledge, such as a network of company relationships we developed in our group.
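
As a deliberately trivial baseline, far simpler than the views described above and assuming only pandas and matplotlib with made-up records, a weekly per-sender mail volume already gives a first time-based overview:

    import matplotlib.pyplot as plt
    import pandas as pd

    # Made-up per-mail records as they might come out of the extraction step.
    mails = pd.DataFrame({
        "sender": ["alice", "bob", "alice", "bob", "alice"],
        "date": pd.to_datetime(["2001-01-02", "2001-01-05", "2001-01-20",
                                "2001-02-03", "2001-02-10"]),
    })

    # Weekly mail volume per sender -- a first, very coarse timeline view.
    volume = (mails.set_index("date")
                   .groupby("sender")
                   .resample("W")
                   .size()
                   .unstack(0, fill_value=0))
    volume.plot(kind="bar", stacked=True)
    plt.ylabel("mails per week")
    plt.tight_layout()
    plt.show()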

The goal is to provide special investigators with a powerful toolkit that makes their work more efficient by giving them a useful overview of what is inside the document collection, far, far beyond filename and date of creation.

Document Segmentation and Labelling

[to be clarified]

(Visually) segment blocks in a document and classify the type of each block. Context: a collection of scanned documents has to be analysed. Running OCR on individual blocks, as opposed to first running OCR and then trying to identify blocks, might be easier, since document layouts are originally meant to help humans find the information they are looking for. This becomes especially helpful once document classification is in place and we know that something is, for example, a receipt. Segmenting it and using ANNs or CRFs to create a labelled mask would help to extract metadata about that document.
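
A minimal sketch of the segment-then-OCR idea, assuming OpenCV 4 and pytesseract (with the Tesseract binary installed): blocks are found by a crude morphological merge rather than the learned ANN/CRF labelling described above.

    import cv2
    import pytesseract

    def ocr_blocks(path: str):
        """Crudely segment a scanned page into blocks, then run OCR per block."""
        image = cv2.imread(path)
        gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
        _, binary = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)

        # Dilate so that characters belonging to the same block grow together.
        kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (25, 15))
        merged = cv2.dilate(binary, kernel, iterations=1)

        contours, _ = cv2.findContours(merged, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
        blocks = []
        for contour in contours:
            x, y, w, h = cv2.boundingRect(contour)
            if w * h < 500:          # skip tiny specks
                continue
            text = pytesseract.image_to_string(image[y:y + h, x:x + w])
            blocks.append(((x, y, w, h), text.strip()))
        return blocks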