Hasso-Plattner-Institut
Prof. Dr. Felix Naumann
 

Theses on Corpus Exploration

In our Business Communication Analysis project, we aim to develop novel tools for exploring massive unstructured document corpora, for example emails, attachments, and news. Journalists, auditors, or special investigators are overwhelmed by the sheer amount of data they have to analyse in order to gain insights from such datasets. There are many interesting thesis topics in this area, each focusing on a different aspect of that work and touching the research areas of text mining, text summarisation, document classification, topic modelling, named entity extraction, entity linking, relationship extraction, as well as social network and graph analysis. We work together with our industry partner from the financial sector to put our prototypes in the hands of auditors for real-world feedback.

Please note that the topics listed below are only rough ideas, intended to give you a broader sense of what could be done. See them as seed ideas for a thesis: a given idea may need more substance, or it may be too complex. Feel free to read between the lines, extend ideas, go into depth, or combine them where applicable, and contact Tim with your questions or proposals.

Most of the following ideas are part of our Mímir project.

Genealogy of Language

Maps-like Navigation for Large Document Corpora

Extracting and Disambiguating Hidden Meta-data From Quoted Email Messages

Entity Disambiguation

Data Tinder

Aggregating and Monitoring Current Relevant News for Credit Risk Managers