Prof. Dr. Felix Naumann

Tim's Ideas for Theses

Please note that everything listed below is only a rough idea. The intention is to give you a broader sense of what could be done. See them as seed ideas for a thesis: one may need more, another may be too complex. Feel free to read between the lines, extend ideas, go into depth, or combine them where applicable, and contact me with your questions or proposals.

This is part of my personal backlog of things I find interesting or am/was working on to some extent.

Autoencoder/Embedding vs. PCA, t-SNE & Co.

Unfortunately, our world has only three dimensions, and computer screens even only two. To visualise high-dimensional data, we have to reduce it to be able to see it. This process comes with the cost of losing information, as all the data has to be compressed into a much smaller space. In the case of embeddings (e.g. word2vec) this is very helpful: Mikolov et al. used a very simple neural network to reduce a large vocabulary of words down to 50/100/300 dimensions and found that this lower-dimensional space has useful properties, for example that distances between data points can be interpreted as semantic similarity. To visualise such a vocabulary, data scientists and researchers often use t-SNE (van der Maaten & Hinton, 2008). While other dimensionality reduction approaches (e.g. PCA) may not reflect relatedness properties well, t-SNE uses a cost function that keeps points that are clustered in the high-dimensional space together even in the low-dimensional space.

In this master’s thesis, we want to compare the influence of different algorithms for the initial dimensionality reduction stage of t-SNE. Furthermore, we would like to transfer the t-SNE cost function to some kind of autoencoder, so that we are able to reduce all data in one go with a single (simple) model. This thesis can be very low-level and math-heavy (properly understanding the mathematical model behind neural nets and developing the cost function) or engineering-heavy (comparing lots of existing methods to find good configurations).
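As a starting point for such a comparison, the linear baseline (PCA) can be sketched in a few lines of NumPy; a thesis would pit this against t-SNE (e.g. sklearn.manifold.TSNE) and an autoencoder trained with a t-SNE-style cost. The data here is random and purely illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
# stand-in for an embedding matrix: 200 "words" in 50 dimensions
X = rng.normal(size=(200, 50))

def pca(X, n_components=2):
    """Project X onto its top principal components via SVD."""
    Xc = X - X.mean(axis=0)               # centre the data
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:n_components].T       # coordinates in the reduced space

Y = pca(X)
print(Y.shape)  # (200, 2)
```

Unlike t-SNE, this projection is linear and preserves global variance rather than local neighbourhoods; that difference is exactly what the thesis would quantify.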

#DataVisualisation #DeepLearning

Email Classification

For managing an email inbox, and especially for an email exploration tool used for forensic purposes, classifying emails is an important step in processing a (large) corpus of emails. Imagine the inboxes of several employees being audited after a case of fraud. Fixed classifiers can act as filters to reduce the number of mails to look at and steer the search for relevant documents towards the case. Fixed classifiers could identify work-related or personal emails, irrelevant information from mailing lists, mails where decisions are made, appointments are proposed, meetings are summarised, ... [ANLP]

This task could also be hierarchical, in the sense that mails with invoices (as determined by the fixed classifier) are then clustered in an unsupervised fashion. Features involve not only the content, but also information flow patterns through the social network graph.
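A minimal sketch of such a fixed classifier, assuming scikit-learn is available; the mails and labels below are invented for illustration only:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# invented mini-corpus; a real thesis would use e.g. the Enron corpus
mails = [
    "Please find attached the invoice for October.",
    "Invoice #442 is overdue, please transfer the amount by Friday.",
    "Are you free for lunch tomorrow?",
    "Happy birthday! See you at the party tonight.",
]
labels = ["work", "work", "personal", "personal"]

vec = TfidfVectorizer()
X = vec.fit_transform(mails)
clf = LogisticRegression(C=10).fit(X, labels)

# classify an unseen mail
print(clf.predict(vec.transform(["The invoice is attached, please pay soon."])))
```

The second, unsupervised stage could then cluster all mails predicted as "work" (e.g. with sklearn.cluster.KMeans over the same TF-IDF vectors), and graph features from the communication network could be appended to the feature vectors.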

Email deconstruction

Working with email corpora, no matter whether you consider social network analysis, topic modelling, text mining, language processing, or other tasks, raises the issue of noisy data. Emails contain entire threads of conversations, where previous parts are included as copies along with their metadata (sender, recipients, date, forwards/answers, ...). Detecting these parts enables us to reconstruct the true social network, deduplicate content in the corpus, and obtain clean text.

We have a working prototype that classifies lines as part of a header or email body and splits conversations. This prototype can be improved, for example by training a deep belief or recurrent artificial neural network (or similar) in an unsupervised fashion and then teaching it with labelled samples how to split mails. Further, we want to detect signatures and common phrases at the start and end of mails that introduce noise in downstream tasks.
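For intuition, a crude rule-based baseline for the splitting step might look as follows; the marker patterns and the sample mail are invented, and the prototype's learned line classifier would replace these hand-written rules:

```python
import re

# hand-written markers that typically introduce a quoted previous mail
SPLIT_PATTERNS = [
    re.compile(r"^-{2,}\s*Original Message\s*-{2,}$", re.IGNORECASE),
    re.compile(r"^On .+ wrote:$"),
]

def split_thread(raw: str) -> list:
    """Split a raw mail body into its conversation parts."""
    parts, current = [], []
    for line in raw.splitlines():
        stripped = line.strip()
        if current and any(p.match(stripped) for p in SPLIT_PATTERNS):
            parts.append("\n".join(current).strip())
            current = []
        current.append(line)
    parts.append("\n".join(current).strip())
    return parts

mail = """Thanks, sounds good!

-----Original Message-----
From: Ken Lay
Subject: Friday meeting

Let's meet on Friday."""

parts = split_thread(mail)
print(len(parts))  # 2
```

Such regexes break on the many header variants found in real corpora, which is precisely the motivation for learning the line classifier instead.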

Document Classification

Part of our research involves handling large collections of diverse files, ranging from contracts, memos, and photos to emails and more. To make such a collection more accessible, a first step would be to classify the documents, so that more specialised downstream tasks can be applied with higher accuracy. Our project partner suggested about 400 classes and subclasses for internal documents. A master's student would have to work on other data though, such as the documents attached to mails in the Enron corpus.

Downstream tasks could involve the extraction of contract partners, salient entities/phrases, or time-related information.

Data Visualisation and Enrichment

The goal of our Business Communication Analysis project is to make large collections of files explorable in the context of a forensic investigation, e.g. by an auditor looking into a case of fraud or similar occasions. A diverse set of extraction and classification techniques is just the first step towards automatic exploration. However, lists of words and numbers are not very helpful.

We need novel ideas to visualise previously extracted or generated information. That involves placing the participants of a conversation in a time-based context, identifying events, temporally salient phrases/entities, the flow of information, or simply timelines, where abstract "topics" link to higher-resolution information and/or ultimately the original documents and/or their connections to others. Everything can be enriched/linked with external knowledge, such as a network of company relationships we developed in our group.

The goal is to provide a powerful toolkit that makes special investigators' work more efficient by providing a useful overview of what is inside the document collection, far beyond filename and date of creation.

Document Segmentation and Labelling

[to be clarified]

(Visually) segment blocks in a document and classify the type of each block. Context: a collection of scanned documents has to be analysed. Running OCR on individual blocks, as opposed to first running OCR and then trying to identify blocks, might be easier, as documents' layouts are originally meant to make it easier for humans to find the information they are looking for. This becomes especially helpful once we have document classification in place and know, for example, that something is a receipt. Segmenting it and using ANNs or CRFs to create a labelled mask would help to extract metadata about that document.
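One classical baseline for the visual segmentation step is a projection profile over a binarised page. A sketch with NumPy on a toy "page" array (a real system would operate on scanned images and add column profiles plus the ANN/CRF labelling):

```python
import numpy as np

def segment_rows(page):
    """Find horizontal text blocks via a row projection profile.

    `page` is a binary image (1 = ink). Blank rows separate blocks;
    returns a list of (start_row, end_row) intervals.
    """
    profile = page.sum(axis=1) > 0   # True where a row contains ink
    blocks, start = [], None
    for i, has_ink in enumerate(profile):
        if has_ink and start is None:
            start = i
        elif not has_ink and start is not None:
            blocks.append((start, i))
            start = None
    if start is not None:
        blocks.append((start, len(profile)))
    return blocks

# toy "scanned page": two ink blocks separated by blank rows
page = np.zeros((10, 8), dtype=int)
page[1:3, :] = 1    # header block
page[6:9, 2:6] = 1  # body block
print(segment_rows(page))  # [(1, 3), (6, 9)]
```

Each interval would then be cropped, OCRed individually, and passed to a block-type classifier.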

Entity Disambiguation

There are multiple sources to extract entities from in email corpora. Firstly, there are the names of people in the "from" and "to" fields of an email. Throughout the corpus, the name may change (Ken Lay, Kenneth Lay, K. Lay, ken.ley@enron.com, ...). Keeping track of these aliases is not as straightforward as one may think. Furthermore, linking mentions in the email text, possibly found using named entity recognition, to these people is even trickier.

To improve disambiguation accuracy, we can make use of additional information beyond naive string matching itself. For example, we can use probabilistic models that consider time and context, or the neighbourhood in the social network graph spanned by the email communication. The preprocessing model developed and evaluated in this thesis will be a key component of a bigger system for document exploration. Using these findings, it is possible to link pieces of information to a person, rather than just to a name.
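A naive string-matching baseline, using only the Python standard library, illustrates both the idea and its limits (the normalisation rules and the 0.7 threshold are arbitrary choices for this sketch):

```python
from difflib import SequenceMatcher

def normalize(name):
    """Crude canonical form: drop mail domains, unify separators, sort tokens."""
    name = name.split("@")[0].replace(".", " ").replace(",", " ")
    return " ".join(sorted(name.lower().split()))

def is_alias(a, b, threshold=0.7):
    """Guess whether two names refer to the same person."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio() >= threshold

print(is_alias("Ken Lay", "Kenneth Lay"))        # True
print(is_alias("ken.lay@enron.com", "Ken Lay"))  # True
print(is_alias("K. Lay", "Kenneth Lay"))         # False
```

The last case is exactly where pure string matching fails and time, context, and the communication graph have to step in.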

Data Tinder

Almost all extraction tasks suffer from the same issue: the trade-off between precision and recall. Do you want a model that finds all possible things (e.g. named entities in a text) but also reports many additional entities that are in fact no entities (or includes additional tokens/characters), or should the model be very precise, meaning that the identified entities actually are entities, but you miss a lot of those that could have been found?

In OpenIE tasks, the model is usually tuned for precision. In this thesis, we want to investigate active learning models to identify structural/methodic errors and clean the extracted data. The interface for the human-in-the-loop should be as simple as possible: imagine having a Tinder-like app that can be used to label data on the fly, wherever you are. Usually, evaluating extraction tasks requires human annotators to sit at a desktop computer and precisely label the data (e.g. highlight text that contains an entity), which is a boring and painstaking task. However, if the machine suggests entities that a user can annotate while waiting in line or when bored on the train, that could significantly improve the annotation experience and therefore increase the amount of labelled data.

In this thesis, a student has to develop a model that enables this level of annotation simplicity. The machine would have to learn from the feedback it receives and improve the extraction model. The hypothesis (based on experience) is that extraction models make common, systematic errors. The "cleanup" model would have to identify rules or the structure of these errors and thus clean the extracted data.
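The underlying active-learning loop (uncertainty sampling) can be sketched as follows, assuming scikit-learn; the synthetic candidate pool and the "oracle" are stand-ins for extracted candidates and the user's swipe feedback:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
X_pool = rng.normal(size=(100, 5))       # one feature vector per extracted candidate
oracle = (X_pool[:, 0] > 0).astype(int)  # stand-in for human swipe feedback

# seed set chosen so both classes are present
order = np.argsort(X_pool[:, 0])
labelled = list(order[:5]) + list(order[-5:])

clf = LogisticRegression().fit(X_pool[labelled], oracle[labelled])
for _ in range(20):                       # 20 "swipes"
    proba = clf.predict_proba(X_pool)[:, 1]
    uncertainty = np.abs(proba - 0.5)     # closest to 0 = least confident
    uncertainty[labelled] = np.inf        # never ask twice
    labelled.append(int(np.argmin(uncertainty)))
    clf.fit(X_pool[labelled], oracle[labelled])

print(round(clf.score(X_pool, oracle), 2))
```

Always querying the most uncertain candidate is the simplest strategy; the thesis would go further and learn the structure of the extractor's systematic errors from this feedback.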

#RepresentationLearning #ActiveLearning