Hasso-Plattner-Institut
Prof. Dr. Felix Naumann
  
 

Bringing Back Structure to Free Text Email Conversations with Recurrent Neural Networks

Project Description

Email is an integral part of everyday life. For business email in particular, extracting and analysing communication networks can reveal patterns in processes and decision making within a company. Fraud detection is another application area where precise detection of communication networks is essential. In our paper, we present an approach based on recurrent neural networks to untangle email threads that arise from forwarding and replying. We further classify parts of emails into two or five zones to capture not only header and body information but also greetings and signatures.

QuaggaLib uses the model presented in our ECIR paper. The library parses the raw email body into separate blocks and extracts meta-data from inline headers. We recommend this kind of pre-processing for any application that works with email data: the library provides the text that was actually written, as well as the meta-data that would otherwise remain hidden in the unstructured email body.
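To make the idea of "blocks" and "inline headers" concrete, here is a minimal sketch of what such pre-processing produces. It is not QuaggaLib's API and not the recurrent model from the paper: it swaps in a simple rule-based heuristic that only recognises the "-----Original Message-----" separator. The names Block, split_blocks, SEPARATOR and HEADER_FIELD are hypothetical and chosen for illustration only.

```python
import re
from dataclasses import dataclass, field
from typing import Dict, List

# ASSUMPTION: a hand-written heuristic, not the learned model used by QuaggaLib.
# Matches separator lines such as "-----Original Message-----".
SEPARATOR = re.compile(r"^\s*-{2,}\s*Original Message\s*-{2,}\s*$", re.IGNORECASE)
# Matches common inline header fields that follow a separator.
HEADER_FIELD = re.compile(r"^(From|To|Cc|Sent|Date|Subject):\s*(.*)$", re.IGNORECASE)


@dataclass
class Block:
    """One message within a thread: inline-header meta-data plus written text."""
    meta: Dict[str, str] = field(default_factory=dict)
    lines: List[str] = field(default_factory=list)

    @property
    def text(self) -> str:
        return "\n".join(self.lines).strip()


def split_blocks(raw_body: str) -> List[Block]:
    """Split a raw email body into blocks at quoted-message separators."""
    blocks = [Block()]
    in_header = False
    for line in raw_body.splitlines():
        if SEPARATOR.match(line):
            # A separator starts a new block whose inline header follows.
            blocks.append(Block())
            in_header = True
            continue
        if in_header:
            match = HEADER_FIELD.match(line)
            if match:
                blocks[-1].meta[match.group(1).lower()] = match.group(2).strip()
                continue
            if not line.strip() and not blocks[-1].meta:
                continue  # blank line before the first header field
            in_header = False  # first non-header line ends the inline header
        blocks[-1].lines.append(line)
    return blocks


raw = """Sounds good, see you then.

Alice

-----Original Message-----
From: Bob Example
Sent: Monday, May 14, 2001 9:12 AM
To: Alice Example
Subject: Meeting

Can we meet on Tuesday?
"""

for block in split_blocks(raw):
    print(block.meta, repr(block.text))
```

Real threads contain many separator and header variants that a fixed pattern like this misses, which is one motivation for a learned model over hand-written rules.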

Implementations

Datasets

[coming soon] Dec 2018 or Jan 2019; undocumented annotated data can be found in our repositories

  • annotated ASF archive data
  • annotated Enron data
  • detailed annotated Enron data
  • selected IDs for the train, test, and eval splits from Enron
  • fully Quagga-processed Enron corpus (text)
  • cleaned ("actual") Enron communication graph