Email Structure

Bringing Back Structure to Free Text Email Conversations with Recurrent Neural Networks

Project Description

Email communication plays an integral part in everybody's life nowadays. Especially for business emails, extracting and analysing these communication networks can reveal interesting patterns of processes and decision making within a company. Fraud detection is another application area where precise detection of communication networks is essential. In this paper we present an approach based on recurrent neural networks to untangle email threads originating from forward and reply behaviour. We further classify parts of emails into two or five zones to capture not only header and body information but also greetings and signatures.

We use the model presented in our ECIR paper in QuaggaLib. This library parses the raw email body into separate blocks and extracts meta-data from inline-headers. This kind of pre-processing should be used in all applications using email data. The library provides the actual written text content as well as the meta-data that would otherwise be hidden in the unstructured email body.

Reference

If you use our data or find this work related to yours, please cite us as...

Implementations

Production-ready email parsing:
- https://github.com/HPI-Information-Systems/QuaggaLib
Reference implementation as used in the paper including competitor approaches and data:
- https://github.com/HPI-Information-Systems/Quagga

Datasets

On this page we provide datasets used in our ECIR 2018 paper and a fully parsed Enron corpus. Data was manually annotated using our Enno tool.

- newly collected ASF email corpus, annotated by email zones only
- selection of Enron corpus, annotated by email zones only
- selection of Enron corpus, detailed annotation (including names, aliases, metadata)
- automatically split, normalised, and cleaned Enron corpus as graph

Apache Software Foundation Emails (ASF)

Sampling

For details on sampling, please look at the scripts and logs here. We sampled emails from mailing lists of the Apache Software Foundation via http://mail-archives.apache.org/mod_mbox/, specifically 50 from flink-user for evaluation, 100 from groovy-users for testing, and 250 from hadoop-user for training. In each case we randomly selected the last mail of a thread within 2017.
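The selection step can be illustrated with a minimal sketch. It is not the sampling script referred to above: the record fields ("id", "thread", "date"), the fixed seed, and the reading of "last mail of a thread within 2017" as "latest mail of each thread among those sent in 2017" are all assumptions made for illustration.

```python
import random
from datetime import datetime


def last_mail_per_thread(mails, year=2017):
    """Map each thread id to its latest mail sent in the given year."""
    latest = {}
    for mail in mails:
        if mail["date"].year != year:
            continue
        current = latest.get(mail["thread"])
        if current is None or mail["date"] > current["date"]:
            latest[mail["thread"]] = mail
    return latest


def sample_last_mails(mails, k, year=2017, seed=0):
    """Randomly pick up to k thread-final mails; the seed makes runs repeatable."""
    candidates = sorted(last_mail_per_thread(mails, year).values(),
                        key=lambda mail: mail["id"])
    return random.Random(seed).sample(candidates, min(k, len(candidates)))


# Hypothetical records, not taken from the actual mailing lists.
MAILS = [
    {"id": "m1", "thread": "t1", "date": datetime(2017, 3, 1)},
    {"id": "m2", "thread": "t1", "date": datetime(2017, 3, 5)},
    {"id": "m3", "thread": "t2", "date": datetime(2016, 12, 30)},
    {"id": "m4", "thread": "t2", "date": datetime(2017, 1, 2)},
    {"id": "m5", "thread": "t3", "date": datetime(2016, 6, 1)},
]
```

The last mail of a thread is a natural choice here because it usually quotes the earlier messages, so a single file carries the whole conversation.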

Format

Each *.txt file is the original email downloaded as described above. Each *.ann file is created by Enno and is a JSON file with "text" containing the full original email, an "id", empty "meta" data, and a list of "denotations". These are the zones of the email thread, each of which has an "id", a "start" and "end" character offset, the corresponding "text", a "type" and empty "meta" data.

Types are: Header, Body, Body/Intro, Body/Outro, Body/Signature. Intro and outro are phrases such as "Dear all" and "Thank you, Tim". The Body includes these phrases and signatures.
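As a minimal sketch of how such a file might be read, the following Python loads an *.ann file, checks each denotation's offsets against the email text, and counts zones per type. The sample annotation is made up for illustration. The sketch assumes that "end" is an exclusive offset into "text", that ids are integers, and that empty "meta" data is an empty object; none of this is confirmed by the description above.

```python
import json
from collections import Counter


def load_ann(path):
    """Read one Enno *.ann file (a JSON document) into a dict."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)


def mismatched_denotations(ann):
    """Return denotations whose stored "text" differs from the slice of the
    email selected by their offsets (assumes "end" is exclusive)."""
    email = ann["text"]
    return [d for d in ann["denotations"]
            if email[d["start"]:d["end"]] != d["text"]]


def zone_counts(ann):
    """Count denotations per zone type."""
    return Counter(d["type"] for d in ann["denotations"])


def _zone(email, zone_id, snippet, zone_type):
    """Build a hypothetical denotation by locating snippet in the email."""
    start = email.index(snippet)
    return {"id": zone_id, "start": start, "end": start + len(snippet),
            "text": snippet, "type": zone_type, "meta": {}}


# Made-up example shaped like the format described above.
EMAIL = ("From: alice@example.org\nSubject: Build\n\n"
         "Dear all,\nthe build is fixed.\nThank you, Tim\n")
SAMPLE_ANN = {
    "id": "sample-1",
    "text": EMAIL,
    "meta": {},
    "denotations": [
        _zone(EMAIL, 1, "From: alice@example.org\nSubject: Build\n", "Header"),
        _zone(EMAIL, 2, "Dear all,\nthe build is fixed.\nThank you, Tim\n", "Body"),
        _zone(EMAIL, 3, "Dear all,", "Body/Intro"),
        _zone(EMAIL, 4, "Thank you, Tim", "Body/Outro"),
    ],
}
```

Calling mismatched_denotations(load_ann(path)) on a real file is a quick way to find out whether the exclusive-end assumption holds for the data at hand.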

Data

download here

Annotated Enron Emails

Sampling

For details on sampling, please look at the scripts and logs here. This is based on the Enron corpus provided by the CMU.

Format

Same as for the ASF dataset. Additionally, the detailed annotation contains denotations for all semi-structured metadata in the email, for example the from and to fields in inline-headers, down to the level where email addresses and names are individual denotations. The names in the Body/Intro, Body/Outro and Signatures are also marked. The "meta" field contains the parsed email header. There is also a list of "relations", each of which has an "id" and an "origin" and "target" referring to denotation ids, for example when an email address is an alias for a person's name. There are relation types for "ContactInfo", "Alias", "WorksFor".

This data can potentially be used to create a highly detailed parser that extracts much of the metadata hidden in email thread text.
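To make the relation format concrete, here is a small sketch that resolves each relation to the texts of the denotations it connects. The key name "type" for the relation type, the integer ids, and the direction of the example (email address as origin, name as target) are assumptions; the sample data is made up.

```python
def resolve_relations(ann):
    """Yield (relation type, origin text, target text) for each relation
    whose origin and target ids both match a denotation."""
    by_id = {d["id"]: d for d in ann["denotations"]}
    for relation in ann.get("relations", []):
        origin = by_id.get(relation["origin"])
        target = by_id.get(relation["target"])
        if origin is None or target is None:
            continue  # skip relations pointing at unknown denotation ids
        # "type" as the key holding the relation type is an assumption.
        yield relation["type"], origin["text"], target["text"]


# Made-up annotation fragment; only the fields used above are filled in.
SAMPLE_DETAILED = {
    "denotations": [
        {"id": 1, "text": "john.doe@example.com"},
        {"id": 2, "text": "John Doe"},
    ],
    "relations": [
        {"id": 1, "type": "Alias", "origin": 1, "target": 2},
        {"id": 2, "type": "WorksFor", "origin": 2, "target": 99},
    ],
}
```

Collecting the resolved "Alias" pairs over all annotated emails would give a simple lookup from email addresses to names, which is one way the detailed annotation could feed a parser.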

Data

Enron Data with detailed annotation (~400 emails)
Enron Data with zone annotation (as ASF, ~800 emails)
Enron Data with rudimentary Header/Body annotation (~170 emails)

Fully Parsed Enron Graph

Using our QuaggaLib and some additional processing, we create a new Enron corpus. This is a more truthful corpus, because

- each email only appears once,
- emails are stripped to only contain actual text,
- people are disambiguated,
- metadata is extracted and can be used directly.

Code is also in the repository above and here.

The corpus is available as

GraphML
JSON, one email per line (part a, part b; simply concatenate and untar)
List of people
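The "concatenate and untar" step for the JSON export can be sketched in Python as follows. The file names in the usage comment are placeholders, and nothing is assumed about the fields of each email object beyond it being one JSON document per line.

```python
import json
import shutil
import tarfile


def concatenate_parts(parts, target):
    """Join split archive parts, in the given order, into one file."""
    with open(target, "wb") as out:
        for part in parts:
            with open(part, "rb") as src:
                shutil.copyfileobj(src, out)
    return target


def iter_emails(archive_path):
    """Yield one parsed JSON object per non-empty line of every regular
    file in the tar archive (compression is detected automatically)."""
    with tarfile.open(archive_path, "r:*") as tar:
        for member in tar:
            if not member.isfile():
                continue
            with tar.extractfile(member) as handle:
                for line in handle:
                    if line.strip():
                        yield json.loads(line)


# Usage with placeholder file names:
#   archive = concatenate_parts(["part_a", "part_b"], "enron_emails.tar")
#   for email in iter_emails(archive):
#       print(sorted(email))  # inspect which keys each email object has
```

Reading the archive as a stream avoids unpacking the whole corpus to disk before looking at the first email.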

Related Work

Original Code for Jangada, Carvalho, 2004
More information and data for Jangada (600+ annotated mails in 20 newsgroup dataset)
MinorThird Library used by Jangada
400 annotated emails by Lampert et al. (Enron data)
Zebra System for email zoning
Another implementation of Zebra
Talon is an awesome universal tool for everything that has to do with email structure