Sampling
For details on sampling, please see the scripts and logs here. We sampled emails from mailing lists of the Apache Software Foundation via http://mail-archives.apache.org/mod_mbox/: 50 from flink-user for evaluation, 100 from groovy-users for testing, and 250 from hadoop-user for training. In each case, we randomly selected the last mail of a thread from 2017.
Format
Each *.txt file is the original email, downloaded as described above. Each *.ann file was created by Enno and is a JSON file with a "text" field containing the full original email, an "id", empty "meta" data, and a list of "denotations". The denotations are the zones of the email thread; each has an "id", a "start" and "end" character offset, the corresponding "text", a "type", and empty "meta" data.
Types are: Header, Body, Body/Intro, Body/Outro, Body/Signature. Intro and outro are phrases such as "Dear all" and "Thank you, Tim". The Body includes these phrases and signatures.
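As a rough illustration of the format described above, the following Python sketch parses the content of one *.ann file and checks each zone against the email text. It is not part of the dataset's own scripts. The function names are made up for this example, and it assumes that "end" is an exclusive offset (so that text[start:end] equals the zone's "text") and that "type" holds one of the type names listed above as a plain string.

```python
import json

# Zone types listed in the format description.
ZONE_TYPES = {"Header", "Body", "Body/Intro", "Body/Outro", "Body/Signature"}


def load_zones(ann_json):
    """Parse the content of one *.ann file into (text, list of zone dicts)."""
    doc = json.loads(ann_json)
    return doc["text"], doc["denotations"]


def find_problems(text, zones):
    """Return a list of problem descriptions; an empty list means no issue was found."""
    problems = []
    for zone in zones:
        start, end = zone["start"], zone["end"]
        if zone["type"] not in ZONE_TYPES:
            problems.append(f"{zone['id']}: unknown type {zone['type']!r}")
        # ASSUMPTION: "end" is exclusive, so the slice should reproduce zone["text"].
        if text[start:end] != zone["text"]:
            problems.append(f"{zone['id']}: offsets do not match the stored text")
    return problems
```

To use it, read an *.ann file as UTF-8 text, pass the string to load_zones, and pass the result to find_problems. If the offsets turn out to be inclusive, the slice in find_problems would need to be text[start:end + 1] instead.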
Data
download here