Working with email corpora raises the issue of noisy data, whether the task is social network analysis, topic modeling, text mining, language processing, or something else. Emails often contain entire conversation threads, in which earlier messages are embedded as copies along with their metadata (sender, recipients, date, forward/reply markers, ...). Detecting these embedded parts enables us to reconstruct the true social network, deduplicate content in the corpus, and obtain clean text.
We have a working prototype that classifies each line as part of a header or of an email body and uses these labels to split conversations into individual messages. This prototype can be improved, for example by training a deep belief recurrent artificial neural network (or similar) in an unsupervised fashion and then teaching it with labelled samples how to split mails. Furthermore, we want to detect signatures and common phrases at the start and end of mails, which introduce noise in downstream tasks.
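To make the idea of line classification and conversation splitting concrete, here is a minimal rule-based sketch in Python. It is not the prototype itself: the regular expressions, the names `classify_line` and `split_thread`, and the two labels `"header"` and `"body"` are assumptions chosen for illustration, and the patterns cover only a few common English-language client formats.

```python
import re

# Hypothetical patterns for illustration; real corpora need many more variants.
HEADER_FIELD = re.compile(
    r"^\s*>*\s*(From|To|Cc|Bcc|Sent|Date|Subject):\s*\S", re.IGNORECASE
)
SEPARATOR = re.compile(
    r"^\s*>*\s*-{2,}\s*(Original Message|Forwarded by.*|Forwarded message)\s*-{2,}\s*$",
    re.IGNORECASE,
)
REPLY_INTRO = re.compile(r"^\s*On .+ wrote:\s*$")


def classify_line(line):
    """Label a single line as 'header' or 'body' using simple heuristics."""
    if SEPARATOR.match(line) or REPLY_INTRO.match(line) or HEADER_FIELD.match(line):
        return "header"
    return "body"


def split_thread(text):
    """Split a raw mail into parts, each holding its header and body lines.

    A new part starts whenever a header line follows body text.
    """
    parts = [{"header": [], "body": []}]
    prev = "body"
    for line in text.splitlines():
        if not line.strip():
            # Blank lines never change state; keep them only inside bodies.
            if prev == "body":
                parts[-1]["body"].append(line)
            continue
        label = classify_line(line)
        current = parts[-1]
        if label == "header" and prev == "body" and (current["header"] or current["body"]):
            parts.append({"header": [], "body": []})
        parts[-1][label].append(line)
        prev = label
    return parts
```

Calling `split_thread` on a reply that quotes an earlier message below an "Original Message" separator yields two parts: the new text with an empty header, and the quoted message with its header block. Rules like these misfire on body lines that happen to look like header fields (for example a sentence starting with "To:"), which is one reason a learned classifier is attractive.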