Hasso-Plattner-Institut
Prof. Dr. h.c. Hasso Plattner

Complete Publication List of the EPIC chair

The Scielo Corpus: a Parallel Corpus of Scientific Publications for Biomedicine

Mariana Neves, Antonio Jimeno Yepes, Aurélie Névéol
In Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC 2016), 2016. European Language Resources Association (ELRA).

Abstract:

The biomedical scientific literature is a rich source of information not only in the English language, for which it is more abundant, but also in other languages, such as Portuguese, Spanish and French. We present the first freely available parallel corpus of scientific publications for the biomedical domain. Documents from the “Biological Sciences” and “Health Sciences” categories were retrieved from the Scielo database and parallel titles and abstracts are available for the following language pairs: Portuguese/English (about 86,000 documents in total), Spanish/English (about 95,000 documents) and French/English (about 2,000 documents). Additionally, monolingual data was also collected for all four languages. Sentences in the parallel corpus were automatically aligned and a manual analysis of 200 documents by native experts found that a minimum of 79% of sentences were correctly aligned in all language pairs. We demonstrate the utility of the corpus by running baseline machine translation experiments. We show that for all language pairs, a statistical machine translation system trained on the parallel corpora achieves performance that rivals or exceeds the state of the art in the biomedical domain. Furthermore, the corpora are currently being used in the biomedical task in the First Conference on Machine Translation (WMT’16).

Keywords:

parallel corpus, biomedicine, machine translation

BibTeX file

@inproceedings{NEVES16.800,
author = { Mariana Neves and Antonio Jimeno Yepes and Aurélie Névéol },
title = { The Scielo Corpus: a Parallel Corpus of Scientific Publications for Biomedicine },
year = { 2016 },
month = may,
abstract = { The biomedical scientific literature is a rich source of information not only in the English language, for which it is more abundant, but also in other languages, such as Portuguese, Spanish and French. We present the first freely available parallel corpus of scientific publications for the biomedical domain. Documents from the ``Biological Sciences'' and ``Health Sciences'' categories were retrieved from the Scielo database and parallel titles and abstracts are available for the following language pairs: Portuguese/English (about 86,000 documents in total), Spanish/English (about 95,000 documents) and French/English (about 2,000 documents). Additionally, monolingual data was also collected for all four languages. Sentences in the parallel corpus were automatically aligned and a manual analysis of 200 documents by native experts found that a minimum of 79\% of sentences were correctly aligned in all language pairs. We demonstrate the utility of the corpus by running baseline machine translation experiments. We show that for all language pairs, a statistical machine translation system trained on the parallel corpora achieves performance that rivals or exceeds the state of the art in the biomedical domain. Furthermore, the corpora are currently being used in the biomedical task in the First Conference on Machine Translation (WMT'16). },
keywords = { parallel corpus, biomedicine, machine translation },
publisher = { European Language Resources Association (ELRA) },
booktitle = { Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC 2016) },
isbn = { 978-2-9517408-9-1 },
priority = { 0 }
}


last change: Fri, 27 May 2016 13:56:23 +0200