In-Memory Natural Language Processing

The current data deluge demands fast, real-time processing of large datasets to support various applications. This also holds for textual data, such as scientific publications. Natural language processing (NLP) is the field concerned with automatically processing textual documents. Processing and semantically annotating large text collections is a time-consuming and tedious task that requires the integration of various tools. In-memory database (IMDB) technology offers an alternative, given its ability to process large document collections in real time.

Contact: Dr. Mariana Neves

Research Area: In-Memory Data Management for Life Sciences

We have an open position for students. Please contact us!

Applications

Olelo is our NLP platform and integrates various NLP-related tasks for the biomedical domain.

Projects

NLP includes a variety of tasks such as tokenization (delimitation of words), part-of-speech tagging (assignment of syntactic categories to words), chunking (delimitation of phrases) and syntactic parsing (construction of a syntactic tree for a sentence). NLP also involves semantic tasks such as named-entity recognition (delimitation of predefined entity types, e.g., person and organization names), relation extraction (identification of predefined relations in text) and semantic role labeling (determination of predefined semantic arguments). We have implemented many NLP methods in the SAP HANA database, as follows:

Chunking/shallow parsing and semantic role labeling (MP ss2015)
Named-entity recognition (BP 2015/2016)
Relation extraction (MP ws2015/2016)
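
As an illustration of two of the tasks described above, the following minimal sketch tokenizes a sentence with a regular expression and recognizes named entities by dictionary lookup. It is not the SAP HANA implementation from the projects listed here. The lexicon, the entity types and the function names are hypothetical and chosen only for this example.

```python
import re

# Hypothetical toy lexicon (assumption for this sketch); a real system would
# rely on trained models or curated biomedical vocabularies.
ENTITY_LEXICON = {
    "aspirin": "Chemical",
    "brca1": "Gene",
    "breast cancer": "Disease",
}


def tokenize(text):
    """Split text into (token, start, end) triples using a simple regex."""
    return [(m.group(), m.start(), m.end())
            for m in re.finditer(r"\w+|[^\w\s]", text)]


def recognize_entities(text, lexicon=ENTITY_LEXICON, max_len=3):
    """Greedy longest-match dictionary lookup over token n-grams."""
    tokens = tokenize(text)
    entities = []
    i = 0
    while i < len(tokens):
        matched = False
        # Try the longest candidate first, then shorter ones.
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            start = tokens[i][1]
            end = tokens[i + n - 1][2]
            candidate = text[start:end].lower()
            if candidate in lexicon:
                entities.append((text[start:end], start, end, lexicon[candidate]))
                i += n
                matched = True
                break
        if not matched:
            i += 1
    return entities


if __name__ == "__main__":
    sentence = "BRCA1 mutations increase the risk of breast cancer."
    print(tokenize(sentence)[:3])
    print(recognize_entities(sentence))
```

Each entity is returned with its character offsets, so the annotations can be stored separately from the text they refer to.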

Many NLP applications can be developed for various scenarios and domains, such as automatically generating summaries of one or more documents (summarization), retrieving documents relevant to a particular query (information retrieval), extracting specific information from a large document collection (information extraction) and automatically answering questions posed by users (question answering). We have developed NLP methods and applications for many of these tasks, as follows:

Deep learning to extract exact answers (Master thesis Georg Wiese)
Semantic role labeling to support question answering (Master thesis Fabian Eckert)
Olelo: intelligent navigation through the biomedical scientific literature (BP 2015/2016)
TextAI: intelligent annotation tool (MP ws2015/2016)
Generation of summaries for question answering system (MP ss2015 & Master thesis Frederik Schulze)
Generation of summaries for genes (Master thesis Frederik Schulze) (check our #GeneOfTheWeek summaries)
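
To make the information retrieval scenario above concrete, the sketch below ranks a small document collection against a query using a plain TF-IDF score. It is a generic textbook technique written for illustration, not the method used in Olelo or in the theses listed above. The function names and the sample documents are assumptions.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and return its word tokens."""
    return re.findall(r"\w+", text.lower())


def rank_documents(query, documents):
    """Return (document index, score) pairs, highest TF-IDF score first."""
    doc_tokens = [tokenize(doc) for doc in documents]
    n_docs = len(documents)

    # Document frequency: in how many documents does each term occur?
    df = Counter()
    for tokens in doc_tokens:
        df.update(set(tokens))

    scores = []
    for idx, tokens in enumerate(doc_tokens):
        tf = Counter(tokens)
        score = 0.0
        for term in tokenize(query):
            if term in tf:
                # Smoothed inverse document frequency.
                idf = math.log((1 + n_docs) / (1 + df[term])) + 1.0
                score += (tf[term] / len(tokens)) * idf
        scores.append((idx, score))
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    docs = [
        "Aspirin reduces the risk of heart attack.",
        "BRCA1 mutations are linked to breast cancer.",
        "Gene expression in yeast.",
    ]
    for idx, score in rank_documents("breast cancer gene", docs):
        print(f"{score:.3f}  {docs[idx]}")
```

A question answering pipeline would typically use such a ranking only as a first step, and then extract answers or build summaries from the top-ranked documents.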

Challenges and Shared Tasks

We evaluated our methods on challenges and shared tasks organized by the scientific community:

BeCalm TIPS 2017
BioASQ 2014, 2015, 2016, 2017: We were one of the winners of the 2015 and 2016 challenges (cf. images on the right)
CLEF eHealth 2014
i2b2 2014 "De-identification" and "Identifying risk factors for heart disease over time" (in the scope of the Seminar In-Memory Computing for Life Sciences)

Resources

We developed resources to support training and evaluation of NLP methods:

Biomedical translation corpora: collection of biomedical corpora
Corposaurus: directory of biomedical corpora
Scielo corpus for machine translation, available in the BioC format.
BioMedLAT corpus: Annotation of BioASQ questions with lexical answer type (LAT), available in the stand-off format of the brat annotation tool.
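
For readers unfamiliar with the brat stand-off format mentioned above, the following sketch reads text-bound annotation lines, which have the form ID, tab, type and character offsets, tab, covered text. The label "LAT" and the sample question are assumptions made for illustration; the actual label set of the BioMedLAT corpus is not shown here. Discontinuous spans are skipped in this sketch.

```python
from typing import List, NamedTuple


class TextBound(NamedTuple):
    ann_id: str
    label: str
    start: int
    end: int
    text: str


def parse_text_bound(ann_content: str) -> List[TextBound]:
    """Parse text-bound ("T") lines of a brat .ann file."""
    annotations = []
    for line in ann_content.splitlines():
        if not line.startswith("T"):
            continue  # relations, attributes, notes etc. are ignored here
        ann_id, type_and_span, covered = line.split("\t", 2)
        label, span = type_and_span.split(" ", 1)
        if ";" in span:
            continue  # discontinuous span, not handled in this sketch
        start, end = (int(value) for value in span.split())
        annotations.append(TextBound(ann_id, label, start, end, covered))
    return annotations


if __name__ == "__main__":
    # Hypothetical example: the question text and its annotation file content.
    question = "Which gene is mutated in cystic fibrosis?"
    ann = "T1\tLAT 6 10\tgene\n"
    for annotation in parse_text_bound(ann):
        # The offsets should point at the covered text in the question.
        print(annotation, question[annotation.start:annotation.end])
```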

Other activities

We are involved in the organization of various activities:

Biomedical translation task at the Conference on Machine Translation (WMT) in 2016 and 2017.
Guest editor for the "Semantic Mining of Languages in Biology and Medicine" special issue in the Journal of Biomedical Semantics (JBMS) (to appear).
7th International Symposium on Semantic Mining in Biomedicine (SMBM 2016)

Publications (since Oct/2013)

Wiese G, Weissenborn D and Neves M. Neural Question Answering at BioASQ 5B, Biomedical Natural Language Processing (BioNLP) Workshop at ACL'17, accepted, Vancouver, Canada.
Neves M, Eckert F, Folkerts H and Uflacker M. Assessing the performance of Olelo, a real-time biomedical question answering application, Biomedical Natural Language Processing (BioNLP) Workshop at ACL'17, accepted, Vancouver, Canada.
Kraus M, Niedermeier J, Jankrift M, Tietböhl S, Stachewicz T, Folkerts H, Uflacker M and Neves M. Olelo: a web application for intuitive exploration of biomedical literature, Nucleic Acids Research Web service issue.
Neves M. A parallel collection of clinical trials in Portuguese and English, 10th Workshop on Building and Using Comparable Corpora (BUCC) at ACL'17, accepted, Vancouver, Canada.
Neves M, Folkerts H, Jankrift M, Niedermeier J, Stachewicz T, Tietböhl S, Kraus M and Uflacker M. Olelo: A Question Answering Application for Biomedicine, ACL'17 Demo, Vancouver, Canada. (accepted)
Habibi M, Weber L, Neves M, Wiegandt D L and Leser U. Deep Learning with Word Embeddings improves Biomedical Named Entity Recognition, ISMB/ECCB 2017, Prague, Czech Republic. (accepted)
Folkerts H and Neves M. Olelo’s named-entity recognition web service in the BeCalm TIPS task, BeCalm Workshop 2017, Barcelona, Spain.
Nentidis A, Yang Z, Neves M, Kim J-D, Krithara A, Paliouras G and Kakadiaris I. BioASQ and PubAnnotation: Using linked annotations in biomedical question answering, BLAH3 workshop, 2017, Tokyo, Japan.
Neves M and Kraus M. BioMedLAT Corpus: Annotation of the Lexical Answer Type for Biomedical Questions, Open Knowledge Base and Question Answering Workshop, Coling 2016, Osaka, Japan.
Schulze F and Neves M. Entity-Supported Summarization of Biomedical Abstracts, Proceedings of the Fifth Workshop on Building and Evaluating Resources for Biomedical Text Mining, Coling 2016, Osaka, Japan.
Neves M, Rey M and Wittig U. Text Mining to Support Data Curation for SABIO-RK, BLAHmuc workshop, 2016, Munich, Germany.
Cohen K B, Demner-Fushman D, Fort K, Grouin C, Hunter L E, Leser U, Neveol A, Neves M and Zweigenbaum P. Towards the Last Annotation Tool, BLAHmuc workshop, 2016, Munich, Germany.
Bojar O, Chatterjee R, Federmann C, Graham Y, Haddow B, Huck M, Jimeno Yepes A, Koehn P, Logacheva V, Monz C, Negri M, Neveol A, Neves M, Popel M, Post M, Rubino R, Scarton C, Specia L, Turchi M, Verspoor K and Zampieri M. Findings of the 2016 Conference on Machine Translation, ACL 2016, Proceedings of the First Conference on Machine Translation (WMT16), pp. 131-198, 2016, Berlin, Germany.
Grundke M, Jasper J, Perchyk M, Sachse J P, Krestel R and Neves M. TextAI: Enhancing TextAE with Intelligent Annotation Support, 7th International Symposium on Semantic Mining in Biomedicine (SMBM), 2016, Potsdam, Germany.
Schulze F, Schüler R, Draeger T, Dummer D, Ernst A, Flemming P, Perscheid C and Neves M. Biomedical Question Answering Based on In-Memory Technology, ACL 2016, BioASQ Challenge, 2016, Berlin, Germany.
Neves M, Jimeno-Yepes A and Névéol A. The Scielo Corpus: a Parallel Corpus of Scientific Publications for Biomedicine, International Conference on Language Resources and Evaluation (LREC), 2016, Portoroz, Slovenia.
Neves M. HPI Question Answering System in the BioASQ 2015 Challenge, Working Notes for the CLEF BioASQ Challenge, 2015, Toulouse, France.
Neves M and Leser U. Question Answering for Biology, Methods, 2015.
Neves M. HPI in-memory-based database system in Task 2b of BioASQ, Working Notes for the CLEF BioASQ Challenge, 2014.
Herbst K, Fähnrich C, Neves M and Schapranow M-P. Applying In-Memory Technology for Automatic Template Filling in the Clinical Domain, CLEF 2014 Evaluation Labs and Workshop, Online Working Notes, 2014.
Neves M, Herbst K, Uflacker M and Plattner H. Preliminary evaluation of passage retrieval in biomedical multilingual question answering, BioTxtM 2014, Fourth Workshop on Building and Evaluating Resources for Health and Biomedical Text Processing, 2014.
Neves M. Preliminary evaluation of question answering to support biological curation, Poster in the BioCuration Conference (ISB2014), 2014, Toronto, Canada.
