Master Project 2015

General Information

Overall responsibility: Dr. Mariana Neves, Cindy Perscheid, Dr. Matthias Uflacker
Kick-off meeting: April 10th, 2015 (Friday), Villa, Campus D

The goal of this Master project is to develop and implement novel concepts for a Question Answering system for the biomedical domain. Question Answering (QA) is one of the more complex applications of Natural Language Processing (NLP): it processes questions posed in natural language, instead of the usual keyword-based queries, and returns exact answers, instead of a list of potentially relevant documents. NLP includes a variety of tasks such as tokenization (delimitation of words), part-of-speech tagging (assignment of syntactic categories to words), chunking (delimitation of phrases) and syntactic parsing (building a syntactic structure for a sentence). It also involves semantic tasks such as named-entity recognition (delimitation of mentions of predefined entity types, e.g., person and organization names), relation extraction (identification of predefined relations in text) and semantic role labeling (assignment of predefined semantic roles to phrases).
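To make two of these tasks concrete, the following is a minimal sketch of tokenization and dictionary-based named-entity recognition in plain Python. It is not the project's implementation: the entity dictionary, the entity type labels, and the function names `tokenize` and `recognize_entities` are hypothetical, and a real system would rely on trained models and curated resources rather than a three-entry lookup table.

```python
import re

# Hypothetical mini-dictionary (assumption for illustration only);
# a real system would load an ontology or a curated lexicon.
ENTITY_DICTIONARY = {
    "aspirin": "Drug",
    "brca1": "Gene",
    "breast cancer": "Disease",
}


def tokenize(text):
    """Tokenization: split text into word and punctuation tokens."""
    return re.findall(r"\w+|[^\w\s]", text)


def recognize_entities(tokens, dictionary=ENTITY_DICTIONARY, max_len=3):
    """Named-entity recognition by greedy longest-match dictionary lookup."""
    entities = []
    i = 0
    while i < len(tokens):
        # Try the longest n-gram starting at position i first.
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            candidate = " ".join(tokens[i:i + n])
            if candidate.lower() in dictionary:
                entities.append((candidate, dictionary[candidate.lower()]))
                i += n
                break
        else:
            # No dictionary entry starts at this token; move on.
            i += 1
    return entities


tokens = tokenize("Is BRCA1 linked to breast cancer?")
entities = recognize_entities(tokens)
```

With the example sentence, `tokens` contains seven tokens (the question mark is kept as its own token) and `entities` pairs "BRCA1" with the type Gene and "breast cancer" with the type Disease.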

QA systems involve integration of many of the NLP components in their three main steps as described below:

question processing: processing of questions and construction of queries;
passage retrieval: retrieval of sentences or short text passages relevant to the question and based on the derived query;
answer processing: extracting the exact answer(s) and/or building summaries for the provided question.
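The three steps above can be sketched as a toy pipeline. This is only an illustration under strong simplifying assumptions: the stopword list, the term-overlap scoring, and all function names are hypothetical, and answer processing is reduced to returning the best-ranked sentence rather than extracting an exact answer.

```python
import re

# Hypothetical stopword list (assumption for illustration only).
STOPWORDS = {"what", "which", "is", "are", "the", "a", "an",
             "of", "in", "to", "for", "does", "do"}


def process_question(question):
    """Question processing: reduce the question to a set of query terms."""
    tokens = re.findall(r"\w+", question.lower())
    return {t for t in tokens if t not in STOPWORDS}


def retrieve_passages(query_terms, documents, top_k=2):
    """Passage retrieval: rank sentences by how many query terms they contain."""
    scored = []
    for doc in documents:
        # Naive sentence split on whitespace after ., ! or ?
        for sentence in re.split(r"(?<=[.!?])\s+", doc):
            terms = set(re.findall(r"\w+", sentence.lower()))
            score = len(query_terms & terms)
            if score > 0:
                scored.append((score, sentence))
    # Stable sort: ties keep their original document order.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [sentence for _, sentence in scored[:top_k]]


def process_answer(passages):
    """Answer processing: here simply the best passage, or None if none matched."""
    return passages[0] if passages else None


def answer(question, documents):
    """Run the three steps in sequence."""
    query_terms = process_question(question)
    passages = retrieve_passages(query_terms, documents)
    return process_answer(passages)


documents = [
    "Aspirin inhibits cyclooxygenase. It is widely used.",
    "Metformin is used to treat type 2 diabetes.",
]
result = answer("Which enzyme does aspirin inhibit?", documents)
```

Here `result` is the sentence "Aspirin inhibits cyclooxygenase.", because it is the only sentence sharing a query term with the question. Note that "inhibit" does not match "inhibits": exact string overlap is one reason real systems add NLP components such as lemmatization and named-entity recognition to these steps.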

The current deluge of textual data, e.g., scientific publications, Web pages and social media messages, demands fast, real-time processing to support various NLP applications, especially QA. There are currently three QA systems for biomedicine, but none of them provides fast and reliable answers to users. A recent comparison of their results and response times for 40 randomly selected questions from the EU-funded BioASQ dataset shows that a correct answer was returned for only 5 of these questions, even when the answers returned by all three systems were merged. Further, the average response time varied from 10 seconds to a maximum of 100 seconds, after which no answer was returned for many of the questions.

Building real-time applications that integrate many of these NLP components is a challenge because these are time-consuming processes. In-memory database (IMDB) technology offers an alternative, given its ability to process large document collections in real time. Our system runs on top of the SAP HANA database and ranked first for passage retrieval in the last edition of the BioASQ challenge.

Project goals

Participate in the development of a Question Answering system for the biomedical domain
Implement new NLP functionality in the SAP HANA database
Adapt current NLP features in SAP HANA to the biomedical domain
Evaluate the application on the BioASQ dataset

Technology and skills

Participants should have knowledge of SQL and at least one programming language (Python, Java, preferably C++), and be interested in database technologies (in-memory databases, stored procedures), natural language processing and machine learning (supervised and semi-supervised learning). No previous knowledge of biomedicine is necessary; domain knowledge will be integrated through the use of available resources (ontologies, dictionaries, corpora, etc.).

Resources

GENIA corpus: tokenization, part-of-speech tagging and chunking
TweetNLP corpus: part-of-speech tagging
CoNLL-2000 corpus: chunking
CoNLL-2002/CoNLL-2005 corpus: semantic role labeling
BioProp corpus: semantic role labeling (description of predicates)
BioSmile tool/demo: semantic role labeling
BioASQ dataset: question answering
BioC project
WBI repository of biomedical corpora
Brat annotation tool (embedding brat)

Slides

Kick-off meeting
