The goal of this Master's project is to develop and implement novel concepts for a Question Answering system for the biomedical domain. Question Answering (QA) is one of the more complex applications of Natural Language Processing (NLP). It consists of processing questions posed in natural language, instead of the usual keyword-based queries, and returning exact answers, instead of potentially relevant documents. NLP includes a variety of tasks such as tokenization (delimitation of words), part-of-speech tagging (assignment of syntactic categories to words), chunking (delimitation of phrases) and syntactic parsing (building a syntactic structure for a sentence). It also involves semantic tasks such as named-entity recognition (delimitation of predefined entity types, e.g., person and organization names), relation extraction (identification of predefined relations in text) and semantic role labeling (assignment of predefined semantic roles to phrases).
QA systems integrate many of these NLP components across their three main steps:
- question processing: processing of questions and construction of queries;
- passage retrieval: retrieval of sentences or short text passages relevant to the question and based on the derived query;
- answer processing: extraction of the exact answer(s) and/or construction of summaries for the given question.
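The three steps above can be illustrated with a deliberately simplified Python sketch. It is not the system described in this announcement: the stopword list, the term-overlap ranking, the frequency-based answer selection, the toy passages and all function names are assumptions chosen only to make the flow of data between the steps concrete. A real system would replace each function with proper NLP components (e.g., part-of-speech tagging, named-entity recognition).

```python
import re
from collections import Counter

# Hypothetical stopword list, for illustration only.
STOPWORDS = {"what", "which", "is", "are", "the", "a", "an", "of",
             "in", "by", "for", "to", "does", "do"}

# Toy passages standing in for a document collection (assumption).
PASSAGES = [
    "Cystic fibrosis is caused by mutations in the CFTR gene.",
    "Insulin is a hormone produced by the pancreas.",
    "The CFTR gene is mutated in most patients with cystic fibrosis.",
]


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def process_question(question):
    """Question processing: reduce the question to a list of query terms."""
    return [t for t in tokenize(question) if t not in STOPWORDS]


def retrieve_passages(query, passages, top_k=2):
    """Passage retrieval: rank passages by how many query terms they contain."""
    scored = []
    for passage in passages:
        tokens = set(tokenize(passage))
        score = sum(1 for term in query if term in tokens)
        if score > 0:
            scored.append((score, passage))
    # Stable sort: passages with equal scores keep their original order.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [passage for _, passage in scored[:top_k]]


def extract_answer(query, passages):
    """Answer processing: pick the most frequent token in the retrieved
    passages that is neither a query term nor a stopword (a crude stand-in
    for real answer extraction)."""
    counts = Counter()
    for passage in passages:
        for token in tokenize(passage):
            if token not in query and token not in STOPWORDS:
                counts[token] += 1
    return counts.most_common(1)[0][0] if counts else None


def answer(question, passages):
    """Run the three steps in sequence."""
    query = process_question(question)
    retrieved = retrieve_passages(query, passages)
    return extract_answer(query, retrieved)


if __name__ == "__main__":
    print(answer("Which gene is mutated in cystic fibrosis?", PASSAGES))
```

Each step consumes only the output of the previous one, which is why the cost of the individual NLP components adds up directly in the overall response time.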
The current deluge of textual data, e.g., scientific publications, Web pages or social media messages, demands fast, real-time processing to support various NLP applications, especially QA. There are currently three QA systems for biomedicine, but none of them provides fast and reliable answers to users. A recent comparison of their results and response times for 40 randomly selected questions from the EU-funded BioASQ dataset shows that a correct answer was returned for only 5 of these questions, even when the answers returned by all three systems were merged. Further, the average response time varied from 10 seconds to a maximum of 100 seconds, after which no answer was returned for many of the questions.
Building real-time applications that integrate many of these NLP components is a challenge, as these are time-consuming processes. In-memory database (IMDB) technology offers an alternative, given its ability to process large document collections in real time. Our system runs on top of the SAP HANA database and ranked first for passage retrieval in the last edition of the BioASQ challenge.
Project goals
- Participate in the development of a Question Answering system for the biomedical domain
- Implement new NLP functionalities in SAP HANA database
- Adapt current NLP features in SAP HANA to the biomedical domain
- Evaluate the application on the BioASQ dataset
Technology and skills
Participants should have knowledge of SQL and at least one programming language (Python, Java, preferably C++), and be interested in database technologies (in-memory databases, stored procedures), natural language processing and machine learning (supervised and semi-supervised learning). No previous knowledge of biomedicine is necessary; domain knowledge will be integrated through the use of available resources (ontologies, dictionaries, corpora, etc.).