The current data deluge demands fast, real-time processing of large datasets, including textual data such as scientific publications. Natural language processing (NLP) is the field concerned with automatically processing textual documents and includes a variety of tasks such as tokenization (delimitation of words), part-of-speech tagging (assignment of syntactic categories to words), chunking (delimitation of phrases) and syntactic parsing (construction of a syntactic tree for a sentence). NLP also involves semantics-related tasks such as named-entity recognition (delimitation of predefined entity types, e.g., person and organization names), relation extraction (identification of predefined relations in text) and semantic role labeling (determination of predefined semantic arguments). Processing and semantically annotating large textual collections is a time-consuming task that requires the integration of various tools. In-memory database (IMDB) technology offers an alternative, given its built-in text analysis and machine learning components and its ability to process large document collections in real time.
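As a minimal illustration of the first NLP task mentioned above, the sketch below implements a naive regex-based tokenizer; real toolkits handle many more cases (abbreviations, hyphenation, Unicode), so this is only a toy example:

```python
import re

def tokenize(text):
    # Naive tokenizer: a token is either a run of word characters
    # or a single punctuation mark. Illustrative only.
    return re.findall(r"\w+|[^\w\s]", text)

print(tokenize("Dr. Smith works at SAP SE."))
# ['Dr', '.', 'Smith', 'works', 'at', 'SAP', 'SE', '.']
```

Downstream tasks such as part-of-speech tagging and chunking would then operate on this token sequence.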
Data curation is one application of NLP and consists of developing a text mining pipeline for the automatic extraction of predefined data from textual documents. For instance, biological databases need to extract precise data from the scientific literature according to an existing template. Given a predefined template and a corresponding list of terminologies, a text mining application can automatically fill in the slots with the required information. A text mining pipeline usually includes three main components: (a) triage of relevant documents, (b) named-entity recognition, and (c) slot filling or relationship extraction. These tasks usually rely on machine learning methods, based either on a corpus of annotated documents (supervised or semi-supervised learning), on a set of previously curated data (distant supervision), or on no annotated data at all (unsupervised learning).
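To make the pipeline components concrete, here is a hedged toy sketch of dictionary-based named-entity recognition feeding a slot-filling step. The term lists (`GENE_TERMS`, `DISEASE_TERMS`) and the (gene, disease) template are purely illustrative assumptions; a real system would use trained models and domain terminologies:

```python
# Illustrative terminology lists (assumptions, not real curated resources).
GENE_TERMS = {"TP53", "BRCA1"}
DISEASE_TERMS = {"cancer", "melanoma"}

def tag_entities(tokens):
    """Label each token with a coarse entity type (None if no match)."""
    labels = []
    for tok in tokens:
        if tok in GENE_TERMS:
            labels.append("GENE")
        elif tok.lower() in DISEASE_TERMS:
            labels.append("DISEASE")
        else:
            labels.append(None)
    return labels

def fill_template(tokens):
    """Fill a predefined (gene, disease) template from tagged tokens."""
    labels = tag_entities(tokens)
    template = {"gene": None, "disease": None}
    for tok, lab in zip(tokens, labels):
        if lab == "GENE" and template["gene"] is None:
            template["gene"] = tok
        elif lab == "DISEASE" and template["disease"] is None:
            template["disease"] = tok
    return template

print(fill_template("TP53 mutations are frequent in melanoma".split()))
# {'gene': 'TP53', 'disease': 'melanoma'}
```

Document triage, the pipeline's first component, would run before this step to discard documents unlikely to contain template-relevant entities.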
- Build a text mining pipeline integrating the triage, named-entity recognition and relationship extraction tasks
- Develop a Web application to interact with the text mining pipeline
- Apply the system to large collections of documents, such as PubMed, a database of biomedical publications
- Evaluate the system on curation of data for external partners, e.g., cancer research
The project will be executed in cooperation with SAP SE and potentially further external partners. We expect knowledge exchange and mutual visits with the partners.
Participants should have knowledge of SQL, of at least one programming language (preferably C++, Python or Java) and of Web development, as well as interest in database technologies, machine learning and natural language processing.