Master Project HP/N: Learning to Note

Intelligent Support for Document Annotation using Semi-Supervised Learning

General information

Overall responsibility: Dr. Mariana Neves, Dr. Ralf Krestel, Prof. Felix Naumann, Dr. Matthias Uflacker
Kick-off meeting: September 30th, Villa, Campus D

Motivation

The goal of this master project is to develop a system to support manual annotation of documents and linking of entities to database records. Manual annotation of textual documents is often necessary for building corpora to support training and evaluation of natural language processing applications. For instance, corpora have been developed for the extraction of a variety of entities, e.g., genes/proteins, as well as relationships, e.g., protein-protein interactions. Although there are many tools for document annotation [2], they do not suggest pre-annotations based on text mining and machine learning and do not provide real-time learning.

Curation tools support the extraction of data on a given topic from text collections [1]. For instance, biological databases need precise information extracted from publications, which is then stored in the database and made available to users via a Web interface. This is a time-consuming and complex task that requires careful reading of many publications.

For performance reasons, the tool will be built on top of the SAP HANA in-memory database, given its potential for processing large datasets in real time and its built-in text analysis functionality. Users will interact with the system by uploading a document or a collection of documents. The system will include a text mining pipeline for automatic processing of documents and suggestion of annotations. This pipeline will contain the following components: recognition of pre-defined entity types and extraction of pre-defined relationships between two or more entity types.
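The two pipeline components can be illustrated with a minimal sketch. It does not use SAP HANA's text analysis. As stand-ins it assumes a small hand-written dictionary for entity recognition and plain sentence co-occurrence for relationship extraction. All names (ENTITY_DICTIONARY, Entity, recognize_entities, extract_relations, suggest_annotations) and the "co-occurs-with" label are hypothetical and chosen only for illustration.

```python
import re
from dataclasses import dataclass
from itertools import combinations

# Hypothetical dictionary of pre-defined entity types. A real system would
# rely on a trained recognizer or the database's text analysis instead.
ENTITY_DICTIONARY = {
    "BRCA1": "Protein",
    "TP53": "Protein",
    "RAD51": "Protein",
}


@dataclass(frozen=True)
class Entity:
    text: str
    type: str
    start: int  # character offset where the mention begins
    end: int    # character offset just past the mention


def recognize_entities(document):
    """Return dictionary matches as suggested annotations, ordered by position."""
    entities = []
    for term, entity_type in ENTITY_DICTIONARY.items():
        for match in re.finditer(r"\b" + re.escape(term) + r"\b", document):
            entities.append(Entity(term, entity_type, match.start(), match.end()))
    return sorted(entities, key=lambda entity: entity.start)


def extract_relations(document, entities):
    """Suggest a candidate relation for every entity pair in the same sentence."""
    relations = []
    # Naive sentence splitting on ., ! and ?; an assumption made for brevity.
    for sentence in re.finditer(r"[^.!?]+[.!?]?", document):
        in_sentence = [
            entity for entity in entities
            if sentence.start() <= entity.start and entity.end <= sentence.end()
        ]
        for first, second in combinations(in_sentence, 2):
            relations.append((first.text, "co-occurs-with", second.text))
    return relations


def suggest_annotations(document):
    """Run both components and return (entities, relations) as pre-annotations."""
    entities = recognize_entities(document)
    return entities, extract_relations(document, entities)
```

In the envisioned tool, suggestions like these would be shown to the annotator for confirmation or correction rather than stored directly.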

Further, ongoing annotations will be used for active learning of user preferences, for updating the predicted annotations, and for indicating which document to annotate next. This learning process will rely on existing machine learning algorithms implemented in the SAP HANA database, which will need to be adapted for on-line learning. Implementation of state-of-the-art on-line learning algorithms will also be considered.
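The interplay of on-line learning and active learning can be sketched as follows. This is not one of the algorithms shipped with SAP HANA. As a stand-in it assumes a logistic regression over bag-of-words features, updated by one stochastic gradient step per annotation, with uncertainty sampling to choose the next document. The class and function names are hypothetical.

```python
import math


class OnlineLogisticRegression:
    """Binary classifier that is updated one labeled example at a time."""

    def __init__(self, learning_rate=0.5):
        self.learning_rate = learning_rate
        self.weights = {}
        self.bias = 0.0

    @staticmethod
    def features(text):
        # Word-presence features; a deliberately simple assumption.
        return set(text.lower().split())

    def predict_proba(self, text):
        """Estimated probability that the text belongs to the positive class."""
        score = self.bias + sum(self.weights.get(t, 0.0) for t in self.features(text))
        score = max(-30.0, min(30.0, score))  # keep math.exp in a safe range
        return 1.0 / (1.0 + math.exp(-score))

    def update(self, text, label):
        """Apply one gradient step for a single annotation (label is 0 or 1)."""
        step = self.learning_rate * (label - self.predict_proba(text))
        self.bias += step
        for token in self.features(text):
            self.weights[token] = self.weights.get(token, 0.0) + step


def next_document(model, unlabeled):
    """Uncertainty sampling: pick the text whose prediction is closest to 0.5."""
    return min(unlabeled, key=lambda text: abs(model.predict_proba(text) - 0.5))
```

A possible annotation loop would call update each time the user confirms or rejects a suggestion and then call next_document on the remaining collection, so that the annotator is directed to the document the model is least certain about.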

Project Goals

Develop a Web application for annotation of documents and validation of data derived from text mining/machine learning
Build a text mining pipeline for integration of named-entity recognition and relationship extraction tasks
Evaluate the tool on benchmarks and for curation of real data
Submit a paper describing the system and/or the methods

Technology and Skills

Participants should have knowledge of SQL, of at least one programming language (preferably C++, Python or Java) and of Web development, as well as interest in database technologies, machine learning and natural language processing.

Slides

Kick-off meeting

Literature

"A Course in In-Memory Data Management" by Prof. Dr. h.c. Hasso Plattner. This book is the culmination of six years of in-memory research. It provides the technical foundation for running combined transactional and analytical workloads in a single database, as well as examples of new applications that the technology makes possible. The book is available from Springer.

Contact

Dr. Michael Perscheid

Chair Representative

Tel.: +49 (331) 5509-566

E-Mail: michael.perscheid(at)hpi.de

Office:

Room: V-2.12

Tel.: +49 (331) 5509-560

Fax: +49 (331) 5509-579

E-Mail: office-epic(at)hpi.de
