For bachelor students, we offer German-language lectures on database systems, complemented by paper- or project-oriented seminars. In a one-year bachelor project, students complete their studies in cooperation with external partners. For master students, we offer courses on information integration, data profiling, search engines, and information retrieval, complemented by specialized seminars, master projects, and supervised master theses.
Most of our research is conducted in the context of larger research projects, in collaboration across students, across groups, and across universities. We strive to make most of our data sets and source code available.
Art archives are a rich source of information for multiple reasons: proving the provenance of certain art pieces, facilitating research on art history, and understanding a particular artist with regard to the context of his or her work. These archives typically comprise various kinds of heterogeneous documents: auction catalogs, personal correspondence, books, exhibition catalogs, bills, certificates, studies, theses, etc. Many of these archives are not easily accessible because they have not yet been digitized. Even those that are available in digitized form are hard to explore with general text mining tools.
In this project, we aim to facilitate access to a large collection of art-related documents. To this end, we need to adapt standard NLP tools to the unique challenges of the art domain. The ultimate goal is to generate a knowledge graph that art historians can easily explore. The knowledge graph would also serve as a backbone for semantic search functionality and for new ways to represent art entities, e.g. as embeddings in a high-dimensional space. Modern deep learning methods will be developed to manage and visualize large collections of art historical and scholarly documents.
Identifying titles of artworks as named entities is a complex task due to the challenges of this domain. Existing NER tools do not perform well on this task because domain-specific training data is scarce. In this project, we develop techniques to generate, in a semi-automatic manner, a large corpus of high-quality training data with annotations for artwork titles. Retraining existing NER tools on this training dataset shows considerable improvement over the baseline.
This work was presented at the TPDL 2019 conference held in Oslo, Norway. A full version of the paper is available here.
Named entity recognition (NER) plays an important role in many information retrieval tasks, including automatic knowledge graph construction. Most NER systems are limited to a few common named entity types, such as person, location, and organization. However, for cultural heritage resources, such as art historical archives, the recognition of titles of artworks as named entities is of high importance. In this work, we focus on identifying mentions of artworks, e.g. paintings and sculptures, in historical archives. Current state-of-the-art NER tools are unable to adequately identify artwork titles due to the particular difficulties presented by this domain. The scarcity of training data for NER in the cultural heritage domain poses a further hindrance. To mitigate this, we propose a semi-supervised approach to create high-quality training data by leveraging existing cultural heritage resources. Our experimental evaluation shows significant improvement in NER performance for artwork titles compared to the baseline approach.
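To make the idea of creating training data from existing resources more concrete, the following is a minimal sketch of one common technique, dictionary-based annotation. It is an illustration under stated assumptions, not the method of the paper: it assumes a list of known artwork titles is available (for instance, exported from a cultural heritage resource) and marks their occurrences in raw text with BIO labels. The function names `tokenize` and `annotate_titles` and the label `ARTWORK` are hypothetical.

```python
import re


def tokenize(text):
    """Split text into word tokens and single punctuation tokens."""
    return re.findall(r"\w+|[^\w\s]", text)


def annotate_titles(text, titles):
    """Label tokens with BIO tags by longest-match lookup in a title list.

    Assumption: `titles` is a list of known artwork titles. Matching is
    case-insensitive and exact at the token level.
    """
    tokens = tokenize(text)
    lowered = [t.lower() for t in tokens]
    # Represent each known title as a tuple of lower-cased tokens.
    title_keys = {tuple(t.lower() for t in tokenize(title)) for title in titles}
    max_len = max((len(key) for key in title_keys), default=0)

    labels = ["O"] * len(tokens)
    i = 0
    while i < len(tokens):
        matched = 0
        # Try the longest candidate span first so nested titles do not win.
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            if tuple(lowered[i:i + n]) in title_keys:
                matched = n
                break
        if matched:
            labels[i] = "B-ARTWORK"
            for j in range(i + 1, i + matched):
                labels[j] = "I-ARTWORK"
            i += matched
        else:
            i += 1
    return list(zip(tokens, labels))


if __name__ == "__main__":
    sentence = "Van Gogh painted The Starry Night in 1889."
    known_titles = ["The Starry Night", "Sunflowers"]
    for token, label in annotate_titles(sentence, known_titles):
        print(f"{token}\t{label}")
```

Plain dictionary matching of this kind tends to produce noisy labels, since many artwork titles are ordinary words or phrases. Additional filtering or manual review would likely be needed before such annotations can serve as training data.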