Multifaceted Domain-Specific Document Embeddings

This page corresponds to our demo paper titled "Multifaceted Domain-Specific Document Embeddings" by Julian Risch, Philipp Hager, and Ralf Krestel. It has been accepted for presentation at NAACL'21 and is based on a Master's thesis by Philipp Hager.

Demo

To show the practical feasibility of our approach, we implemented an interactive online demo. Our source code and the evaluation datasets are available on GitHub, and a screencast is available on YouTube.

Abstract

Word and document embeddings are Natural Language Processing (NLP) techniques that map words and documents to fixed-length numerical vectors in an embedding space. Current embedding algorithms work well when trained on large text corpora, but fail to produce high-quality vectors when given only a small number of documents or when confronted with uncommon terms, as is often the case in specialized domains. In addition, it is common to blend the entire document into a single embedding vector, which makes it hard to find documents relating only to a specific piece of information or to explain why two documents are considered similar. In this work, we propose a novel approach to train document embeddings for domain-specific texts. We use a siamese neural network architecture in combination with knowledge graphs to train document embeddings on a small number of training examples from the medical domain. The model identifies different types of domain knowledge and encodes them into separate dimensions of our embedding, thereby enabling multiple ways of finding and comparing related documents in vector space. We evaluate our approach on medical journal articles. An interactive demo, our source code, and the evaluation datasets are available online: https://hpi.de/naumann//s/multifaceted-embeddings.
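
To illustrate what encoding different types of domain knowledge into separate dimensions makes possible, the following minimal sketch compares two documents facet by facet instead of with a single overall score. It is not the implementation from the paper: the facet names ("disease", "drug"), the slice layout, the toy vectors, and the function names are all assumptions chosen for illustration.

```python
import math

# Hypothetical facet layout: each facet owns a contiguous slice of the
# document vector. Names and sizes are illustrative assumptions only.
FACETS = {"disease": slice(0, 3), "drug": slice(3, 6)}


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def facet_similarities(doc_a, doc_b, facets=FACETS):
    """Compare two document vectors separately within each facet's dimensions."""
    return {name: cosine(doc_a[dims], doc_b[dims]) for name, dims in facets.items()}


# Toy vectors: identical in the "disease" dimensions, orthogonal in "drug".
doc_a = [1.0, 0.0, 0.0, 0.0, 1.0, 0.0]
doc_b = [1.0, 0.0, 0.0, 1.0, 0.0, 0.0]

sims = facet_similarities(doc_a, doc_b)
overall = cosine(doc_a, doc_b)
```

In this toy case the overall similarity is moderate, while the per-facet scores show that the two documents agree on one facet and differ on the other. A single blended vector cannot provide this kind of explanation.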

Project-Related Publications

Risch, J., Alder, N., Hewel, C., Krestel, R.: PatentMatch: A Dataset for Matching Patent Claims & Prior Art. Proceedings of the 2nd Workshop on Patent Text Mining and Semantic Technologies (PatentSemTech@SIGIR) (2021).

Risch, J., Alder, N., Hewel, C., Krestel, R.: PatentMatch: A Dataset for Matching Patent Claims with Prior Art. arXiv e-prints 2012.13919 (2020).

Risch, J., Garda, S., Krestel, R.: Hierarchical Document Classification as a Sequence Generation Task. Proceedings of the Joint Conference on Digital Libraries (JCDL). pp. 147–155 (2020).

Risch, J., Krestel, R.: Domain-specific word embeddings for patent classification. Data Technologies and Applications. 53, 108–122 (2019).

Risch, J., Krestel, R.: Learning Patent Speak: Investigating Domain-Specific Word Embeddings. Proceedings of the Thirteenth International Conference on Digital Information Management (ICDIM). pp. 63–68 (2018).
