Hasso-Plattner-Institut
Prof. Dr. Felix Naumann
 

Multifaceted Domain-Specific Document Embeddings

This page corresponds to our demo paper titled "Multifaceted Domain-Specific Document Embeddings" by Julian Risch, Philipp Hager, and Ralf Krestel. It has been accepted for presentation at NAACL'21 and is based on a Master's thesis by Philipp Hager.

Demo

To show the practical feasibility of our approach, we implemented a demo, which can be accessed here. Our source code and the evaluation datasets are available on GitHub and a screencast is on YouTube.

Abstract

Word and document embeddings are Natural Language Processing (NLP) techniques that map words and documents to fixed-length numerical vectors in an embedding space. Current embedding algorithms work well when trained on large text corpora, but they fail to produce high-quality vectors when given only a small number of documents or when confronted with uncommon terms, as is often the case in specialized domains. In addition, it is common to blend an entire document into a single embedding vector, which makes it hard to find documents relating only to a specific piece of information or to explain why two documents are considered similar. In this work, we propose a novel approach to train document embeddings for domain-specific texts. We use a siamese neural network architecture in combination with knowledge graphs to train document embeddings on a small number of training examples from the medical domain. The model identifies different types of domain knowledge and encodes them into separate dimensions of our embedding, thereby enabling multiple ways of finding and comparing related documents in vector space. We evaluate our approach on medical journal articles. An interactive demo, our source code, and the evaluation datasets are available online: https://hpi.de/naumann//s/multifaceted-embeddings.
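To illustrate what encoding different types of domain knowledge into separate dimensions makes possible, the following minimal sketch compares two documents one facet at a time. It is not the implementation from the paper: the facet names (`disease`, `drug`, `procedure`), the slice boundaries, and the toy vectors are all hypothetical and chosen only for illustration. The sketch assumes that each facet occupies a contiguous block of the embedding vector.

```python
import math

# Hypothetical facet layout: each facet owns a contiguous slice of the vector.
# Names and sizes are assumptions for illustration, not taken from the paper.
FACETS = {
    "disease": slice(0, 3),
    "drug": slice(3, 6),
    "procedure": slice(6, 9),
}


def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def facet_similarities(doc_a, doc_b, facets=FACETS):
    """Return one cosine similarity per facet instead of a single blended score."""
    return {name: cosine(doc_a[s], doc_b[s]) for name, s in facets.items()}


# Toy embeddings: the two documents agree on the disease and procedure facets
# but differ on the drug facet.
doc_a = [1.0, 0.0, 0.0, 0.0, 1.0, 0.0, 1.0, 1.0, 0.0]
doc_b = [1.0, 0.0, 0.0, 1.0, 0.0, 0.0, 1.0, 1.0, 0.0]

for name, score in facet_similarities(doc_a, doc_b).items():
    print(f"{name}: {score:.2f}")
```

A per-facet score of this kind supports both uses named in the abstract: restricting a search to a single facet, and explaining a match by showing which facets two documents share.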

Project-Related Publications

  1. Risch, J., Alder, N., Hewel, C., Krestel, R.: PatentMatch: A Dataset for Matching Patent Claims & Prior Art. Proceedings of the 2nd Workshop on Patent Text Mining and Semantic Technologies (PatentSemTech@SIGIR) (2021).
  2. Risch, J., Alder, N., Hewel, C., Krestel, R.: PatentMatch: A Dataset for Matching Patent Claims with Prior Art. ArXiv e-prints 2012.13919 (2020).
  3. Risch, J., Garda, S., Krestel, R.: Hierarchical Document Classification as a Sequence Generation Task. Proceedings of the Joint Conference on Digital Libraries (JCDL). pp. 147–155 (2020).
  4. Risch, J., Krestel, R.: Domain-specific word embeddings for patent classification. Data Technologies and Applications. 53, 108–122 (2019).
  5. Risch, J., Krestel, R.: Learning Patent Speak: Investigating Domain-Specific Word Embeddings. Proceedings of the Thirteenth International Conference on Digital Information Management (ICDIM). pp. 63–68 (2018).