
PatentMatch: A Dataset for Matching Patent Claims & Prior Art

This page provides the dataset and source code corresponding to the paper "PatentMatch: A Dataset for Matching Patent Claims & Prior Art" by Julian Risch, Nicolas Alder, Christoph Hewel and Ralf Krestel. It is published on arxiv.org and in the proceedings of the PatentSemTech workshop co-located with SIGIR 2021.

The source code and a detailed description of the data collection process are available on GitHub: https://github.com/julian-risch/PatentMatch

A text-pair classification example that uses the FARM framework to fine-tune a BERT model on the dataset is also available on GitHub: https://github.com/julian-risch/PatentMatch-FARM

The PatentMatch dataset is split into training and test set files; we recommend cross-validation. The following files are available for download (1.9 GB):

  • Training
  • Test
  • Training balanced
  • Test balanced
  • Training ultra balanced
  • Test ultra balanced
  • DPR training
  • DPR test
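
Since cross-validation is recommended, the following is a minimal sketch of a k-fold split over labeled pairs, using only the Python standard library. The record layout (claim text, passage text, binary label) and the function name `k_fold_splits` are assumptions made for illustration; they do not describe the actual column layout of the downloadable files.

```python
import random
from typing import Iterator, List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def k_fold_splits(
    items: Sequence[T], k: int = 5, seed: int = 0
) -> Iterator[Tuple[List[T], List[T]]]:
    """Yield one (train, validation) pair of lists per fold."""
    if k < 2 or k > len(items):
        raise ValueError("k must be between 2 and len(items)")
    # Shuffle indices once with a fixed seed so the folds are reproducible.
    indices = list(range(len(items)))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices round-robin into k folds.
    folds = [indices[i::k] for i in range(k)]
    for held_out in range(k):
        val_idx = set(folds[held_out])
        train = [items[i] for i in indices if i not in val_idx]
        val = [items[i] for i in folds[held_out]]
        yield train, val


# Hypothetical toy records: (claim text, cited passage, label).
pairs = [(f"claim {i}", f"passage {i}", i % 2) for i in range(10)]
splits = list(k_fold_splits(pairs, k=5))
```

Each record appears in exactly one validation fold, so averaging a metric over the folds uses every pair once for evaluation.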


Patent examiners need to solve a complex information retrieval task when they assess the novelty and inventive step of claims made in a patent application. Given a claim, they search for prior art, which comprises all relevant publicly available information. This time-consuming task requires a deep understanding of the respective technical domain and of patent-domain-specific language. For these reasons, we address the computer-assisted search for prior art by creating a training dataset for supervised machine learning called PatentMatch. It contains pairs of claims from patent applications and text passages from cited patent documents, with varying degrees of semantic correspondence. Each pair has been labeled by technically skilled patent examiners from the European Patent Office. The label indicates the degree of semantic correspondence (matching), i.e., whether or not the text passage is prejudicial to the novelty of the claimed invention. Preliminary experiments using a baseline system show that PatentMatch can indeed be used for training a binary text pair classifier on this challenging information retrieval task. The dataset is available online: https://hpi.de/naumann/s/patentmatch
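
To make the binary text-pair classification setting concrete, here is a minimal sketch that represents a claim/passage pair and scores it with a token-overlap (Jaccard) heuristic. This is not the baseline system from the paper, which fine-tunes a BERT model; it is only a toy stand-in. The class `ClaimPassagePair`, the label encoding, the threshold value, and the example texts are all assumptions introduced for illustration.

```python
import re
from dataclasses import dataclass
from typing import List, Set


@dataclass(frozen=True)
class ClaimPassagePair:
    claim: str
    passage: str
    label: int  # assumed encoding: 1 = prejudicial to novelty, 0 = not


def tokenize(text: str) -> Set[str]:
    """Lowercase the text and return its set of alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def jaccard(a: str, b: str) -> float:
    """Token-set overlap between two texts, in the range [0, 1]."""
    ta, tb = tokenize(a), tokenize(b)
    if not ta and not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


def predict(pair: ClaimPassagePair, threshold: float = 0.3) -> int:
    """Predict 1 if the overlap reaches the (arbitrary) threshold."""
    return 1 if jaccard(pair.claim, pair.passage) >= threshold else 0


def accuracy(pairs: List[ClaimPassagePair], threshold: float = 0.3) -> float:
    """Fraction of pairs whose prediction equals the examiner label."""
    correct = sum(predict(p, threshold) == p.label for p in pairs)
    return correct / len(pairs)


# Invented toy examples, not taken from the dataset.
examples = [
    ClaimPassagePair(
        "A valve comprising a spring and a seal",
        "The valve comprises a spring and a rubber seal",
        1,
    ),
    ClaimPassagePair(
        "A valve comprising a spring and a seal",
        "The battery is charged by a solar panel",
        0,
    ),
]
toy_accuracy = accuracy(examples)
```

Lexical overlap ignores paraphrase and patent-specific wording, which is exactly why the task calls for a learned model; the sketch only fixes the input and output shape of such a classifier.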

Project-Related Publications

  • 1. Risch, J., Alder, N., Hewel, C., Krestel, R.: PatentMatch: A Dataset for Matching Patent Claims & Prior Art. Proceedings of the 2nd Workshop on Patent Text Mining and Semantic Technologies (PatentSemTech@SIGIR) (2021).
  • 2. Risch, J., Alder, N., Hewel, C., Krestel, R.: PatentMatch: A Dataset for Matching Patent Claims with Prior Art. arXiv e-prints 2012.13919 (2020).
  • 3. Risch, J., Garda, S., Krestel, R.: Hierarchical Document Classification as a Sequence Generation Task. Proceedings of the Joint Conference on Digital Libraries (JCDL). pp. 147–155 (2020).
  • 4. Risch, J., Krestel, R.: Domain-specific word embeddings for patent classification. Data Technologies and Applications. 53, 108–122 (2019).
  • 5. Risch, J., Krestel, R.: Learning Patent Speak: Investigating Domain-Specific Word Embeddings. Proceedings of the Thirteenth International Conference on Digital Information Management (ICDIM). pp. 63–68 (2018).