Ralf Krestel, Julian Risch
Our paper "Learning Patent Speak: Investigating Domain-Specific Word Embeddings" has been accepted for presentation at the International Conference on Digital Information Management (ICDIM), which takes place in Berlin this year. The publication can be downloaded here. It extends our work on text classification of patent documents and focuses on word embeddings trained for this particular task. We trained word embeddings on a corpus of 38 billion tokens and made them publicly available here.
Learning Patent Speak: Investigating Domain-Specific Word Embeddings
A patent examiner needs domain-specific knowledge to classify a patent application according to its field of invention. Standardized classification schemes help to compare a patent application to previously granted patents and thereby check its novelty. Due to the large volume of patents, automatic patent classification would be highly beneficial to patent offices and other stakeholders in the patent domain. However, a challenge for the automation of this costly manual task is the patent-specific language use. To facilitate this task, we present domain-specific pre-trained word embeddings for the patent domain. We trained our model on a very large dataset of more than 5 million patents to learn the language use in this domain. We evaluated the quality of the resulting embeddings in the context of patent classification. To this end, we propose a deep learning approach based on gated recurrent units for automatic patent classification built on the trained word embeddings. Experiments on a standardized evaluation dataset show that our approach increases average precision for patent classification by 17 percent compared to state-of-the-art approaches.
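As a rough illustration of the kind of architecture the abstract describes, the following sketch runs the forward pass of a gated recurrent unit (GRU) classifier on top of pre-trained word embeddings. It is not the model from the paper. The class name `GRUClassifier`, the toy three-dimensional embeddings, the random untrained weights, the omitted bias terms, and the single-label softmax output are all assumptions made for illustration, and the code uses only the Python standard library so that each gate is visible.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def rand_matrix(rows, cols, rng):
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


class GRUClassifier:
    """Hypothetical, untrained GRU classifier (forward pass only, no biases)."""

    def __init__(self, embeddings, hidden_size, num_classes, seed=0):
        rng = random.Random(seed)
        self.embeddings = embeddings  # maps token -> embedding vector
        dim = len(next(iter(embeddings.values())))
        self.hidden_size = hidden_size
        self.unknown = [0.0] * dim  # assumption: zero vector for unseen tokens
        # Input (W) and recurrent (U) weights for the update gate "z",
        # the reset gate "r", and the candidate state "n".
        self.W = {g: rand_matrix(hidden_size, dim, rng) for g in "zrn"}
        self.U = {g: rand_matrix(hidden_size, hidden_size, rng) for g in "zrn"}
        self.out = rand_matrix(num_classes, hidden_size, rng)

    def step(self, x, h):
        """One GRU time step: combine input x with previous hidden state h."""
        z = [sigmoid(a + b) for a, b in
             zip(matvec(self.W["z"], x), matvec(self.U["z"], h))]
        r = [sigmoid(a + b) for a, b in
             zip(matvec(self.W["r"], x), matvec(self.U["r"], h))]
        reset_h = [ri * hi for ri, hi in zip(r, h)]
        n = [math.tanh(a + b) for a, b in
             zip(matvec(self.W["n"], x), matvec(self.U["n"], reset_h))]
        # The update gate interpolates between the old state and the candidate.
        return [(1.0 - zi) * ni + zi * hi for zi, ni, hi in zip(z, n, h)]

    def predict_proba(self, tokens):
        """Read the tokens in order and return class probabilities."""
        h = [0.0] * self.hidden_size
        for token in tokens:
            h = self.step(self.embeddings.get(token, self.unknown), h)
        logits = matvec(self.out, h)
        top = max(logits)  # subtract the maximum for numerical stability
        exps = [math.exp(v - top) for v in logits]
        total = sum(exps)
        return [e / total for e in exps]


# Toy stand-ins for pre-trained patent embeddings (values are made up).
toy_embeddings = {
    "semiconductor": [0.2, -0.1, 0.4],
    "device": [0.1, 0.3, -0.2],
    "compound": [-0.3, 0.2, 0.1],
}
model = GRUClassifier(toy_embeddings, hidden_size=4, num_classes=3)
probs = model.predict_proba(["a", "semiconductor", "device"])
print(probs)
```

Because the weights are random, the printed probabilities carry no meaning. In practice the embedding table would be loaded from the pre-trained vectors, and the GRU and output weights would be learned from labeled patents with a deep learning framework.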