Julian Risch and Ralf Krestel
Our article "Domain-Specific Word Embeddings for Patent Classification" has been accepted for publication in the Data Technologies and Applications Journal. This article is part of our ongoing research in the domain of patent analysis. It extends our previous conference paper "Learning Patent Speak: Investigating Domain-Specific Word Embeddings" with a detailed error analysis and further investigates our model's strengths and weaknesses. The pre-print of the journal article can be found here.
Patent offices and other stakeholders in the patent domain need to classify patent applications according to a standardized classification scheme. To examine the novelty of an application, it can then be compared to previously granted patents in the same class. Automatic classification would be highly beneficial because of the large volume of patents and the domain-specific knowledge needed to accomplish this costly manual task. However, a challenge for automation is patent-specific language use, such as special vocabulary and phrases. To account for this language use, we present domain-specific pre-trained word embeddings for the patent domain. We train our model on a very large dataset of more than 5 million patents and evaluate it on the task of patent classification. To this end, we propose a deep learning approach for automatic patent classification that is based on gated recurrent units and built on the trained word embeddings. Experiments on a standardized evaluation dataset show that our approach increases average precision for patent classification by 17 percent compared to state-of-the-art approaches. In this paper, we further investigate the model’s strengths and weaknesses. An extensive error analysis reveals that the learned embeddings indeed mirror patent-specific language use. Imbalanced training data and underrepresented classes remain the most difficult challenges.
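For readers unfamiliar with gated recurrent units, the following is a minimal sketch of how a GRU can turn a sequence of word embeddings into class probabilities. It is not the model from the article: the class name `TinyGRUClassifier`, the dimensions, the random weights standing in for pre-trained embeddings, and the toy vocabulary are all assumptions made for illustration. Biases and training are omitted, and the update rule uses one common GRU convention.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def matvec(matrix, vector):
    # Plain matrix-vector product on nested lists.
    return [sum(w * v for w, v in zip(row, vector)) for row in matrix]


def random_matrix(rows, cols, rng):
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


class TinyGRUClassifier:
    """Hypothetical, untrained GRU classifier (illustration only)."""

    def __init__(self, vocab, embed_dim, hidden_dim, num_classes, seed=0):
        rng = random.Random(seed)
        self.vocab = {word: i for i, word in enumerate(vocab)}
        # Assumption: random vectors stand in for pre-trained,
        # domain-specific word embeddings.
        self.embeddings = random_matrix(len(vocab), embed_dim, rng)
        # Input (W) and recurrent (U) weights for the update gate z,
        # the reset gate r, and the candidate state h. Biases are omitted.
        self.W = {g: random_matrix(hidden_dim, embed_dim, rng) for g in "zrh"}
        self.U = {g: random_matrix(hidden_dim, hidden_dim, rng) for g in "zrh"}
        self.out = random_matrix(num_classes, hidden_dim, rng)
        self.hidden_dim = hidden_dim

    def step(self, x, h):
        # One GRU time step: gates decide how much of the old state to keep.
        z = [sigmoid(a + b)
             for a, b in zip(matvec(self.W["z"], x), matvec(self.U["z"], h))]
        r = [sigmoid(a + b)
             for a, b in zip(matvec(self.W["r"], x), matvec(self.U["r"], h))]
        gated = [ri * hi for ri, hi in zip(r, h)]
        cand = [math.tanh(a + b)
                for a, b in zip(matvec(self.W["h"], x), matvec(self.U["h"], gated))]
        return [(1 - zi) * hi + zi * ci for zi, hi, ci in zip(z, h, cand)]

    def predict_proba(self, tokens):
        h = [0.0] * self.hidden_dim
        for tok in tokens:
            if tok in self.vocab:  # out-of-vocabulary tokens are skipped
                h = self.step(self.embeddings[self.vocab[tok]], h)
        # Softmax over the final hidden state gives class probabilities.
        logits = matvec(self.out, h)
        m = max(logits)
        exps = [math.exp(v - m) for v in logits]
        total = sum(exps)
        return [e / total for e in exps]


model = TinyGRUClassifier(
    vocab=["a", "semiconductor", "device", "comprising"],
    embed_dim=4, hidden_dim=3, num_classes=5,
)
probs = model.predict_proba("a semiconductor device".split())
```

Because the weights are random, `probs` carries no meaning beyond being a valid probability distribution over the five hypothetical classes. In practice, the embedding table would be loaded from trained vectors and the remaining weights learned from labeled patents.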