Rasheed, A., Borchert, F., Kohlmeyer, L., Henkenjohann, R., Schapranow, M.-P.: A Comparison of Concept Embeddings for German Clinical Corpora. 2021 IEEE International Conference on Bioinformatics and Biomedicine (BIBM). pp. 2314–2321 (2021).
Clinical concept embeddings enable unsupervised learning of relationships among medical concepts. A range of benchmarks quantifies the degree to which the learned representations capture medical semantics. However, training and evaluating embeddings requires a large amount of data. In addition, the benchmark scores of embeddings vary across languages because they depend on the size of the available corpora. Multi-modal data increases the corpus size, but data protection regulations limit access to clinical multi-modal data. We present an extensible pipeline for training clinical concept embeddings on various text corpora and evaluating the quality of the trained embeddings on selected benchmark tasks. Our work provides different ways to identify clinical concepts in textual corpora. We train embeddings on selected German clinical text corpora and evaluate them on various benchmarks. Our work can be extended to train embeddings in other languages for which no large multi-modal dataset is available.
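The three stages named in the abstract (identifying concepts in text, training embeddings, and evaluating them) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the dictionary `CONCEPT_DICT`, the concept identifiers such as `C_MI`, the toy sentences, and the function names are hypothetical, and simple co-occurrence count vectors stand in for a trained embedding model. It does not reproduce the authors' pipeline, which the abstract does not specify in detail.

```python
import re
from collections import Counter, defaultdict
from math import sqrt

# Hypothetical term-to-concept dictionary; the IDs are illustrative
# placeholders, not codes from a real terminology.
CONCEPT_DICT = {
    "herzinfarkt": "C_MI",
    "myokardinfarkt": "C_MI",
    "brustschmerz": "C_CHEST_PAIN",
    "diabetes": "C_DIABETES",
    "insulin": "C_INSULIN",
}


def extract_concepts(sentence):
    """Stage 1: identify concepts by exact dictionary lookup on tokens."""
    tokens = re.findall(r"\w+", sentence.lower())
    return [CONCEPT_DICT[t] for t in tokens if t in CONCEPT_DICT]


def train_cooccurrence_vectors(corpus):
    """Stage 2: build one sparse count vector per concept.

    Each dimension counts how often another concept appears in the
    same sentence. This is a stand-in for a learned embedding.
    """
    vectors = defaultdict(Counter)
    for sentence in corpus:
        concepts = extract_concepts(sentence)
        for i, concept in enumerate(concepts):
            for j, other in enumerate(concepts):
                if i != j:
                    vectors[concept][other] += 1
    return vectors


def cosine(u, v):
    """Stage 3 helper: cosine similarity between two sparse vectors."""
    dot = sum(u[k] * v[k] for k in u if k in v)
    norm_u = sqrt(sum(x * x for x in u.values()))
    norm_v = sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


# Toy German sentences, invented for this example.
corpus = [
    "Patient mit Herzinfarkt und Brustschmerz.",
    "Myokardinfarkt mit Brustschmerz.",
    "Diabetes wird mit Insulin behandelt.",
    "Brustschmerz bei Diabetes.",
]

vectors = train_cooccurrence_vectors(corpus)
sim_mi_diabetes = cosine(vectors["C_MI"], vectors["C_DIABETES"])
sim_mi_insulin = cosine(vectors["C_MI"], vectors["C_INSULIN"])
print(sim_mi_diabetes, sim_mi_insulin)
```

In this toy setting, a benchmark would compare such similarity scores against reference judgments of concept relatedness. Both synonyms "Herzinfarkt" and "Myokardinfarkt" map to the same identifier, which shows why the choice of concept identification method affects how much training signal each concept receives.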