1.
Bressem, K.K., Papaioannou, J.-M., Grundmann, P., Borchert, F., Adams, L.C., Liu, L., Busch, F., Xu, L., Loyen, J.P., Niehues, S.M., Augustin, M., Grosser, L., Makowski, M.R., Aerts, H.J., Löser, A.: medBERT.de: A Comprehensive German BERT Model for the Medical Domain. Expert Systems with Applications. 121598 (2023).
This paper presents medBERT.de, a pre-trained German BERT model specifically designed for the German medical domain. The model has been trained on a large corpus of 4.7 million German medical documents and has been shown to achieve new state-of-the-art performance on eight different medical benchmarks covering a wide range of disciplines and medical document types. In addition to evaluating the overall performance of the model, this paper also conducts a more in-depth analysis of its capabilities. We investigate the impact of data deduplication on the model's performance, as well as the potential benefits of using more efficient tokenization methods. Our results indicate that domain-specific models such as medBERT.de are particularly useful for longer texts, and that deduplication of training data does not necessarily lead to improved performance. Furthermore, we found that efficient tokenization plays only a minor role in improving model performance, and attribute most of the improved performance to the large amount of training data. To encourage further research, the pre-trained model weights and new benchmarks based on radiological data are made publicly available for use by the scientific community.
2.
Schapranow, M.-P., Borchert, F., Bougatf, N., Hund, H., Eils, R.: Software-Tool Support for Collaborative, Virtual, Multi-Site Molecular Tumor Boards. SN Computer Science. 4, 358 (2023).
3.
Borchert, F., Llorca, I., Schapranow, M.-P.: Cross-Lingual Candidate Retrieval and Re-ranking for Biomedical Entity Linking. In: Arampatzis, A., Kanoulas, E., Tsikrika, T., Vrochidis, S., Giachanou, A., Li, D., Aliannejadi, M., Vlachos, M., Faggioli, G., and Ferro, N. (eds.) Experimental IR Meets Multilinguality, Multimodality, and Interaction. pp. 135–147. Springer Nature Switzerland, Cham (2023).
4.
Ladas, N., Borchert, F., Franz, S., Rehberg, A., Strauch, N., Sommer, K.K., Marschollek, M., Gietzelt, M.: Programming techniques for improving rule readability for rule-based information extraction natural language processing pipelines of unstructured and semi-structured medical texts. Health Informatics Journal. 29, 14604582231164696 (2023).
5.
Steinwand, S., Borchert, F., Winkler, S., Schapranow, M.-P.: GGTWEAK: Gene Tagging with Weak Supervision for German Clinical Text. In: Juarez, J.M., Marcos, M., Stiglic, G., and Tucker, A. (eds.) Artificial Intelligence in Medicine. pp. 183–192. Springer Nature Switzerland, Cham (2023).
6.
Hugo, J., Ibing, S., Borchert, F., Sachs, J.P., Cho, J., Ungaro, R.C., Böttinger, E.P.: Machine Learning Based Prediction of Incident Cases of Crohn’s Disease Using Electronic Health Records from a Large Integrated Health System. In: Juarez, J.M., Marcos, M., Stiglic, G., and Tucker, A. (eds.) Artificial Intelligence in Medicine. pp. 293–302. Springer Nature Switzerland, Cham (2023).
7.
Kämmer, N., Borchert, F., Winkler, S., de Melo, G., Schapranow, M.-P.: Resolving Elliptical Compounds in German Medical Text. The 22nd Workshop on Biomedical Natural Language Processing and BioNLP Shared Tasks. pp. 292–305. Association for Computational Linguistics, Toronto, Canada (2023).
8.
Llorca, I., Borchert, F., Schapranow, M.-P.: A Meta-dataset of German Medical Corpora: Harmonization of Annotations and Cross-corpus NER Evaluation. Proceedings of the 5th Clinical Natural Language Processing Workshop. pp. 171–181. Association for Computational Linguistics, Toronto, Canada (2023).
Over the last few years, an increasing number of publicly available, semantically annotated medical corpora have been released for the German language. While their annotations cover comparable semantic classes, the synergies of such efforts have not yet been explored. This is due to substantial differences in the data schemas (syntax) and annotated entities (semantics), which hinder the creation of common meta-datasets. For instance, it is unclear whether named entity recognition (NER) taggers trained on one or more of such datasets are useful to detect entities in any of the other datasets. In this work, we create harmonized versions of German medical corpora using the BigBIO framework, and make them available to the community. Using these as a meta-dataset, we perform a series of cross-corpus evaluation experiments on two settings of aligned labels. These consist of fine-tuning various pre-trained Transformers on different combinations of training sets, and testing them against each dataset separately. We find that a) trained NER models generalize poorly, with F1 scores dropping by approximately 20 percentage points on unseen test data, and b) current pre-trained Transformer models for the German language do not systematically alleviate this issue. However, our results suggest that models benefit from additional training corpora in most cases, even if these belong to different medical fields or text genres.
9.
Richter-Pechanski, P., Wiesenbach, P., Schwab, D.M., Kiriakou, C., He, M., Allers, M.M., Tiefenbacher, A.S., Kunz, N., Martynova, A., Spiller, N., Mierisch, J., Borchert, F., Schwind, C., Frey, N., Dieterich, C., Geis, N.A.: A Distributable German Clinical Corpus Containing Cardiovascular Clinical Routine Doctor’s Letters. Scientific Data. 10, 207 (2023).
We present CARDIO:DE, the first freely available and distributable large German clinical corpus from the cardiovascular domain. CARDIO:DE encompasses 500 clinical routine German doctor's letters from Heidelberg University Hospital, which were manually annotated. Our prospective study design complies well with current data protection regulations and allows us to keep the original structure of clinical documents consistent. In order to ease access to our corpus, we manually de-identified all letters. To enable various information extraction tasks, the temporal information in the documents was preserved. We added two high-quality manual annotation layers to CARDIO:DE: (1) medication information and (2) CDA-compliant section classes. To the best of our knowledge, CARDIO:DE is the first freely available and distributable German clinical corpus in the cardiovascular domain. In summary, our corpus offers unique opportunities for collaborative and reproducible research on natural language processing models for German clinical texts.