Hasso-Plattner-Institut: 25 Years of HPI

Improving the Linguistic Capabilities of Vision-and-Language Models

Marco Cipriano

Chair of Artificial Intelligence and Intelligent Systems
Hasso Plattner Institute

Supervisor: Prof. Dr. Gerard De Melo



Tel: +49 0331 5509-3469


Research Interests

My research focuses on improving the performance of Vision-and-Language models. In particular, I am currently working on Visual Question Answering (VQA) for medical imaging. I am also interested in multilingual models.

> Vision-and-Language Models

> Visual Question Answering

> Computer Vision

> Medical Imaging



Medical Visual Question Answering and Segmentation


Visual Question Answering (VQA) is a task in which a model is given an image and a natural-language question about it, and must generate a natural-language answer. VQA is a challenging multimodal task that requires a deep understanding of both the visual and textual information in the input. In recent years, there has been growing interest in applying VQA to medical data [1, 2, 3]. This has considerable potential to benefit healthcare systems, as it may aid clinicians in interpreting medical images, obtaining more accurate diagnoses, and, ultimately, improving patient care. Two main benchmarks are commonly used in the literature for evaluating medical VQA models: VQA-RAD [4], which consists of 315 medical images and 3,515 question-answer pairs, and SLAKE [5], with 642 images and more than 7,000 question-answer pairs in English and Chinese. Both datasets cover different acquisition modalities and organs.
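Medical VQA benchmarks such as VQA-RAD and SLAKE are commonly scored by exact-match accuracy between the generated answer and the reference, after some string normalization. A minimal sketch of such an evaluation (the function names and normalization rules here are illustrative, not the benchmarks' official scripts):

```python
import re

def normalize_answer(ans: str) -> str:
    """Lowercase, strip punctuation, and collapse whitespace for comparison."""
    ans = ans.lower().strip()
    ans = re.sub(r"[^\w\s]", "", ans)
    return re.sub(r"\s+", " ", ans)

def exact_match_accuracy(predictions, references):
    """Fraction of predictions that exactly match the reference answer
    after normalization."""
    assert len(predictions) == len(references)
    hits = sum(
        normalize_answer(p) == normalize_answer(r)
        for p, r in zip(predictions, references)
    )
    return hits / len(predictions)

# Toy example with hypothetical model outputs:
# "MRI"/"mri" and "no"/"No" match; "the liver"/"liver" does not.
preds = ["MRI", "the liver", "no"]
refs = ["mri", "liver", "No"]
print(exact_match_accuracy(preds, refs))  # 2/3
```

In practice, closed questions (yes/no, multiple choice) and open questions are often reported separately, since open answers are much harder to match exactly.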

An example of an input image-text pair for different medical VQA models, together with their answers (image from [3])

Segmentation is the process of dividing an image into multiple segments or regions, each of which corresponds to a different object or structure in the image. In medical imaging, segmentation is used to identify and isolate specific structures of interest, such as organs, tumors, or blood vessels. The task is important for diagnosis and treatment planning, as it allows clinicians to measure and analyze the structures of interest more accurately. 
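Segmentation quality is typically measured by the overlap between the predicted and reference masks; the Dice coefficient is the standard metric. A self-contained sketch (a generic implementation, not the project's actual evaluation code):

```python
import numpy as np

def dice_score(pred: np.ndarray, target: np.ndarray, eps: float = 1e-8) -> float:
    """Dice coefficient between two binary masks: 2*|A∩B| / (|A| + |B|).

    `eps` avoids division by zero when both masks are empty.
    """
    pred = pred.astype(bool)
    target = target.astype(bool)
    intersection = np.logical_and(pred, target).sum()
    return (2.0 * intersection + eps) / (pred.sum() + target.sum() + eps)

# Two overlapping 4x4 toy masks: 4 and 6 foreground pixels, 4 shared.
a = np.zeros((4, 4), dtype=np.uint8); a[1:3, 1:3] = 1
b = np.zeros((4, 4), dtype=np.uint8); b[1:3, 1:4] = 1
print(round(dice_score(a, b), 3))  # 2*4 / (4+6) = 0.8
```

For multi-organ segmentation, the same score is usually computed per organ label and then averaged.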

Since most publicly available annotated datasets for medical segmentation are specific to a single target, accurately segmenting multiple organs from a single image is a challenge that is still being addressed in the most recent literature [6]. 

This project aims to enable VQA with better segmentation capabilities. Our motivation is that locating specific organs or abnormalities can be a crucial factor for a VQA model to correctly identify the answer to medical questions.


We created a large multi-organ dataset by collecting, extracting, and merging annotated images from numerous public 3D datasets. We extracted 2D slices from 3D CT and MRI scans to create a dataset of more than 23,000 annotated images covering the brain, heart, spleen, kidneys, bladder, lungs, and liver. We plan to increase the number of images and covered organs in the future.
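The slice-extraction step can be sketched as follows: given a 3D volume and its annotation mask, keep only the 2D slices in which the target structure is actually visible. This is a simplified illustration (axis choice, threshold, and function names are assumptions, not the project's exact pipeline):

```python
import numpy as np

def extract_labeled_slices(volume: np.ndarray, mask: np.ndarray,
                           min_pixels: int = 1):
    """Yield (image_slice, mask_slice) pairs along axis 0 for every slice
    whose annotation covers at least `min_pixels` pixels."""
    assert volume.shape == mask.shape
    for z in range(volume.shape[0]):
        if (mask[z] > 0).sum() >= min_pixels:
            yield volume[z], mask[z]

# Toy 3D "scan": 5 slices of 8x8, with the organ annotated in slices 2 and 3.
vol = np.random.rand(5, 8, 8)
msk = np.zeros((5, 8, 8), dtype=np.uint8)
msk[2, 3:5, 3:5] = 1
msk[3, 3:6, 3:6] = 1
pairs = list(extract_labeled_slices(vol, msk))
print(len(pairs))  # 2 annotated slices kept
```

Filtering on `min_pixels` also allows discarding slices where the organ is barely visible, which would otherwise add noisy training examples.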

Generating this dataset required careful, laborious work: we had to account for label and background inconsistencies as well as differing value distributions between datasets. The original data also came in NIfTI or DICOM format, requiring additional preprocessing.
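Harmonizing value distributions across CT datasets is often done by windowing: clipping Hounsfield units to a fixed range and rescaling to [0, 1]. A minimal sketch, assuming a common soft-tissue window (center 40 HU, width 400 HU), which is not necessarily the configuration used in this project:

```python
import numpy as np

def window_and_rescale(ct_slice: np.ndarray,
                       center: float = 40.0, width: float = 400.0) -> np.ndarray:
    """Clip CT Hounsfield units to [center - width/2, center + width/2]
    and linearly rescale the result to [0, 1]."""
    lo, hi = center - width / 2, center + width / 2
    clipped = np.clip(ct_slice.astype(np.float32), lo, hi)
    return (clipped - lo) / (hi - lo)

# Air (-1000), water (0), soft tissue (40), and bone-like (500) values.
hu = np.array([[-1000.0, 0.0], [40.0, 500.0]])
print(window_and_rescale(hu))
```

MRI intensities are not on a calibrated scale like Hounsfield units, so MRI slices typically need a different normalization (e.g. per-volume percentile scaling) than the CT windowing shown here.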

The current target organs were chosen to reflect the most common and relevant organs for the VQA benchmarks.

Overview of the different data sources used in this project, sorted by organ type. Entries marked with a star required an additional sanity check.

Current state of the project

We have a working multi-organ segmentation model that we are currently using to enable VQA with better segmentation capabilities and organ awareness.
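One simple way such organ awareness could be injected is to translate the predicted segmentation mask into organ names and expose them to the VQA model as extra textual context. A hedged sketch of this idea (the label map, threshold, and prompt format are illustrative assumptions, not the project's actual interface):

```python
import numpy as np

# Hypothetical label map matching the organs covered by our dataset.
LABELS = {1: "brain", 2: "heart", 3: "spleen", 4: "kidney",
          5: "bladder", 6: "lung", 7: "liver"}

def organs_in_mask(mask: np.ndarray, min_pixels: int = 10):
    """Return the organ names whose predicted label covers at least
    `min_pixels` pixels in the segmentation mask."""
    labels, counts = np.unique(mask, return_counts=True)
    return sorted(LABELS[l] for l, c in zip(labels, counts)
                  if l in LABELS and c >= min_pixels)

def augment_question(question: str, mask: np.ndarray) -> str:
    """Prepend the detected organs to the question as extra context
    for the VQA model (one possible way to inject organ awareness)."""
    organs = organs_in_mask(mask)
    if not organs:
        return question
    return f"Visible organs: {', '.join(organs)}. {question}"

# Toy predicted mask: a 400-pixel "liver" region and a 16-pixel "heart" region.
mask = np.zeros((64, 64), dtype=np.uint8)
mask[10:30, 10:30] = 7
mask[40:44, 40:44] = 2
print(augment_question("Is the liver enlarged?", mask))
```

Alternatives to prompt augmentation include feeding the mask as an extra input channel or fusing segmentation features directly into the VQA model's visual encoder.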


Qualitative results of the segmentation model on our test set. Brain in red, kidney in dark red, heart in green, lungs in dark green, liver in purple.


1. Binh D Nguyen, Thanh-Toan Do, Binh X Nguyen, Tuong Do, Erman Tjiputra, and Quang D Tran. Overcoming data limitation in medical visual question answering. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pages 522–530. Springer, 2019.

2. Jin-Hwa Kim, Jaehyun Jun, and Byoung-Tak Zhang. Bilinear attention networks. arXiv preprint arXiv:1805.07932, 2018. 

3. Sedigheh Eslami, Gerard de Melo, and Christoph Meinel. Does CLIP benefit visual question answering in the medical domain as much as it does in the general domain? arXiv preprint arXiv:2112.13906, 2021.

4. Jason J Lau, Soumya Gayen, Asma Ben Abacha, and Dina Demner-Fushman. A dataset of clinically generated visual questions and answers about radiology images. Scientific data, 5(1):1–10, 2018. 

5. Bo Liu, Li-Ming Zhan, Li Xu, Lin Ma, Yan Yang, and Xiao-Ming Wu. Slake: A semantically-labeled knowledge- enhanced dataset for medical visual question answering. In 2021 IEEE 18th International Symposium on Biomedical Imaging (ISBI), pages 1650–1654. IEEE, 2021a. 

6. J. Liu, Y. Zhang, J. N. Chen, J. Xiao, Y. Lu, B. A. Landman, et al. CLIP-driven universal model for organ segmentation and tumor detection. arXiv preprint arXiv:2301.00785, 2023.



Publications

  1. Cipriano, M., Allegretti, S., Bolelli, F., Pollastri, F., & Grana, C. (2022). Improving segmentation of the inferior alveolar nerve through deep label propagation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.
  2. Cipriano, M., Allegretti, S., Bolelli, F., Di Bartolomeo, M., Pollastri, F., Pellacani, A., ... & Grana, C. (2022). Deep segmentation of the mandibular canal: A new 3D annotated dataset of CBCT volumes. IEEE Access.
  3. Pollastri, F., Cipriano, M., Bolelli, F., & Grana, C. (2022, March). Long-range 3D self-attention for MRI prostate segmentation. In 2022 IEEE 19th International Symposium on Biomedical Imaging (ISBI). IEEE.
  4. Mercadante, C., Cipriano, M., Bolelli, F., Pollastri, F., Anesi, A., & Grana, C. (2021). A cone beam computed tomography annotation tool for automatic detection of the inferior alveolar nerve canal. In 16th International Conference on Computer Vision Theory and Applications (VISAPP 2021). SciTePress.
  5. Vincenzi, S., Porrello, A., Buzzega, P., Cipriano, M., Fronte, P., Cuccu, R., ... & Calderara, S. (2021, January). The color out of space: Learning self-supervised representations for earth observation imagery. In 2020 25th International Conference on Pattern Recognition (ICPR). IEEE.