HPI Digital Health Cluster

Unsupervised Subgroup Detection for Mixed-Type Systems Medicine Data Sets

Milena Kraus

Unsupervised clustering of omics data can reveal previously unknown cancer subtypes that explain clinical observations such as survival time. Many methods for integrating and analyzing cancer omics data sets have been developed recently. However, systems medicine consortia also focus on other complex diseases. In addition to high-throughput molecular omics data, the data they acquire therefore include environmental factors as well as anthropometric and clinical measures. Many of these data sources are of mixed type, i.e., they contain continuous (e.g., expression data), discrete (e.g., mutation yes/no, read counts), and categorical (e.g., ethnicity, previous diagnoses) values. Integrating these additional detailed patient characteristics demands the extension of existing methods or the development of new ones for analyzing complex systems medicine data sets.

Our first objective is therefore to describe the current landscape of clustering approaches for biomedical mixed-type data sets. We rely on the methodological reviews by Huang et al. (2017) and Bersanelli et al. (2016), both of which focus on methods for omics data. We extend their work by pinpointing the methods that can operate on mixed-type data sets and by introducing twelve additional methods. We also share details on algorithms that were developed in other domains, e.g., social studies or computer science, to tackle the problem of mixed-type data clustering. Furthermore, we provide an outlook on the development of new tools that perform disease subgroup detection on systems medicine data sets. As a result, we highlight theoretical strengths and flaws of the existing methods and show preliminary results of a practical evaluation on a systems medicine data set assessed within the Systems Medicine Approach for Heart Failure (SMART) project.
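
To illustrate what clustering on mixed-type data involves, the following minimal sketch computes a Gower-style dissimilarity, one commonly used way to compare samples that have both continuous and categorical features. It is an illustrative assumption only: the abstract does not state which dissimilarity measures or methods are reviewed, and the function name `gower_distance` and the toy values are hypothetical.

```python
import numpy as np


def gower_distance(numeric, categorical):
    """Pairwise Gower-style dissimilarity between n samples.

    numeric:     (n, p) array of continuous or count features
    categorical: (n, q) array of category labels
    Returns an (n, n) matrix with values in [0, 1].
    """
    numeric = np.asarray(numeric, dtype=float)
    categorical = np.asarray(categorical)

    # Range-normalise each numeric feature so every feature contributes
    # a value in [0, 1]; constant features get a range of 1 to avoid 0/0.
    ranges = numeric.max(axis=0) - numeric.min(axis=0)
    ranges[ranges == 0] = 1.0
    num_part = np.abs(numeric[:, None, :] - numeric[None, :, :]) / ranges

    # Categorical features contribute 0 on a match and 1 on a mismatch.
    cat_part = (categorical[:, None, :] != categorical[None, :, :]).astype(float)

    # Average the per-feature contributions over all features.
    n_features = numeric.shape[1] + categorical.shape[1]
    return (num_part.sum(axis=2) + cat_part.sum(axis=2)) / n_features


# Hypothetical toy data: one continuous measure and one categorical label.
expression = [[0.0], [10.0], [5.0]]
diagnosis = [["a"], ["b"], ["a"]]
D = gower_distance(expression, diagnosis)
```

The resulting matrix `D` could then be passed to any clustering method that accepts precomputed dissimilarities, for example hierarchical clustering.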