Hasso-Plattner-Institut
Prof. Dr. Christoph Lippert
 

Causal inference in prediction models

In epidemiology, causal inference and prediction modeling methodologies have historically been distinct. Directed Acyclic Graphs (DAGs) are used to model a priori causal assumptions and to inform variable selection strategies for causal questions. Although tools originally designed for prediction are finding applications in causal inference, the converse, using causal tools such as DAGs for prediction, has remained largely unexplored. The aim of this theoretical and simulation-based study is to assess the potential benefit of using DAGs in clinical risk prediction modeling.

The results show that a single-predictor model in the causal direction is likely to have better transportability than one in the anticausal direction in some scenarios. We empirically show that the Markov Blanket, the set of variables including the parents, children, and parents of the children of the outcome node in a DAG, is the optimal set of predictors for that outcome.
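The Markov Blanket definition above can be made concrete with a minimal sketch. It assumes the DAG is given as a dictionary that maps each node to the set of its parents. The function name `markov_blanket` and the toy graph (nodes such as `age` or `symptom`) are hypothetical illustrations and are not taken from the study.

```python
def markov_blanket(parents, target):
    """Return the Markov Blanket of `target` in a DAG.

    `parents` maps every node to the set of its direct causes (parents).
    """
    direct_causes = set(parents[target])
    children = {node for node, pa in parents.items() if target in pa}
    # Other parents of the target's children (sometimes called spouses).
    co_parents = {p for child in children for p in parents[child]} - {target}
    return direct_causes | children | co_parents


# Hypothetical toy DAG, for illustration only.
toy_dag = {
    "age": set(),
    "smoking": set(),
    "medication": set(),
    "disease": {"age", "smoking"},
    "symptom": {"disease", "medication"},
    "test_result": {"symptom"},
}

blanket = markov_blanket(toy_dag, "disease")
print(sorted(blanket))  # ['age', 'medication', 'smoking', 'symptom']
```

In this toy graph, `test_result` is a descendant of the outcome but not a child, so it falls outside the blanket, while `medication` enters only because it shares the child `symptom` with the outcome.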

These findings provide a theoretical basis for the intuition that a diagnostic clinical risk prediction model including causes as predictors is likely to be more transportable. Furthermore, using DAGs to identify Markov Blanket variables may be a useful, efficient strategy to select predictors in clinical risk prediction models if strong knowledge of the underlying causal structure exists or can be learned.

In a current application, we have proposed a causal framework to investigate the transportability of prediction models for Alzheimer's disease in simulated external settings with different distributions of demographic and clinical characteristics. In an ongoing follow-up project, we are investigating the transportability of these prediction models empirically, using different populations from studies in the US and South Korea.

Further, we are focusing on prognostic clinical risk prediction models for endometriosis. We are performing a systematic review of existing prediction models, externally validating them on data from UK Biobank, Mount Sinai, health insurance datasets, and NAKO data, and then updating and further developing them.

 

References: 

  • Piccininni M, Konigorski S, Rohmann JL, Kurth T (2020). Directed Acyclic Graphs and causal thinking in clinical risk prediction modeling. BMC Medical Research Methodology 20: 179. https://doi.org/10.1186/s12874-020-01058-z.
  • Fehr J, Piccininni M, Kurth T, Konigorski S (2023). Assessing the transportability of clinical prediction models for cognitive impairment using causal models. BMC Medical Research Methodology 23: 187. https://doi.org/10.1186/s12874-023-02003-6.

 


Transportability of genetic/transcriptomic prediction models

Genome-wide association studies (GWAS) have identified a large number of genetic variants associated with phenotypes. However, GWAS loci are often difficult to interpret, and knowledge about the underlying causal pathways is frequently not available. Transcriptome-wide association studies (TWAS) have been proposed to address this; they employ prediction models for gene expression in order to test for associations with phenotypes. However, the reference panels for TWAS have mostly been composed of individuals of European ancestry. Here, we evaluate the transportability of such existing prediction models to data from the Korean population and highlight the need for population-specific prediction models.

 


Validation and development of prediction models for endometriosis

Menstrual pain is one of the most common health problems in young women, with an estimated prevalence of 67% to 75%. Pronounced menstrual pain is an early symptom of endometriosis, a complex hormone-dependent disease in which endometrial-like tissue grows outside the uterus. Endometriosis often goes undiagnosed, and when it is diagnosed, the diagnosis is frequently delayed by a long time. The disease therefore places a high burden on affected women as well as on the health system. We focus here on prognostic prediction models for the risk of endometriosis, a highly underresearched field. We perform a systematic literature review to identify existing risk prediction models for endometriosis, extract the published models, and externally validate them on data from the UK Biobank, electronic health records from Mount Sinai Hospital, health insurance data from Germany, and data from the NAKO Gesundheitsstudie.

 

Collaboration partners:

  • Health insurers: BARMER, DAK-Gesundheit, Techniker Krankenkasse
  • Vandage GmbH
  • WebMen Internet GmbH
  • Charité University Medicine Berlin  

Estimating and testing in directed acyclic graphs

Overview:

In genetic association studies and in association studies in general, it is important to distinguish direct and indirect effects in order to build truly functional models. For this purpose, we consider a directed acyclic graph setting with interventions (here: genetic variants), primary and intermediate outcomes, and confounding factors.

In order to make valid statistical inference on direct genetic effects on the primary outcome variable, it is necessary to consider all potential effects in the graph, and we propose to use the estimating equations method with robust Huber–White sandwich standard errors. We evaluate the proposed causal inference based on estimating equations (CIEE) method and compare it with traditional multiple regression methods, the structural equation modeling method, and sequential G-estimation methods through a simulation study for the analysis of (completely observed) quantitative traits and time-to-event traits subject to censoring as primary outcome variables.
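To illustrate the idea behind robust Huber–White sandwich standard errors, the following is a minimal sketch for the simplest case: a least-squares regression of a quantitative trait on a single predictor. It is not the CIEE implementation, which obtains sandwich standard errors from estimating equations covering all effects in the graph. The function name `slope_with_robust_se` and the genotype and trait values are hypothetical and serve only as an example.

```python
import math


def slope_with_robust_se(x, y):
    """Least-squares slope of y on x with a Huber-White (HC0) standard error."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    dx = [xi - x_bar for xi in x]
    sxx = sum(d * d for d in dx)
    slope = sum(d * (yi - y_bar) for d, yi in zip(dx, y)) / sxx
    intercept = y_bar - slope * x_bar
    residuals = [yi - intercept - slope * xi for xi, yi in zip(x, y)]
    # Sandwich: "bread" 1/sxx on both sides of the "meat" sum(dx_i^2 * e_i^2).
    meat = sum((d * e) ** 2 for d, e in zip(dx, residuals))
    robust_se = math.sqrt(meat) / sxx
    return slope, robust_se


# Hypothetical data: allele counts (0, 1, 2) and a quantitative trait.
genotype = [0, 0, 1, 1, 2, 2]
trait = [1.0, 1.4, 1.9, 2.3, 2.8, 3.5]

slope, robust_se = slope_with_robust_se(genotype, trait)
wald_z = slope / robust_se  # Wald statistic for testing a zero slope
print(slope, robust_se, wald_z)
```

Unlike the classical standard error, the sandwich form weights each squared residual by its own squared deviation in the predictor, so it does not rely on a constant residual variance.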

The results show that CIEE provides valid estimators and inference by successfully removing the effect of intermediate variables from the primary outcome and is robust against measured and unmeasured confounding of the indirect effect through observed factors. All other methods except the sequential G-estimation method for quantitative traits fail in some scenarios where their test statistics yield inflated type I errors. In the analysis of the Genetic Analysis Workshop 19 dataset, we estimate and test genetic effects on blood pressure accounting for intermediate gene expression. The results show that CIEE can identify genetic variants that would be missed by traditional regression analyses. CIEE is computationally fast, widely applicable to different fields, and available as an R package.

 

References:

  • Konigorski S, Wang Y, Cigsar C, Yilmaz YE (2018). Estimating and testing direct genetic effects in directed acyclic graphs using estimating equations. Genetic Epidemiology 42: 174–186. https://doi.org/10.1002/gepi.22107.
  • Konigorski S, Yilmaz YE. CIEE: Estimating and testing direct effects in directed acyclic graphs using estimating equations. R package version 0.1.1. https://CRAN.R-project.org/package=CIEE.
  • Konigorski S (2021). Causal inference in developmental medicine and neurology. Developmental Medicine & Child Neurology 63(5): 498. https://doi.org/10.1111/dmcn.14813.

 

Team: