Statistical methods for causal inference and bias correction in genetic studies

Wang, Anqi; 王安琪

Appears in Collections:
- HKU Theses Online
- Statistics & Actuarial Science: Theses

postgraduate thesis: Statistical methods for causal inference and bias correction in genetic studies

Title	Statistical methods for causal inference and bias correction in genetic studies
Authors	Wang, Anqi 王安琪
Issue Date	2024
Publisher	The University of Hong Kong (Pokfulam, Hong Kong)
Citation	Wang, A. [王安琪]. (2024). Statistical methods for causal inference and bias correction in genetic studies. (Thesis). University of Hong Kong, Pokfulam, Hong Kong SAR.
Abstract	This thesis explores two topics in genetic studies: causal inference and bias correction. Understanding the biological mechanisms behind complex traits is crucial in genetic studies. Mendelian Randomization (MR) and Transcriptome-wide Association Studies (TWAS) leverage the increasing availability of genome-wide association study (GWAS) datasets to enhance our understanding of the causal relationships between exposures and target traits within the framework of causal inference. Meanwhile, outcome misclassification in Electronic Health Records (EHR) presents another prevalent challenge in genetic studies, particularly within the context of bias correction, because it makes the effects of predictors on the target trait difficult to estimate accurately.
Specifically, MR is a statistical method that exploits genetic variants as instrumental variables to infer the causal relationships between modifiable risk factors and the outcomes of interest. Although various two-sample MR methods based on GWAS summary-level data are widely used, they can suffer from power loss and/or biased inference when the chosen genetic variants are in linkage disequilibrium (LD) and when the variants have relatively large direct effects on the outcome whose distribution might be heavy-tailed, a phenomenon commonly referred to as idiosyncratic pleiotropy. To resolve these two issues, we propose a novel Robust Bayesian Mendelian Randomization (RBMR) model that uses the more robust multivariate generalized t-distribution to model such direct effects in a probabilistic framework that can also incorporate the LD structure explicitly. The generalized t-distribution can be represented as a Gaussian scale mixture, so the model parameters can be estimated by expectation-maximization (EM)-type algorithms. We compute the standard errors by calibrating the evidence lower bound using the likelihood ratio test.
Furthermore, TWAS has recently gained popularity as a tool for identifying causal genes, integrating eQTL data with GWAS results to investigate underlying biological mechanisms at a tissue-specific level. Although current TWAS methods have advanced to cover diverse traits, traditional approaches still face two main challenges: (1) the lack of methods that can guarantee finite-sample False Discovery Rate (FDR) control in identifying trait-associated genes; and (2) the requirement for individual-level data, which is often inaccessible. To address these challenges, we propose a powerful knockoff inference method termed TWAS-GKF to identify candidate trait-associated genes with guaranteed finite-sample FDR control. TWAS-GKF adopts the main idea of Ghost-knockoff inference to generate knockoff variables using only summary statistics instead of individual-level data. In extensive studies, we demonstrate that TWAS-GKF successfully controls the finite-sample FDR at a pre-specified level across all settings. We further apply TWAS-GKF to identify genes in brain cerebellum tissue associated with schizophrenia (SCZ) and genes in liver tissue related to low-density lipoprotein cholesterol (LDL-C). The results reveal that the majority of the identified genes are validated by the Open Targets Validation Platform.
Additionally, disease phenotypes derived from EHR are often subject to misclassification. Consequently, naive regression analysis with a misclassified disease phenotype might produce biased association results. To address this critical issue, we propose a novel method to correct bias in association studies with possibly misclassified EHR-derived phenotypes by explicitly modeling the misclassification sensitivity and specificity. Our method does not require any external validation data, which are costly and often unavailable. We formally establish model identification conditions, and then propose a scalable and computationally efficient weighted expectation-maximization algorithm for estimation and inference. Extensive simulation studies show that our proposed method has good empirical performance. We apply our method to the EHR data from the Michigan Genomics Initiative for illustration.
Degree	Doctor of Philosophy
Subject	Genetics - Statistical methods; Causation - Mathematical models; Inference - Mathematical models
Dept/Program	Statistics and Actuarial Science
Persistent Identifier	http://hdl.handle.net/10722/353399
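
As an illustration of the selection step behind the knockoff-based FDR control mentioned in the abstract, the following Python sketch applies the standard knockoff+ threshold to a vector of feature statistics W. It is not the TWAS-GKF implementation: the abstract does not say how TWAS-GKF computes its statistics, so the W values, the function names knockoff_plus_threshold and select_genes, and the target level q = 0.2 are all hypothetical and chosen only for illustration.

```python
from typing import List, Sequence


def knockoff_plus_threshold(w: Sequence[float], q: float) -> float:
    """Return the knockoff+ threshold for feature statistics w at target FDR q.

    The threshold is the smallest t among the nonzero |W_j| such that
    (1 + #{j: W_j <= -t}) / max(1, #{j: W_j >= t}) <= q.
    Returns infinity when no candidate satisfies the bound.
    """
    candidates = sorted({abs(x) for x in w if x != 0})
    for t in candidates:
        n_negative = sum(1 for x in w if x <= -t)  # proxy for false discoveries
        n_positive = sum(1 for x in w if x >= t)   # features that would be selected
        if (1 + n_negative) / max(1, n_positive) <= q:
            return t
    return float("inf")


def select_genes(w: Sequence[float], q: float) -> List[int]:
    """Return the indices of features whose statistic reaches the threshold."""
    t = knockoff_plus_threshold(w, q)
    return [j for j, x in enumerate(w) if x >= t]


# Hypothetical statistics: large positive values suggest a real signal,
# while values near zero (of either sign) behave like nulls.
w_example = [5.1, 4.3, 3.9, 3.2, 2.8, 2.5, -0.4, 0.3, -0.2, 0.1]

print(knockoff_plus_threshold(w_example, q=0.2))  # 2.5
print(select_genes(w_example, q=0.2))             # [0, 1, 2, 3, 4, 5]
print(select_genes(w_example, q=0.1))             # []
```

With these made-up statistics the threshold at q = 0.2 is 2.5 and the first six features are selected. At q = 0.1 nothing is selected, because the "+1" in the numerator keeps the ratio at or above 1 divided by the number of selected features, which here is at best 1/6.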

DC Field	Value	Language
dc.contributor.author	Wang, Anqi	-
dc.contributor.author	王安琪	-
dc.date.accessioned	2025-01-17T09:46:19Z	-
dc.date.available	2025-01-17T09:46:19Z	-
dc.date.issued	2024	-
dc.identifier.citation	Wang, A. [王安琪]. (2024). Statistical methods for causal inference and bias correction in genetic studies. (Thesis). University of Hong Kong, Pokfulam, Hong Kong SAR.	-
dc.identifier.uri	http://hdl.handle.net/10722/353399	-
dc.language	eng	-
dc.publisher	The University of Hong Kong (Pokfulam, Hong Kong)	-
dc.relation.ispartof	HKU Theses Online (HKUTO)	-
dc.rights	The author retains all proprietary rights, (such as patent rights) and the right to use in future works.	-
dc.rights	This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.	-
dc.subject.lcsh	Genetics - Statistical methods	-
dc.subject.lcsh	Causation - Mathematical models	-
dc.subject.lcsh	Inference - Mathematical models	-
dc.title	Statistical methods for causal inference and bias correction in genetic studies	-
dc.type	PG_Thesis	-
dc.description.thesisname	Doctor of Philosophy	-
dc.description.thesislevel	Doctoral	-
dc.description.thesisdiscipline	Statistics and Actuarial Science	-
dc.description.nature	published_or_final_version	-
dc.date.hkucongregation	2025	-
dc.identifier.mmsid	991044897476103414	-
