
postgraduate thesis: Statistical methods for causal inference and bias correction in genetic studies

Title: Statistical methods for causal inference and bias correction in genetic studies
Authors: Wang, Anqi (王安琪)
Issue Date: 2024
Publisher: The University of Hong Kong (Pokfulam, Hong Kong)
Citation: Wang, A. [王安琪]. (2024). Statistical methods for causal inference and bias correction in genetic studies. (Thesis). University of Hong Kong, Pokfulam, Hong Kong SAR.
Abstract: This thesis explores two topics in genetic studies: causal inference and bias correction. Understanding the biological mechanisms behind complex traits is crucial in genetic studies. Mendelian Randomization (MR) and Transcriptome-wide Association Studies (TWAS) leverage the increasing availability of GWAS datasets to enhance our understanding of the causal relationships between exposures and target traits within the framework of causal inference. Meanwhile, outcome misclassification in Electronic Health Records (EHR) presents another prevalent challenge in genetic studies, particularly within the context of bias correction. This issue highlights the difficulties, and the solutions, involved in accurately interpreting the effects of the predictors on the target trait.

Specifically, MR is a statistical method that exploits genetic variants as instrumental variables to infer the causal relationships between modifiable risk factors and the outcomes of interest. Although various two-sample MR methods based on genome-wide association study (GWAS) summary-level data are widely used, they can suffer from power loss and/or biased inference when the chosen genetic variants are in linkage disequilibrium (LD) and also have relatively large direct effects on the outcome whose distribution might be heavy-tailed, a phenomenon commonly referred to as idiosyncratic pleiotropy. To resolve these two issues, we propose a novel Robust Bayesian Mendelian Randomization (RBMR) model that uses the more robust multivariate generalized t-distribution to model such direct effects within a probabilistic framework that can also incorporate the LD structure explicitly. The generalized t-distribution can be represented as a Gaussian scale mixture, so the model parameters can be estimated by expectation-maximization (EM)-type algorithms. We compute the standard errors by calibrating the evidence lower bound using the likelihood ratio test.

Furthermore, TWAS has recently gained popularity as a tool for identifying causal genes, integrating eQTL data with GWAS results to investigate underlying biological mechanisms at a tissue-specific level. Although current TWAS methods have advanced to cover diverse traits, traditional approaches still face two main challenges: (1) the lack of methods that can guarantee finite-sample False Discovery Rate (FDR) control in identifying trait-associated genes; and (2) the requirement for individual-level data, which is often inaccessible. To address these challenges, we propose a powerful knockoff inference method, termed TWAS-GKF, to identify candidate trait-associated genes with guaranteed finite-sample FDR control. TWAS-GKF adopts the main idea of Ghost-knockoff inference, generating knockoff variables using only summary statistics instead of individual-level data. In extensive studies, we demonstrate that TWAS-GKF controls the finite-sample FDR at the pre-specified FDR level across all settings. We further apply TWAS-GKF to identify genes in brain cerebellum tissue associated with schizophrenia (SCZ) and genes in liver tissue related to low-density lipoprotein cholesterol (LDL-C). The results reveal that the majority of the identified genes are validated by the Open Targets Validation Platform.

Additionally, disease phenotypes derived from EHR are often subject to misclassification. Consequently, naive regression analysis with a misclassified disease phenotype might produce biased association results. To address this issue, we propose a novel method to correct bias in association studies with possibly misclassified EHR-derived phenotypes by explicitly modeling the misclassification sensitivity and specificity. Our method does not require external validation data, which are costly and often unavailable. We formally establish model identification conditions and then propose a scalable and computationally efficient weighted expectation-maximization algorithm for estimation and inference. Extensive simulation studies show that our proposed method has good empirical performance. We apply our method to EHR data from the Michigan Genomics Initiative for illustration.
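As a rough illustration of the misclassification setting described at the end of the abstract, the sketch below shows how sensitivity and specificity link the true disease probability to the probability of the recorded phenotype, and how an E-step-style posterior weight could be formed. It is a minimal textbook sketch and not the thesis's weighted EM algorithm: the logistic model, the single predictor `x`, the assumption that sensitivity and specificity do not depend on `x`, and all function and parameter names are assumptions made here for illustration.

```python
import math


def true_prob(x, beta0, beta1):
    """Assumed logistic model for the true (unobserved) disease status."""
    return 1.0 / (1.0 + math.exp(-(beta0 + beta1 * x)))


def observed_prob(x, beta0, beta1, sensitivity, specificity):
    """P(recorded phenotype = 1 | x) when the record can be wrong.

    A true case is recorded as a case with probability `sensitivity`;
    a true non-case is recorded as a case with probability 1 - specificity.
    """
    p = true_prob(x, beta0, beta1)
    return sensitivity * p + (1.0 - specificity) * (1.0 - p)


def posterior_true_case(y_obs, x, beta0, beta1, sensitivity, specificity):
    """P(true status = 1 | recorded phenotype y_obs, x), via Bayes' rule.

    In an EM-type scheme, a quantity like this would act as the weight
    attached to the 'true case' version of each observation.
    """
    p = true_prob(x, beta0, beta1)
    if y_obs == 1:
        numerator = sensitivity * p
        denominator = numerator + (1.0 - specificity) * (1.0 - p)
    else:
        numerator = (1.0 - sensitivity) * p
        denominator = numerator + specificity * (1.0 - p)
    return numerator / denominator
```

For example, with a true disease probability of 0.5, sensitivity 0.8, and specificity 0.9, `observed_prob` gives about 0.45, so a naive analysis of the recorded phenotype would be working with a shifted outcome distribution. With perfect sensitivity and specificity the observed and true probabilities coincide.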
Degree: Doctor of Philosophy
Subject: Genetics - Statistical methods; Causation - Mathematical models; Inference - Mathematical models
Dept/Program: Statistics and Actuarial Science
Persistent Identifier: http://hdl.handle.net/10722/353399

 

DC Field | Value | Language
dc.contributor.author | Wang, Anqi | -
dc.contributor.author | 王安琪 | -
dc.date.accessioned | 2025-01-17T09:46:19Z | -
dc.date.available | 2025-01-17T09:46:19Z | -
dc.date.issued | 2024 | -
dc.identifier.citation | Wang, A. [王安琪]. (2024). Statistical methods for causal inference and bias correction in genetic studies. (Thesis). University of Hong Kong, Pokfulam, Hong Kong SAR. | -
dc.identifier.uri | http://hdl.handle.net/10722/353399 | -
dc.language | eng | -
dc.publisher | The University of Hong Kong (Pokfulam, Hong Kong) | -
dc.relation.ispartof | HKU Theses Online (HKUTO) | -
dc.rights | The author retains all proprietary rights, (such as patent rights) and the right to use in future works. | -
dc.rights | This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License. | -
dc.subject.lcsh | Genetics - Statistical methods | -
dc.subject.lcsh | Causation - Mathematical models | -
dc.subject.lcsh | Inference - Mathematical models | -
dc.title | Statistical methods for causal inference and bias correction in genetic studies | -
dc.type | PG_Thesis | -
dc.description.thesisname | Doctor of Philosophy | -
dc.description.thesislevel | Doctoral | -
dc.description.thesisdiscipline | Statistics and Actuarial Science | -
dc.description.nature | published_or_final_version | -
dc.date.hkucongregation | 2025 | -
dc.identifier.mmsid | 991044897476103414 | -
