postgraduate thesis: Statistical methods for causal inference and bias correction in genetic studies
Title | Statistical methods for causal inference and bias correction in genetic studies |
---|---|
Authors | Wang, Anqi (王安琪) |
Issue Date | 2024 |
Publisher | The University of Hong Kong (Pokfulam, Hong Kong) |
Citation | Wang, A. [王安琪]. (2024). Statistical methods for causal inference and bias correction in genetic studies. (Thesis). University of Hong Kong, Pokfulam, Hong Kong SAR. |
Abstract | This thesis explores two topics in genetic studies: causal inference and
bias correction. Understanding the biological mechanisms behind complex
traits is crucial in genetic studies. Mendelian Randomization (MR) and
Transcriptome-wide Association Studies (TWAS) leverage the increasing
availability of GWAS datasets to enhance our understanding of the causal
relationships between exposures and target traits within the framework of
causal inference. Meanwhile, outcome misclassification in Electronic Health
Records (EHR) presents another prevalent challenge in genetic studies,
particularly within the context of bias correction. Such misclassification
makes it difficult to accurately estimate the effects of predictors on the
target trait.
Specifically, MR is a statistical method that exploits genetic variants as
instrumental variables to infer the causal relationships between modifiable
risk factors and the outcomes of interest. Although various two-sample MR
methods based on genome-wide association study summary-level data are widely
used, they can suffer from power loss and/or biased inference when the chosen
genetic variants are in linkage disequilibrium (LD) and also have relatively
large direct effects on the outcome whose distribution might be heavy-tailed,
a phenomenon commonly referred to as idiosyncratic pleiotropy. To resolve
these two issues, we propose a novel Robust Bayesian Mendelian Randomization
(RBMR) model that uses the more robust multivariate generalized
t-distribution to model such direct effects in a probabilistic framework that
also incorporates the LD structure explicitly. The generalized t-distribution
can be represented as a Gaussian scale mixture, so our model parameters can
be estimated by expectation-maximization (EM)-type algorithms. We compute the
standard errors by calibrating the evidence lower bound using the likelihood
ratio test.
Furthermore, TWAS has recently gained popularity as a tool for identifying
causal genes, integrating eQTL data with GWAS results to investigate
underlying biological mechanisms at a tissue-specific level. Although current
TWAS methods have advanced to cover diverse traits, traditional approaches
still face two main challenges: (1) the lack of methods that can guarantee
finite-sample False Discovery Rate (FDR) control in identifying
trait-associated genes; and (2) the requirement for individual-level data,
which is often inaccessible. To address these challenges, we propose a
powerful knockoff inference method termed TWAS-GKF to identify candidate
trait-associated genes with guaranteed finite-sample FDR control. TWAS-GKF
uses the main idea of Ghost-knockoff inference to generate knockoff variables
from summary statistics alone, without individual-level data. In extensive
studies, we demonstrate that TWAS-GKF successfully controls the finite-sample
FDR under a pre-specified FDR level across all settings. We further apply
TWAS-GKF to identify genes in brain cerebellum tissue associated with
schizophrenia (SCZ) and genes in liver tissue related to low-density
lipoprotein cholesterol (LDL-C). The results reveal that the majority of the
identified genes are validated by the Open Targets Validation Platform.
Additionally, disease phenotypes derived from Electronic Health Records
(EHR) are often subject to misclassification. Consequently, naive regression
analysis with a misclassified disease phenotype might produce biased
association results. To address this critical issue, we propose a novel
method to correct bias in association studies with possibly misclassified
EHR-derived phenotypes by explicitly modeling the misclassification
sensitivity and specificity. Our method does not require any external
validation data, which are costly and often unavailable. We formally
establish model identification conditions, and then propose a scalable and
computationally efficient weighted expectation-maximization algorithm for
estimation and inference. Extensive simulation studies show that our proposed
method has good empirical performance. We apply our method to the EHR data
from the Michigan Genomics Initiative for illustration. |
Degree | Doctor of Philosophy |
Subject | Genetics - Statistical methods; Causation - Mathematical models; Inference - Mathematical models |
Dept/Program | Statistics and Actuarial Science |
Persistent Identifier | http://hdl.handle.net/10722/353399 |
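The abstract's claim of finite-sample FDR control rests on a knockoff-style selection step. The sketch below is not the thesis's TWAS-GKF procedure; it only illustrates, under stated assumptions, the generic "knockoff+" selection rule from the knockoff filter literature, which picks the smallest threshold whose estimated false discovery proportion falls at or below the target level. The function name `knockoff_plus_select`, the example statistics, and the target levels are hypothetical and chosen purely for illustration.

```python
def knockoff_plus_select(w, q=0.1):
    """Select feature indices with the generic knockoff+ rule at target FDR q.

    w: list of feature statistics W_j (hypothetical inputs). A large positive
       W_j means the original feature looks stronger than its knockoff copy;
       a negative W_j means the knockoff copy looks stronger.
    """
    # Candidate thresholds are the distinct nonzero magnitudes of the statistics.
    candidates = sorted({abs(x) for x in w if x != 0})
    for t in candidates:
        n_neg = sum(1 for x in w if x <= -t)  # knockoffs that beat originals
        n_pos = sum(1 for x in w if x >= t)   # originals that beat knockoffs
        # Estimated false discovery proportion; the "+1" in the numerator is
        # the conservative correction used by the knockoff+ variant.
        fdp_hat = (1 + n_neg) / max(1, n_pos)
        if fdp_hat <= q:
            return [j for j, x in enumerate(w) if x >= t]
    return []  # no threshold reaches the target level, so select nothing


# Made-up statistics for 12 features (not taken from the thesis).
w_stats = [5.1, 4.3, -0.4, 3.8, 0.2, 2.9, -0.1, 3.3, 0.6, 2.7, 4.8, 3.1]
selected_loose = knockoff_plus_select(w_stats, q=0.25)
selected_strict = knockoff_plus_select(w_stats, q=0.05)
```

With only 12 features the estimated proportion can never drop below 1/12, so a strict target such as 0.05 selects nothing. This illustrates why the rule tends to be conservative when few features are tested.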
DC Field | Value | Language |
---|---|---|
dc.contributor.author | Wang, Anqi | - |
dc.contributor.author | 王安琪 | - |
dc.date.accessioned | 2025-01-17T09:46:19Z | - |
dc.date.available | 2025-01-17T09:46:19Z | - |
dc.date.issued | 2024 | - |
dc.identifier.citation | Wang, A. [王安琪]. (2024). Statistical methods for causal inference and bias correction in genetic studies. (Thesis). University of Hong Kong, Pokfulam, Hong Kong SAR. | - |
dc.identifier.uri | http://hdl.handle.net/10722/353399 | - |
dc.language | eng | - |
dc.publisher | The University of Hong Kong (Pokfulam, Hong Kong) | - |
dc.relation.ispartof | HKU Theses Online (HKUTO) | - |
dc.rights | The author retains all proprietary rights (such as patent rights) and the right to use in future works. | - |
dc.rights | This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License. | - |
dc.subject.lcsh | Genetics - Statistical methods | - |
dc.subject.lcsh | Causation - Mathematical models | - |
dc.subject.lcsh | Inference - Mathematical models | - |
dc.title | Statistical methods for causal inference and bias correction in genetic studies | - |
dc.type | PG_Thesis | - |
dc.description.thesisname | Doctor of Philosophy | - |
dc.description.thesislevel | Doctoral | - |
dc.description.thesisdiscipline | Statistics and Actuarial Science | - |
dc.description.nature | published_or_final_version | - |
dc.date.hkucongregation | 2025 | - |
dc.identifier.mmsid | 991044897476103414 | - |