Functional annotation, prioritization and enrichment analysis of human regulatory variants

Yao, Hongcheng; 姚宏成

Appears in Collections:
- HKU Theses Online
- Biomedical Sciences: Theses

postgraduate thesis: Functional annotation, prioritization and enrichment analysis of human regulatory variants

Title	Functional annotation, prioritization and enrichment analysis of human regulatory variants
Authors	Yao, Hongcheng 姚宏成
Advisors	Sham, PC; Xia, Z
Issue Date	2021
Publisher	The University of Hong Kong (Pokfulam, Hong Kong)
Citation	Yao, H. [姚宏成]. (2021). Functional annotation, prioritization and enrichment analysis of human regulatory variants. (Thesis). University of Hong Kong, Pokfulam, Hong Kong SAR.
Abstract	Regulatory variants are vital for the regulation of gene expression and are involved in disease pathogenesis and trait development. In the past decade, genome-wide association studies (GWASs) and expression quantitative trait loci (eQTL) studies have identified numerous associated variants, a large proportion of which are located in noncoding regions, indicating their potential role as regulatory variants. However, precisely identifying and interpreting noncoding regulatory variants by experimental validation is costly and labor-intensive, which hampers the elucidation of the mechanisms underlying diseases and traits. Fortunately, functional annotations such as histone modification profiles can indicate the existence and potential function of regulatory variants, and recent years have seen a surge of genomic, transcriptomic and epigenomic profiling studies across diverse tissues/cell types. As a result, variant annotation has become a key step in the analysis of regulatory variants. Furthermore, based on the functional annotations, computational methods have been developed to perform in silico prediction and prioritization of regulatory variants, while enrichment analysis is applied to a set of variants to determine which annotations are informative. In this thesis, we developed two computational methods to facilitate the prediction and enrichment analysis of regulatory variants. We first used eQTL data from the Genotype-Tissue Expression (GTEx) project as training data and comprehensively integrated tissue/cell type-specific epigenomic marks and prediction scores from existing tools as predictors to develop a regulatory variant prediction method, cepip2. It was built on gradient tree boosting and consisted of three submodels for different scenarios: a context-dependent model, an organism-level model and an overall model. Critical questions regarding model construction were carefully discussed, and the constructed models were able to predict regulatory potential accurately in a tissue/cell type-specific manner. Systematic comparisons among the submodels and with existing methods were carried out on multiple independent test datasets, and cepip2 demonstrated superior performance in most cases. As an illustration, cepip2 was applied to fine-mapped GWAS summary data for 39 traits/diseases and was shown to be capable of identifying the most relevant tissues/cell types. Random sampling of variants matched on selected properties is commonly used to construct null distributions in enrichment analysis and to generate negative datasets in regulatory variant prediction. However, current tools are inefficient and unable to process large-scale input data. To tackle this problem, novel designs, including a data structure with a corresponding index system and a sampling pipeline with a temporary storage algorithm, were proposed in this thesis to develop a fast annotation-based matched variant sampling tool, vSampler. In careful benchmark tests, vSampler was shown to be much faster than existing tools and robust to massive amounts of input data. Its applications in enrichment analysis and the advantage of its comprehensive matching properties were demonstrated in three usage examples. In conclusion, two novel computational methods were developed and evaluated in this thesis for the functional annotation, prioritization and enrichment analysis of human regulatory variants. We believe these methods will facilitate the precise interpretation of regulatory variants and their role in the development of complex traits.
Degree	Doctor of Philosophy
Subject	Human genetics - Variation; Genomics - Data processing
Dept/Program	Biomedical Sciences
Persistent Identifier	http://hdl.handle.net/10722/301058

DC Field	Value	Language
dc.contributor.advisor	Sham, PC	-
dc.contributor.advisor	Xia, Z	-
dc.contributor.author	Yao, Hongcheng	-
dc.contributor.author	姚宏成	-
dc.date.accessioned	2021-07-16T14:38:44Z	-
dc.date.available	2021-07-16T14:38:44Z	-
dc.date.issued	2021	-
dc.identifier.citation	Yao, H. [姚宏成]. (2021). Functional annotation, prioritization and enrichment analysis of human regulatory variants. (Thesis). University of Hong Kong, Pokfulam, Hong Kong SAR.	-
dc.identifier.uri	http://hdl.handle.net/10722/301058	-
dc.language	eng	-
dc.publisher	The University of Hong Kong (Pokfulam, Hong Kong)	-
dc.relation.ispartof	HKU Theses Online (HKUTO)	-
dc.rights	The author retains all proprietary rights (such as patent rights) and the right to use in future works.	-
dc.rights	This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.	-
dc.subject.lcsh	Human genetics - Variation	-
dc.subject.lcsh	Genomics - Data processing	-
dc.title	Functional annotation, prioritization and enrichment analysis of human regulatory variants	-
dc.type	PG_Thesis	-
dc.description.thesisname	Doctor of Philosophy	-
dc.description.thesislevel	Doctoral	-
dc.description.thesisdiscipline	Biomedical Sciences	-
dc.description.nature	published_or_final_version	-
dc.date.hkucongregation	2021	-
dc.identifier.mmsid	991044390191203414	-
