Big-data computational methods for studying molecular evolution : from accurate genome assembly to phylogenetic analysis and data integration

Yu, Guangchuang; 余光创

DOI: 10.5353/th_991044040577403414

Appears in Collections:
- Public Health: Theses
- HKU Theses Online

postgraduate thesis: Big-data computational methods for studying molecular evolution : from accurate genome assembly to phylogenetic analysis and data integration

Title	Big-data computational methods for studying molecular evolution : from accurate genome assembly to phylogenetic analysis and data integration
Authors	Yu, Guangchuang 余光创
Issue Date	2017
Publisher	The University of Hong Kong (Pokfulam, Hong Kong)
Citation	Yu, G. [余光创]. (2017). Big-data computational methods for studying molecular evolution : from accurate genome assembly to phylogenetic analysis and data integration. (Thesis). University of Hong Kong, Pokfulam, Hong Kong SAR.
Abstract	With the advance of next-generation sequencing (NGS) technologies, the volume of genetic data produced for research is growing rapidly. Molecular evolution, the discipline that studies genetic changes over time and among organisms, has expanded alongside the rise of such 'big data'. In infectious disease research, studying the molecular evolution of pathogens plays an important role in investigating disease origin, transmission and evolution during outbreaks, because the infection and transmission processes leave footprints in the pathogens' genomes. While NGS technologies have enabled fast, large-scale acquisition of pathogen genome sequences, such big data present many computational challenges for performing evolutionary analyses with high efficiency and precision. This thesis aims to provide new computational methods and tools to address some of these challenges. The phylogenetic tree (or phylogeny) is a fundamental framework for many methods that analyze different aspects and statistics of molecular evolution, such as molecular clock and selective pressure inference. Comparing the results of different analyses with one another, and with phenotype data on the studied organisms obtained from experiments or other investigations, is likely to generate a more comprehensive understanding of the organisms and new hypotheses about genotype-phenotype associations. A programmable platform for such data integration and analysis is needed for large data sets. Here, an R package, treeio, was developed to robustly import phylogeny-related data from various analysis programs and sources. Another R package, ggtree, was developed to integrate these imported data for high-level analysis and efficient annotation of large, complex phylogenetic trees.
As genetic sequences accumulate ever faster with NGS technologies, rebuilding large phylogenetic trees from scratch to include new sequences has become inefficient, because the evolutionary relationships among pre-existing sequences are recalculated repeatedly. TIPars was proposed to efficiently insert a new sequence into an existing tree using the maximum parsimony criterion with pre-computed ancestral sequences. Simulation studies showed that TIPars generally had higher accuracy and speed than existing maximum likelihood methods such as pplacer and EPA. Most popular NGS technologies generate short sequencing reads that must be assembled into complete or longer biological sequences for downstream molecular evolution analyses. Conventional assembly methods are limited in handling reads from samples containing multiple strains of an organism, which are commonly observed in the surveillance of pathogens such as avian influenza A virus (AIV). This issue was addressed by a novel method (denoted PAM) that uses phylogeny to guide genome assembly. PAM was shown to be capable of distinguishing short sequencing reads from closely related pathogens, such as co-infecting AIVs, and hence of assembling genome sequences with improved accuracy and coverage compared with existing methods. This thesis developed several computational methods and tools to address issues in studying molecular evolution in the big-data era, including genome assembly with strain-level resolution, updating a large phylogeny with new sequences, and data integration, analysis and annotation on large phylogenies. It is anticipated that these methods will facilitate the 'genomic surveillance' of viral pathogens, which involves joint analyses of large amounts of genetic sequence data with the related epidemiological and virological data.
Degree	Doctor of Philosophy
Subject	Molecular evolution; Genomics; Big data
Dept/Program	Public Health
Persistent Identifier	http://hdl.handle.net/10722/261546
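
The abstract describes inserting a new sequence into an existing tree under a maximum parsimony criterion, using pre-computed ancestral sequences. The idea can be illustrated with a minimal Python sketch. This is not the TIPars implementation: the function names, the toy tree and the toy sequences are hypothetical, sequences are assumed to be aligned and of equal length, and ties are broken simply by taking the first branch found. For a branch with fixed sequences at both ends, attaching the query adds one substitution at every site where the query matches neither end, so the sketch picks the branch that minimizes that count.

```python
from typing import Dict, Tuple

# Hypothetical toy data: sequences for tips (A, B, C) and for internal nodes
# (root, n1); the internal sequences stand in for pre-computed ancestral states.
NODE_SEQS: Dict[str, str] = {
    "root": "ACGT",
    "n1": "ACGA",
    "A": "ACGA",
    "B": "TCGA",
    "C": "ACCT",
}

# Tree topology as child -> parent; each entry is one branch.
PARENT_OF: Dict[str, str] = {"n1": "root", "A": "n1", "B": "n1", "C": "root"}


def added_substitutions(query: str, parent: str, child: str) -> int:
    """Count sites where the query matches neither end of the branch.

    With the parent and child sequences held fixed, each such site costs
    one extra substitution when the query is attached to this branch.
    """
    return sum(1 for q, p, c in zip(query, parent, child) if q != p and q != c)


def best_branch(
    query: str, node_seqs: Dict[str, str], parent_of: Dict[str, str]
) -> Tuple[Tuple[str, str], int]:
    """Return the (parent, child) branch with the fewest added substitutions."""
    best = None
    for child, parent in parent_of.items():
        cost = added_substitutions(query, node_seqs[parent], node_seqs[child])
        if best is None or cost < best[1]:
            best = ((parent, child), cost)
    if best is None:
        raise ValueError("tree has no branches")
    return best


# Place a query that is one substitution away from tip B.
branch, cost = best_branch("TCGC", NODE_SEQS, PARENT_OF)
print(branch, cost)
```

Because the existing tree and its ancestral sequences are left untouched, each placement only requires one pass over the branches, which is the efficiency argument the abstract makes against rebuilding the tree from scratch.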

DC Field	Value	Language
dc.contributor.author	Yu, Guangchuang	-
dc.contributor.author	余光创	-
dc.date.accessioned	2018-09-20T06:44:12Z	-
dc.date.available	2018-09-20T06:44:12Z	-
dc.date.issued	2017	-
dc.identifier.citation	Yu, G. [余光创]. (2017). Big-data computational methods for studying molecular evolution : from accurate genome assembly to phylogenetic analysis and data integration. (Thesis). University of Hong Kong, Pokfulam, Hong Kong SAR.	-
dc.identifier.uri	http://hdl.handle.net/10722/261546	-
dc.language	eng	-
dc.publisher	The University of Hong Kong (Pokfulam, Hong Kong)	-
dc.relation.ispartof	HKU Theses Online (HKUTO)	-
dc.rights	The author retains all proprietary rights, (such as patent rights) and the right to use in future works.	-
dc.rights	This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.	-
dc.subject.lcsh	Molecular evolution	-
dc.subject.lcsh	Genomics	-
dc.subject.lcsh	Big data	-
dc.title	Big-data computational methods for studying molecular evolution : from accurate genome assembly to phylogenetic analysis and data integration	-
dc.type	PG_Thesis	-
dc.description.thesisname	Doctor of Philosophy	-
dc.description.thesislevel	Doctoral	-
dc.description.thesisdiscipline	Public Health	-
dc.description.nature	published_or_final_version	-
dc.identifier.doi	10.5353/th_991044040577403414	-
dc.date.hkucongregation	2017	-
dc.identifier.mmsid	991044040577403414	-
