Postgraduate thesis: Memory- and time-efficient solutions for large-scale metagenomic sequence analysis

Title: Memory- and time-efficient solutions for large-scale metagenomic sequence analysis
Author: Li, Dinghua [李定华]
Advisor: Lam, TW
Issue Date: 2017
Publisher: The University of Hong Kong (Pokfulam, Hong Kong)
Citation: Li, D. R. [李定华]. (2017). Memory- and time-efficient solutions for large-scale metagenomic sequence analysis. (Thesis). University of Hong Kong, Pokfulam, Hong Kong SAR.
Abstract: Metagenomics, the study of genomic material obtained directly from uncultured environments, has benefited greatly from recent advances in next-generation sequencing (NGS) technologies. These technologies can generate millions to billions of short DNA segments (known as reads) from one or multiple environmental samples in a few days. The large volume and high complexity of metagenomic data pose new computational challenges for analyzing them efficiently and accurately. This thesis presents three software solutions for analyzing large-scale metagenomic NGS data in a memory- and time-efficient manner. All three were tested on a wide range of metagenomic data to demonstrate their flexibility and advantages over existing methods.

The first tool is MEGAHIT, a de novo assembler for NGS metagenomic sequences. It is the first tool to exploit the succinct de Bruijn graph, a compressed data structure, to achieve a low memory footprint, while simultaneously achieving high speed through a carefully parallelized design for manipulating that structure. Before MEGAHIT, the Great Prairie Soil Metagenome datasets (52 Gbp to 597 Gbp) could only be assembled after preprocessing such as partitioning and digital normalization. MEGAHIT is the first metagenome assembler that can handle them without preprocessing and, more importantly, it delivers higher-quality results in terms of assembly completeness and contiguity.

The second tool is MegaGTA, a gene-targeted metagenomic assembler. The idea is to use existing gene information to improve the quality of metagenome assembly, especially for high-complexity datasets. MegaGTA improves on the pioneering tool Xander in three respects to fully demonstrate the power of gene-targeted assembly. First, it employs iterative de Bruijn graphs to achieve high sensitivity and accuracy simultaneously. Second, it penalizes error-prone nodes in the de Bruijn graph to reduce assembly errors. Third, it replaces the Bloom filters used in Xander, which represent the graph only approximately, with exact succinct de Bruijn graphs. MegaGTA outperforms Xander on both mock and real metagenomic datasets and is much faster. It can assemble large soil metagenome datasets and produces longer and more numerous gene sequences than MEGAHIT.

Lastly, I present MegaPath, a bioinformatics pipeline for metagenomic short-read classification. Unlike most metagenomic classifiers, which rely on exact matches and trade sensitivity for speed, MegaPath is a more sensitive solution powered by a new NGS short-read aligner, SOAP-M. This aligner adopts a refined maximum-mappable-prefix seeding strategy based on the FM-index, together with a SIMD-enabled implementation of Smith-Waterman alignment. MegaPath shows higher sensitivity than popular classifiers such as Kraken and Centrifuge on diverse datasets, especially when the microorganisms in the samples share low identity with the reference genomes. MegaPath also supports a more refined classification at the protein level, which enables it to identify highly mutated viral species. Even at the protein level, MegaPath detects bacteria and viruses from clinical samples, which typically contain tens of millions of short reads, in two to three hours on a single server. MegaPath's high sensitivity and speed demonstrate its practicality for pathogen detection in real clinical cases.
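The succinct de Bruijn graph used by MEGAHIT and MegaGTA is a compressed, heavily engineered structure; the sketch below illustrates only the plain de Bruijn graph idea it builds on: reads are decomposed into k-mers, (k-1)-mer nodes are linked by overlap edges, and unambiguous paths are walked to recover longer contigs. All function names here are illustrative, not part of any of the tools' actual APIs.

```python
from collections import defaultdict

def build_de_bruijn(reads, k):
    """Map each (k-1)-mer node to the set of successor nodes observed in the reads."""
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])  # edge: prefix node -> suffix node
    return graph

def walk_simple_path(graph, start):
    """Extend a contig from `start` while each node has exactly one successor."""
    contig, node, seen = start, start, {start}
    while len(graph[node]) == 1:
        (nxt,) = graph[node]
        if nxt in seen:  # stop if the path loops back on itself
            break
        contig += nxt[-1]  # each step appends one new base
        node = nxt
        seen.add(nxt)
    return contig

reads = ["ACGTC", "CGTCA", "GTCAG"]
g = build_de_bruijn(reads, k=3)
print(walk_simple_path(g, "AC"))  # the overlapping reads collapse into ACGTCAG
```

A real assembler must additionally handle sequencing errors, branches, and reverse complements; MEGAHIT's contribution is doing this at terabase scale by storing the graph succinctly instead of in hash tables like the one above.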
Degree: Doctor of Philosophy
Subjects: Metagenomics; Bioinformatics
Dept/Program: Computer Science
Persistent Identifier: http://hdl.handle.net/10722/249831

 

DC Field: Value
dc.contributor.advisor: Lam, TW
dc.contributor.author: Li, Dinghua
dc.contributor.author: 李定华
dc.date.accessioned: 2017-12-19T09:27:27Z
dc.date.available: 2017-12-19T09:27:27Z
dc.date.issued: 2017
dc.identifier.citation: Li, D. R. [李定华]. (2017). Memory- and time-efficient solutions for large-scale metagenomic sequence analysis. (Thesis). University of Hong Kong, Pokfulam, Hong Kong SAR.
dc.identifier.uri: http://hdl.handle.net/10722/249831
dc.description.abstract: (as given in the Abstract above)
dc.language: eng
dc.publisher: The University of Hong Kong (Pokfulam, Hong Kong)
dc.relation.ispartof: HKU Theses Online (HKUTO)
dc.rights: The author retains all proprietary rights (such as patent rights) and the right to use in future works.
dc.rights: This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.
dc.subject.lcsh: Metagenomics
dc.subject.lcsh: Bioinformatics
dc.title: Memory- and time-efficient solutions for large-scale metagenomic sequence analysis
dc.type: PG_Thesis
dc.description.thesisname: Doctor of Philosophy
dc.description.thesislevel: Doctoral
dc.description.thesisdiscipline: Computer Science
dc.description.nature: published_or_final_version
dc.identifier.doi: 10.5353/th_991043976596703414
dc.date.hkucongregation: 2017
dc.identifier.mmsid: 991043976596703414
