Binning and annotation for metagenomic next-generation sequencing reads

Wang, Yi; 王毅

DOI: 10.5353/th_b5351027

Appears in Collections:
- Computer Science: Theses
- HKU Theses Online

postgraduate thesis: Binning and annotation for metagenomic next-generation sequencing reads

Title	Binning and annotation for metagenomic next-generation sequencing reads
Authors	Wang, Yi 王毅
Advisors	Chin, FYL
Issue Date	2014
Publisher	The University of Hong Kong (Pokfulam, Hong Kong)
Citation	Wang, Y. [王毅]. (2014). Binning and annotation for metagenomic next-generation sequencing reads. (Thesis). University of Hong Kong, Pokfulam, Hong Kong SAR. Retrieved from http://dx.doi.org/10.5353/th_b5351027
Abstract	Next-generation sequencing technology makes it possible to obtain a vast number of short reads from metagenomic samples, in which the reads from different species are mixed together. Metagenomic binning was therefore introduced to cluster reads from the same or closely related species, and metagenomic annotation was introduced to predict the taxonomic information of each read. Both binning and annotation are critical steps for downstream analysis. This thesis discusses the difficulties of these two computational problems and proposes two algorithmic methods, MetaCluster 5.0 and MetaAnnotator, as solutions. There are six major challenges in metagenomic binning: (1) the lack of reference genomes; (2) uneven abundance ratios; (3) short read lengths; (4) a large number of species; (5) the existence of species with extremely low abundance; and (6) recovering low-abundance species. To solve these problems, I propose a two-round binning method, MetaCluster 5.0. The improvement achieved by MetaCluster 5.0 is based on three major observations. First, the frequency distributions of short q-mers (length-q substrings of the sequence, with q = 4 or 5) in sufficiently long fragments sampled from the same genome are more similar than those in fragments sampled from different genomes. Second, sufficiently long w-mers (length-w substrings of the sequence, with w ≈ 30) are usually unique to each individual genome. Third, the frequencies of k-mers (length-k substrings of the sequence, with k ≈ 16) in the reads of a species are usually proportional to that species' abundance.
The metagenomic annotation methods in the literature often suffer from five major drawbacks: (1) many reads cannot be annotated; (2) annotation is less precise for reads and more often incorrect for contigs; (3) novel clades with few reference genomes are handled poorly; (4) performance is affected by the variable genome sequence similarity between different clades; and (5) time complexity is high. In this thesis, a novel tool, MetaAnnotator, is proposed to tackle these problems. MetaAnnotator makes four major contributions. First, instead of annotating reads or contigs independently, it annotates a cluster of reads or contigs as a whole. Second, it integrates multiple reference databases. Third, for each individual clade, it applies quadratic discriminant analysis to capture the similarities between reference sequences in the clade. Fourth, instead of using alignment tools, MetaAnnotator performs annotation by exact k-mer matching, which is more efficient. Experiments on both simulated and real datasets show that MetaCluster 5.0 and MetaAnnotator outperform existing tools, with higher accuracy and lower time and space costs.
Degree	Doctor of Philosophy
Subject	Nucleotide sequence - Data processing
Dept/Program	Computer Science
Persistent Identifier	http://hdl.handle.net/10722/208040
HKU Library Item ID	b5351027
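
The first observation in the abstract (fragments from the same genome have more similar short q-mer frequency distributions than fragments from different genomes) can be illustrated with a small sketch. This is not MetaCluster 5.0 itself; the function names, the L1 distance, and the toy genomes generated from fixed base compositions are assumptions made purely for illustration.

```python
import random
from collections import Counter
from itertools import product


def qmer_profile(seq, q=4):
    """Return the normalized frequency vector over all 4**q DNA q-mers.

    q-mers containing characters other than A, C, G, T are ignored.
    """
    seq = seq.upper()
    counts = Counter(seq[i:i + q] for i in range(len(seq) - q + 1))
    alphabet = ["".join(p) for p in product("ACGT", repeat=q)]
    total = sum(counts[m] for m in alphabet)
    if total == 0:
        return [0.0] * len(alphabet)
    return [counts[m] / total for m in alphabet]


def l1_distance(p, r):
    """Sum of absolute differences between two frequency vectors."""
    return sum(abs(a - b) for a, b in zip(p, r))


def random_genome(length, weights, rng):
    """Toy genome: independent bases drawn with the given A, C, G, T weights.

    Real genomes carry richer composition signatures; this is only a stand-in.
    """
    return "".join(rng.choices("ACGT", weights=weights, k=length))


if __name__ == "__main__":
    rng = random.Random(0)
    at_rich = random_genome(20000, (0.35, 0.15, 0.15, 0.35), rng)
    gc_rich = random_genome(20000, (0.15, 0.35, 0.35, 0.15), rng)

    frag_a1 = qmer_profile(at_rich[:5000])
    frag_a2 = qmer_profile(at_rich[10000:15000])
    frag_b1 = qmer_profile(gc_rich[:5000])

    print("same genome:     ", l1_distance(frag_a1, frag_a2))
    print("different genome:", l1_distance(frag_a1, frag_b1))
```

In this toy setting the distance between two fragments of the same genome should come out clearly smaller than the distance between fragments of different genomes, which is the property a composition-based binning step relies on.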

DC Field	Value	Language
dc.contributor.advisor	Chin, FYL	-
dc.contributor.author	Wang, Yi	-
dc.contributor.author	王毅	-
dc.date.accessioned	2015-02-06T14:19:36Z	-
dc.date.available	2015-02-06T14:19:36Z	-
dc.date.issued	2014	-
dc.identifier.citation	Wang, Y. [王毅]. (2014). Binning and annotation for metagenomic next-generation sequencing reads. (Thesis). University of Hong Kong, Pokfulam, Hong Kong SAR. Retrieved from http://dx.doi.org/10.5353/th_b5351027	-
dc.identifier.uri	http://hdl.handle.net/10722/208040	-
dc.description.abstract	Next-generation sequencing technology makes it possible to obtain a vast number of short reads from metagenomic samples, in which the reads from different species are mixed together. Metagenomic binning was therefore introduced to cluster reads from the same or closely related species, and metagenomic annotation was introduced to predict the taxonomic information of each read. Both binning and annotation are critical steps for downstream analysis. This thesis discusses the difficulties of these two computational problems and proposes two algorithmic methods, MetaCluster 5.0 and MetaAnnotator, as solutions. There are six major challenges in metagenomic binning: (1) the lack of reference genomes; (2) uneven abundance ratios; (3) short read lengths; (4) a large number of species; (5) the existence of species with extremely low abundance; and (6) recovering low-abundance species. To solve these problems, I propose a two-round binning method, MetaCluster 5.0. The improvement achieved by MetaCluster 5.0 is based on three major observations. First, the frequency distributions of short q-mers (length-q substrings of the sequence, with q = 4 or 5) in sufficiently long fragments sampled from the same genome are more similar than those in fragments sampled from different genomes. Second, sufficiently long w-mers (length-w substrings of the sequence, with w ≈ 30) are usually unique to each individual genome. Third, the frequencies of k-mers (length-k substrings of the sequence, with k ≈ 16) in the reads of a species are usually proportional to that species' abundance.
The metagenomic annotation methods in the literature often suffer from five major drawbacks: (1) many reads cannot be annotated; (2) annotation is less precise for reads and more often incorrect for contigs; (3) novel clades with few reference genomes are handled poorly; (4) performance is affected by the variable genome sequence similarity between different clades; and (5) time complexity is high. In this thesis, a novel tool, MetaAnnotator, is proposed to tackle these problems. MetaAnnotator makes four major contributions. First, instead of annotating reads or contigs independently, it annotates a cluster of reads or contigs as a whole. Second, it integrates multiple reference databases. Third, for each individual clade, it applies quadratic discriminant analysis to capture the similarities between reference sequences in the clade. Fourth, instead of using alignment tools, MetaAnnotator performs annotation by exact k-mer matching, which is more efficient. Experiments on both simulated and real datasets show that MetaCluster 5.0 and MetaAnnotator outperform existing tools, with higher accuracy and lower time and space costs.	-
dc.language	eng	-
dc.publisher	The University of Hong Kong (Pokfulam, Hong Kong)	-
dc.relation.ispartof	HKU Theses Online (HKUTO)	-
dc.rights	The author retains all proprietary rights, (such as patent rights) and the right to use in future works.	-
dc.rights	This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.	-
dc.subject.lcsh	Nucleotide sequence - Data processing	-
dc.title	Binning and annotation for metagenomic next-generation sequencing reads	-
dc.type	PG_Thesis	-
dc.identifier.hkul	b5351027	-
dc.description.thesisname	Doctor of Philosophy	-
dc.description.thesislevel	Doctoral	-
dc.description.thesisdiscipline	Computer Science	-
dc.description.nature	published_or_final_version	-
dc.identifier.doi	10.5353/th_b5351027	-
dc.identifier.mmsid	991040122969703414	-
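
Two of the ideas the abstract attributes to MetaAnnotator, annotating a cluster of reads as a whole and using exact k-mer matches instead of alignment, can be sketched as follows. This is a minimal illustration under stated assumptions, not the thesis's implementation: the function names, the simple vote counting, and the tiny reference sequences are hypothetical, and the quadratic discriminant analysis step and reverse-complement handling are omitted.

```python
from collections import Counter, defaultdict


def kmers(seq, k):
    """Yield every length-k substring of seq (upper-cased)."""
    seq = seq.upper()
    return (seq[i:i + k] for i in range(len(seq) - k + 1))


def build_index(references, k):
    """Map each k-mer to the set of taxon labels whose reference contains it.

    `references` is a dict of {taxon_label: reference_sequence}.
    """
    index = defaultdict(set)
    for taxon, seq in references.items():
        for km in kmers(seq, k):
            index[km].add(taxon)
    return index


def annotate_cluster(reads, index, k):
    """Assign one label to a whole cluster by pooling exact k-mer hits.

    Every k-mer of every read votes for each taxon it matches exactly;
    the taxon with the most votes wins. Returns None if nothing matches.
    """
    votes = Counter()
    for read in reads:
        for km in kmers(read, k):
            for taxon in index.get(km, ()):
                votes[taxon] += 1
    if not votes:
        return None
    return votes.most_common(1)[0][0]


if __name__ == "__main__":
    # Hypothetical toy references; real databases hold full genomes.
    references = {
        "taxonA": "ACGTACGTTTGACCA",
        "taxonB": "GGGCCCAAATTTGGC",
    }
    index = build_index(references, k=5)
    cluster = ["ACGTACG", "TTGACCA"]
    print(annotate_cluster(cluster, index, k=5))
```

Pooling votes over the cluster means a read with no matching k-mers of its own can still receive the label supported by the other reads in its cluster, which is one plausible reading of why cluster-level annotation covers more reads than per-read annotation.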
