MegaGTA: A sensitive and accurate metagenomic gene-targeted assembler using iterative de Bruijn graphs

Li, D; Huang, Y; Leung, HCM; Luo, R; Ting, HF; Lam, TW


Publisher Website: 10.1186/s12859-017-1825-3
Scopus: eid_2-s2.0-85031494680
WOS: WOS:000413649500008

Appears in Collections:
- Computer Science: Conference papers

Conference Paper: MegaGTA: A sensitive and accurate metagenomic gene-targeted assembler using iterative de Bruijn graphs

Title	MegaGTA: A sensitive and accurate metagenomic gene-targeted assembler using iterative de Bruijn graphs
Authors	Li, D; Huang, Y; Leung, HCM; Luo, R; Ting, HF; Lam, TW
Keywords	De Bruijn graph; Targeted gene; Metagenomics; Assembly
Issue Date	2017
Citation	12th International Symposium on Bioinformatics Research and Applications (ISBRA 2016), Minsk, Belarus, 5-8 June 2016. In BMC Bioinformatics, 2017, v. 18 n. S-12, p. 67-75. DOI: http://dx.doi.org/10.1186/s12859-017-1825-3
Abstract	Background: The recent release of the gene-targeted metagenomic assembler Xander demonstrated that using a trained Hidden Markov Model (HMM) to guide the traversal of a de Bruijn graph gives a clear advantage over other assembly methods. As a pilot study, however, Xander leaves considerable room for improvement. Apart from its slow speed, Xander uses only one k-mer size for graph construction, and any single choice of k compromises either sensitivity or accuracy. Xander uses a Bloom-filter representation of the de Bruijn graph to achieve a lower memory footprint; Bloom filters introduce false positives, and it is not clear how these affect assembly quality. Xander also does not keep track of the multiplicity of k-mers, which would have been an effective way to differentiate erroneous k-mers from correct ones. Results: In this paper, we present a new gene-targeted assembler, MegaGTA, which attempts to improve on Xander in several respects. Quality-wise, it uses iterative de Bruijn graphs to take full advantage of multiple k-mer sizes and make the best of both sensitivity and accuracy. Computation-wise, it employs succinct de Bruijn graphs (SdBG) to achieve a low memory footprint and high speed (the latter benefits from a highly efficient parallel algorithm for constructing SdBGs). Unlike a Bloom filter, an SdBG is an exact representation of a de Bruijn graph. This enables MegaGTA to avoid false-positive contigs and to easily incorporate the multiplicity of k-mers for building a better HMM. We compared MegaGTA and Xander on an HMP-defined mock metagenomic dataset and showed that MegaGTA excelled in both sensitivity and accuracy. On a large rhizosphere soil metagenomic sample (327 Gbp), MegaGTA produced 9.7-19.3% more contigs than Xander, and these contigs were assigned to 10-25% more gene references. In our experiments, MegaGTA, depending on the number of k-mers used, is two to ten times faster than Xander.
Conclusion: MegaGTA improves on the algorithm of Xander and achieves higher sensitivity, accuracy and speed. Moreover, it is capable of assembling gene sequences from ultra-large metagenomic datasets. Its source code is freely available at https://github.com/HKU-BAL/megagta.
Persistent Identifier	http://hdl.handle.net/10722/251248
ISI Accession Number ID	WOS:000413649500008
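
The abstract's points about k-mer multiplicity and the choice of k can be illustrated with a toy sketch. The code below is not MegaGTA's implementation (which uses succinct de Bruijn graphs and an HMM-guided traversal); it uses plain Python sets and counters, and the reads, function names, and the multiplicity threshold of 2 are all assumptions chosen for illustration. It shows that exact k-mer counts let low-multiplicity (likely erroneous) k-mers be dropped, and that a short repeat causes a branch at small k that disappears at larger k.

```python
from collections import Counter


def count_kmers(reads, k):
    """Return the multiplicity of every k-mer across all reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def solid_kmers(counts, min_multiplicity):
    """Keep only k-mers seen at least min_multiplicity times."""
    return {kmer for kmer, c in counts.items() if c >= min_multiplicity}


def branching_nodes(kmers):
    """Return k-mers with more than one successor in the exact
    de Bruijn graph (successor = k-mer overlapping by k-1 bases)."""
    branches = set()
    for kmer in kmers:
        successors = [kmer[1:] + base for base in "ACGT"
                      if kmer[1:] + base in kmers]
        if len(successors) > 1:
            branches.add(kmer)
    return branches


if __name__ == "__main__":
    # Hypothetical toy data: three error-free copies of one read and a
    # fourth copy with a single substitution (A -> T at position 4).
    reads = ["ACGTACGA", "ACGTACGA", "ACGTACGA", "ACGTTCGA"]

    for k in (3, 4, 5):
        counts = count_kmers(reads, k)
        solid = solid_kmers(counts, min_multiplicity=2)
        print(k, sorted(solid), sorted(branching_nodes(solid)))
```

In this toy input the repeat "ACG" creates a branch in the k = 3 graph that is absent at k = 5, while the k-mers introduced by the substitution occur only once and are removed by the multiplicity filter. A larger k also requires longer exact overlaps between reads, which is the sensitivity cost that motivates combining several k values.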

DC Field	Value	Language
dc.contributor.author	Li, D	-
dc.contributor.author	Huang, Y	-
dc.contributor.author	Leung, HCM	-
dc.contributor.author	Luo, R	-
dc.contributor.author	Ting, HF	-
dc.contributor.author	Lam, TW	-
dc.date.accessioned	2018-02-01T01:55:01Z	-
dc.date.available	2018-02-01T01:55:01Z	-
dc.date.issued	2017	-
dc.identifier.citation	12th International Symposium on Bioinformatics Research and Applications (ISBRA 2016) Minsk, Belarus. 5-8 June 2016. In BMC Bioinformatics, 2017, v. 18 n. S-12, p. 67-75	-
dc.identifier.uri	http://hdl.handle.net/10722/251248	-
dc.language	eng	-
dc.relation.ispartof	BMC Bioinformatics	-
dc.rights	This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.	-
dc.subject	De Bruijn graph	-
dc.subject	Targeted gene	-
dc.subject	Metagenomics	-
dc.subject	Assembly	-
dc.title	MegaGTA: A sensitive and accurate metagenomic gene-targeted assembler using iterative de Bruijn graphs	-
dc.type	Conference_Paper	-
dc.description.nature	published_or_final_version	-
dc.identifier.doi	10.1186/s12859-017-1825-3	-
dc.identifier.scopus	eid_2-s2.0-85031494680	-
dc.identifier.hkuros	290698	-
dc.identifier.volume	18	-
dc.identifier.issue	Suppl. 12	-
dc.identifier.spage	67	-
dc.identifier.epage	75	-
dc.identifier.eissn	1471-2105	-
dc.identifier.isi	WOS:000413649500008	-
dc.identifier.issnl	1471-2105	-
