MEGAHIT v1.0: A fast and scalable metagenome assembler driven by advanced methodologies and community practices

LI, D; Luo, R; Liu, CM; Leung, HCM; Ting, HF; Sadakane, KUNIHIKO; Yamashita, HIROSHI; Lam, TW

Links for fulltext

(May Require Subscription)

Publisher Website: 10.1016/j.ymeth.2016.02.020
Scopus: eid_2-s2.0-84962227778
PMID: 27012178
WOS: WOS:000377316200002

Supplementary

Citations:
- Scopus: 0
- Web of Science: 0
- PubMed Central: 0
Appears in Collections:
- Computer Science: Journal/Magazine Articles

Article: MEGAHIT v1.0: A fast and scalable metagenome assembler driven by advanced methodologies and community practices

Title	MEGAHIT v1.0: A fast and scalable metagenome assembler driven by advanced methodologies and community practices
Authors	LI, D; Luo, R; Liu, CM; Leung, HCM; Ting, HF; Sadakane, KUNIHIKO; Yamashita, HIROSHI; Lam, TW
Keywords	Metagenome assembly; Parallel computing; Succinct data structure
Issue Date	2016
Publisher	Academic Press. The Journal's web site is located at http://www.elsevier.com/locate/ymeth
Citation	Methods, 2016, v. 102, p. 3-11. DOI: http://dx.doi.org/10.1016/j.ymeth.2016.02.020
Abstract	The study of metagenomics has been much benefited from low-cost and high-throughput sequencing technologies, yet the tremendous amount of data generated make analysis like de novo assembly to consume too much computational resources. In late 2014 we released MEGAHIT v0.1 (together with a brief note of Li et al. (2015) [1]), which is the first NGS metagenome assembler that can assemble genome sequences from metagenomic datasets of hundreds of Giga base-pairs (bp) in a time- and memory-efficient manner on a single server. The core of MEGAHIT is an efficient parallel algorithm for constructing succinct de Bruijn Graphs (SdBG), implemented on a graphical processing unit (GPU). The software has been well received by the assembly community, and there is interest in how to adapt the algorithms to integrate popular assembly practices so as to improve the assembly quality, as well as how to speed up the software using better CPU-based algorithms (instead of GPU). In this paper we first describe the details of the core algorithms in MEGAHIT v0.1, and then we show the new modules to upgrade MEGAHIT to version v1.0, which gives better assembly quality, runs faster and uses less memory. For the Iowa Prairie Soil dataset (252 Gbp after quality trimming), the assembly quality of MEGAHIT v1.0, when compared with v0.1, has a significant improvement, namely, 36% increase in assembly size and 23% in N50. More interestingly, MEGAHIT v1.0 is no slower than before (even running with the extra modules). This is primarily due to a new CPU-based algorithm for SdBG construction that is faster and requires less memory. Using CPU only, MEGAHIT v1.0 can assemble the Iowa Prairie Soil sample in about 43 h, reducing the running time of v0.1 by at least 25% and memory usage by up to 50%. MEGAHIT v1.0, exhibiting a smaller memory footprint, can process even larger datasets. The Kansas Prairie Soil sample (484 Gbp), the largest publicly available dataset, can now be assembled using no more than 500 GB of memory in 7.5 days. The assemblies of these datasets (and other large metagenomic datasets), as well as the software, are available at the website https://hku-bal.github.io/megabox.
Persistent Identifier	http://hdl.handle.net/10722/230205
ISSN	1046-2023 (2023 Impact Factor: 4.2; 2023 SCImago Journal Rankings: 1.162)
ISI Accession Number ID	WOS:000377316200002

DC Field	Value	Language
dc.contributor.author	LI, D	-
dc.contributor.author	Luo, R	-
dc.contributor.author	Liu, CM	-
dc.contributor.author	Leung, HCM	-
dc.contributor.author	Ting, HF	-
dc.contributor.author	Sadakane, KUNIHIKO	-
dc.contributor.author	Yamashita, HIROSHI	-
dc.contributor.author	Lam, TW	-
dc.date.accessioned	2016-08-23T14:15:43Z	-
dc.date.available	2016-08-23T14:15:43Z	-
dc.date.issued	2016	-
dc.identifier.citation	Methods, 2016, v. 102, p. 3-11	-
dc.identifier.issn	1046-2023	-
dc.identifier.uri	http://hdl.handle.net/10722/230205	-
dc.description.abstract	The study of metagenomics has been much benefited from low-cost and high-throughput sequencing technologies, yet the tremendous amount of data generated make analysis like de novo assembly to consume too much computational resources. In late 2014 we released MEGAHIT v0.1 (together with a brief note of Li et al. (2015) [1]), which is the first NGS metagenome assembler that can assemble genome sequences from metagenomic datasets of hundreds of Giga base-pairs (bp) in a time- and memory-efficient manner on a single server. The core of MEGAHIT is an efficient parallel algorithm for constructing succinct de Bruijn Graphs (SdBG), implemented on a graphical processing unit (GPU). The software has been well received by the assembly community, and there is interest in how to adapt the algorithms to integrate popular assembly practices so as to improve the assembly quality, as well as how to speed up the software using better CPU-based algorithms (instead of GPU). In this paper we first describe the details of the core algorithms in MEGAHIT v0.1, and then we show the new modules to upgrade MEGAHIT to version v1.0, which gives better assembly quality, runs faster and uses less memory. For the Iowa Prairie Soil dataset (252 Gbp after quality trimming), the assembly quality of MEGAHIT v1.0, when compared with v0.1, has a significant improvement, namely, 36% increase in assembly size and 23% in N50. More interestingly, MEGAHIT v1.0 is no slower than before (even running with the extra modules). This is primarily due to a new CPU-based algorithm for SdBG construction that is faster and requires less memory. Using CPU only, MEGAHIT v1.0 can assemble the Iowa Prairie Soil sample in about 43 h, reducing the running time of v0.1 by at least 25% and memory usage by up to 50%. MEGAHIT v1.0, exhibiting a smaller memory footprint, can process even larger datasets. The Kansas Prairie Soil sample (484 Gbp), the largest publicly available dataset, can now be assembled using no more than 500 GB of memory in 7.5 days. The assemblies of these datasets (and other large metagenomic datasets), as well as the software, are available at the website https://hku-bal.github.io/megabox.	-
dc.language	eng	-
dc.publisher	Academic Press. The Journal's web site is located at http://www.elsevier.com/locate/ymeth	-
dc.relation.ispartof	Methods	-
dc.rights	Posting accepted manuscript (postprint): © <year>. This manuscript version is made available under the CC-BY-NC-ND 4.0 license http://creativecommons.org/licenses/by-nc-nd/4.0/	-
dc.subject	Metagenome assembly	-
dc.subject	Parallel computing	-
dc.subject	Succinct data structure	-
dc.title	MEGAHIT v1.0: A fast and scalable metagenome assembler driven by advanced methodologies and community practices	-
dc.type	Article	-
dc.identifier.email	Luo, R: rbluo@hku.hk	-
dc.identifier.email	Leung, HCM: cmleung2@cs.hku.hk	-
dc.identifier.email	Ting, HF: hfting@cs.hku.hk	-
dc.identifier.email	Lam, TW: twlam@cs.hku.hk	-
dc.identifier.authority	Leung, HCM=rp00144	-
dc.identifier.authority	Ting, HF=rp00177	-
dc.identifier.authority	Lam, TW=rp00135	-
dc.identifier.doi	10.1016/j.ymeth.2016.02.020	-
dc.identifier.pmid	27012178	-
dc.identifier.scopus	eid_2-s2.0-84962227778	-
dc.identifier.hkuros	260189	-
dc.identifier.hkuros	310903	-
dc.identifier.volume	102	-
dc.identifier.spage	3	-
dc.identifier.epage	11	-
dc.identifier.isi	WOS:000377316200002	-
dc.publisher.place	United States	-
dc.identifier.issnl	1046-2023	-
