Large genome de novo assembly with bi-directional BWT

Liu, Binghang; 劉兵行

DOI: 10.5353/th_b5736698

Appears in Collections:
- HKU Theses Online
- Computer Science: Theses

postgraduate thesis: Large genome de novo assembly with bi-directional BWT

Title	Large genome de novo assembly with bi-directional BWT
Authors	Liu, Binghang 劉兵行
Issue Date	2015
Publisher	The University of Hong Kong (Pokfulam, Hong Kong)
Citation	Liu, B. [劉兵行]. (2015). Large genome de novo assembly with bi-directional BWT. (Thesis). University of Hong Kong, Pokfulam, Hong Kong SAR. Retrieved from http://dx.doi.org/10.5353/th_b5736698
Abstract	De novo genome assembly is a fundamental problem in genomics research. When assembling large genomes, time is often a very important concern, and one might have no choice but to use a more efficient assembler like SOAPdenovo2 instead of a high-quality but prohibitively slow assembler (e.g., SPAdes). Yet SOAPdenovo2 has inherent difficulty taking full advantage of longer reads (say, 150bp to 250bp from Illumina HiSeq and MiSeq). Other assemblers that are based on string graphs (e.g., SGA), though less popular and also very slow, are indeed more favorable for longer reads. In this thesis, I mainly present a new contig assembler called BASE, based on a seed-extension approach. It exploits an efficient indexing of reads to generate adaptive seeds with a high probability of unique appearance in the genome and high sequencing quality. Guided by these seeds, BASE constructs extension trees and gradually removes the branches with a method called reverse validation, which utilizes information about read coverage and paired-end relationships to obtain consensus sequences of reads sharing the seeds. These consensus sequences are further extended to form high-quality contigs. Benchmarks on several bacterial and human datasets demonstrate the performance advantage of BASE in speed and assembly quality when longer reads are used. Our first benchmark was based on two datasets of deeply sequenced bacterial genomes (240X) with read lengths of 100bp and 250bp. Especially for 250bp reads, BASE performs much better than SOAPdenovo2 and SGA and is similar to SPAdes in performance. Regarding speed, BASE is consistently a few times faster than SPAdes and SGA, but still slower than SOAPdenovo2. We have further compared BASE and SOAPdenovo2 using human genome datasets with read lengths of 100bp, 150bp and 250bp. BASE consistently achieves a higher N50 for all datasets, and the improvement becomes more significant when read length reaches 250bp. SOAPdenovo2 uses relatively more memory when sequencing error is high. BASE is an efficient assembler for contig construction, with significant improvement in quality for long NGS reads. It could be easily extended to support scaffolding in the near future.
Degree	Master of Philosophy
Subject	Nucleotide sequence - Data processing; Data compression (Computer science)
Dept/Program	Computer Science
Persistent Identifier	http://hdl.handle.net/10722/225227
HKU Library Item ID	b5736698
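
The abstract compares assemblers by N50, a standard contiguity metric: the length of the shortest contig in the smallest set of longest contigs that together cover at least half of the total assembled bases. The following is a minimal sketch of that common definition, not code from the thesis; the function name `n50` and the example lengths are illustrative assumptions.

```python
def n50(contig_lengths):
    """Return the N50 of an iterable of contig lengths (0 if empty).

    Illustrative helper, not part of BASE or any assembler's API.
    """
    # Walk the contigs from longest to shortest.
    lengths = sorted(contig_lengths, reverse=True)
    total = sum(lengths)
    running = 0
    for length in lengths:
        running += length
        # Stop at the first contig where the cumulative length
        # reaches at least half of the total assembly size.
        if 2 * running >= total:
            return length
    return 0  # no contigs supplied


# Hypothetical contig lengths, for illustration only.
example = [2, 3, 4, 5, 6, 7, 8, 9, 10]
print(n50(example))
```

A higher N50 generally indicates longer contigs, but it says nothing about correctness, so it is usually read alongside misassembly counts.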

DC Field	Value	Language
dc.contributor.author	Liu, Binghang	-
dc.contributor.author	劉兵行	-
dc.date.accessioned	2016-04-28T06:50:59Z	-
dc.date.available	2016-04-28T06:50:59Z	-
dc.date.issued	2015	-
dc.identifier.citation	Liu, B. [劉兵行]. (2015). Large genome de novo assembly with bi-directional BWT. (Thesis). University of Hong Kong, Pokfulam, Hong Kong SAR. Retrieved from http://dx.doi.org/10.5353/th_b5736698	-
dc.identifier.uri	http://hdl.handle.net/10722/225227	-
dc.description.abstract	De novo genome assembly is a fundamental problem in genomics research. When assembling large genomes, time is often a very important concern, and one might have no choice but to use a more efficient assembler like SOAPdenovo2 instead of a high-quality but prohibitively slow assembler (e.g., SPAdes). Yet SOAPdenovo2 has inherent difficulty taking full advantage of longer reads (say, 150bp to 250bp from Illumina HiSeq and MiSeq). Other assemblers that are based on string graphs (e.g., SGA), though less popular and also very slow, are indeed more favorable for longer reads. In this thesis, I mainly present a new contig assembler called BASE, based on a seed-extension approach. It exploits an efficient indexing of reads to generate adaptive seeds with a high probability of unique appearance in the genome and high sequencing quality. Guided by these seeds, BASE constructs extension trees and gradually removes the branches with a method called reverse validation, which utilizes information about read coverage and paired-end relationships to obtain consensus sequences of reads sharing the seeds. These consensus sequences are further extended to form high-quality contigs. Benchmarks on several bacterial and human datasets demonstrate the performance advantage of BASE in speed and assembly quality when longer reads are used. Our first benchmark was based on two datasets of deeply sequenced bacterial genomes (240X) with read lengths of 100bp and 250bp. Especially for 250bp reads, BASE performs much better than SOAPdenovo2 and SGA and is similar to SPAdes in performance. Regarding speed, BASE is consistently a few times faster than SPAdes and SGA, but still slower than SOAPdenovo2. We have further compared BASE and SOAPdenovo2 using human genome datasets with read lengths of 100bp, 150bp and 250bp. BASE consistently achieves a higher N50 for all datasets, and the improvement becomes more significant when read length reaches 250bp. SOAPdenovo2 uses relatively more memory when sequencing error is high. BASE is an efficient assembler for contig construction, with significant improvement in quality for long NGS reads. It could be easily extended to support scaffolding in the near future.	-
dc.language	eng	-
dc.publisher	The University of Hong Kong (Pokfulam, Hong Kong)	-
dc.relation.ispartof	HKU Theses Online (HKUTO)	-
dc.rights	The author retains all proprietary rights, (such as patent rights) and the right to use in future works.	-
dc.rights	This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.	-
dc.subject.lcsh	Nucleotide sequence - Data processing	-
dc.subject.lcsh	Data compression (Computer science)	-
dc.title	Large genome de novo assembly with bi-directional BWT	-
dc.type	PG_Thesis	-
dc.identifier.hkul	b5736698	-
dc.description.thesisname	Master of Philosophy	-
dc.description.thesislevel	Master	-
dc.description.thesisdiscipline	Computer Science	-
dc.description.nature	published_or_final_version	-
dc.identifier.doi	10.5353/th_b5736698	-
dc.identifier.mmsid	991019348549703414	-
