postgraduate thesis: Large genome de novo assembly with bi-directional BWT
Title | Large genome de novo assembly with bi-directional BWT |
---|---|
Authors | Liu, Binghang (劉兵行) |
Issue Date | 2015 |
Publisher | The University of Hong Kong (Pokfulam, Hong Kong) |
Citation | Liu, B. [劉兵行]. (2015). Large genome de novo assembly with bi-directional BWT. (Thesis). University of Hong Kong, Pokfulam, Hong Kong SAR. Retrieved from http://dx.doi.org/10.5353/th_b5736698 |
Abstract | De novo genome assembly is a fundamental problem in genomics research. When assembling large genomes, time is often a very important concern, and one might have no choice but to use a more efficient assembler like SOAPdenovo2 instead of a high-quality but prohibitively slow assembler (e.g., SPAdes). Yet SOAPdenovo2 has inherent difficulty taking full advantage of longer reads (say, 150bp to 250bp from Illumina HiSeq and MiSeq). Other assemblers that are based on string graphs (e.g., SGA), though less popular and also very slow, are better suited to longer reads.
In this thesis, I mainly present a new contig assembler called BASE, based on a seed-extension approach. It exploits an efficient indexing of reads to generate adaptive seeds that have high sequencing quality and a high probability of appearing only once in the genome. Guided by these seeds, BASE constructs extension trees and gradually removes branches with a method called reverse validation, which uses read coverage and paired-end relationships to obtain consensus sequences of the reads sharing each seed. These consensus sequences are further extended to form high-quality contigs.
Benchmarks on several bacterial and human datasets demonstrate the advantage of BASE in speed and assembly quality when longer reads are used. Our first benchmark was based on two datasets of deeply sequenced bacterial genomes (240X) with read lengths of 100bp and 250bp. For 250bp reads in particular, BASE performs much better than SOAPdenovo2 and SGA and is similar to SPAdes in performance. Regarding speed, BASE is consistently a few times faster than SPAdes and SGA, but still slower than SOAPdenovo2. We have further compared BASE and SOAPdenovo2 using human genome datasets with read lengths of 100bp, 150bp and 250bp. BASE consistently achieves a higher N50 for all datasets, and the improvement becomes more significant when the read length reaches 250bp. SOAPdenovo2 uses relatively more memory when the sequencing error rate is high.
BASE is an efficient assembler for contig construction, with significant improvement in quality for long NGS reads. It could be easily extended to support scaffolding in the near future. |
Degree | Master of Philosophy |
Subject | Nucleotide sequence - Data processing; Data compression (Computer science) |
Dept/Program | Computer Science |
Persistent Identifier | http://hdl.handle.net/10722/225227 |
HKU Library Item ID | b5736698 |
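
The abstract compares assemblers by N50. As a reading aid, the following is a minimal sketch of how N50 is commonly computed from a list of contig lengths. The function name `n50` and the sample lengths are illustrative assumptions; nothing here is taken from the thesis or from the BASE source code.

```python
from typing import Iterable


def n50(contig_lengths: Iterable[int]) -> int:
    """Return the N50 of an assembly.

    N50 is the length L such that contigs of length >= L together
    cover at least half of the total assembled bases.
    """
    # Sort longest first so the running total reaches 50% as early as possible.
    lengths = sorted(contig_lengths, reverse=True)
    if not lengths:
        raise ValueError("at least one contig length is required")

    half_total = sum(lengths) / 2
    running = 0
    for length in lengths:
        running += length
        if running >= half_total:
            return length

    # Unreachable for non-empty input, kept for clarity.
    return lengths[-1]


if __name__ == "__main__":
    # Hypothetical contig lengths, for illustration only.
    example = [80, 70, 50, 40, 30, 20]
    print(n50(example))
```

With the hypothetical lengths above, the total is 290 bases, and the two longest contigs (80 and 70) together reach 150, which passes the halfway mark of 145, so the N50 is 70. A higher N50 therefore indicates that more of the assembly sits in long contigs.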
DC Field | Value | Language |
---|---|---|
dc.contributor.author | Liu, Binghang | - |
dc.contributor.author | 劉兵行 | - |
dc.date.accessioned | 2016-04-28T06:50:59Z | - |
dc.date.available | 2016-04-28T06:50:59Z | - |
dc.date.issued | 2015 | - |
dc.identifier.citation | Liu, B. [劉兵行]. (2015). Large genome de novo assembly with bi-directional BWT. (Thesis). University of Hong Kong, Pokfulam, Hong Kong SAR. Retrieved from http://dx.doi.org/10.5353/th_b5736698 | - |
dc.identifier.uri | http://hdl.handle.net/10722/225227 | - |
dc.description.abstract | De novo genome assembly is a fundamental problem in genomics research. When assembling large genomes, time is often a very important concern, and one might have no choice but to use a more efficient assembler like SOAPdenovo2 instead of a high-quality but prohibitively slow assembler (e.g., SPAdes). Yet SOAPdenovo2 has inherent difficulty taking full advantage of longer reads (say, 150bp to 250bp from Illumina HiSeq and MiSeq). Other assemblers that are based on string graphs (e.g., SGA), though less popular and also very slow, are better suited to longer reads. In this thesis, I mainly present a new contig assembler called BASE, based on a seed-extension approach. It exploits an efficient indexing of reads to generate adaptive seeds that have high sequencing quality and a high probability of appearing only once in the genome. Guided by these seeds, BASE constructs extension trees and gradually removes branches with a method called reverse validation, which uses read coverage and paired-end relationships to obtain consensus sequences of the reads sharing each seed. These consensus sequences are further extended to form high-quality contigs. Benchmarks on several bacterial and human datasets demonstrate the advantage of BASE in speed and assembly quality when longer reads are used. Our first benchmark was based on two datasets of deeply sequenced bacterial genomes (240X) with read lengths of 100bp and 250bp. For 250bp reads in particular, BASE performs much better than SOAPdenovo2 and SGA and is similar to SPAdes in performance. Regarding speed, BASE is consistently a few times faster than SPAdes and SGA, but still slower than SOAPdenovo2. We have further compared BASE and SOAPdenovo2 using human genome datasets with read lengths of 100bp, 150bp and 250bp. BASE consistently achieves a higher N50 for all datasets, and the improvement becomes more significant when the read length reaches 250bp. SOAPdenovo2 uses relatively more memory when the sequencing error rate is high. BASE is an efficient assembler for contig construction, with significant improvement in quality for long NGS reads. It could be easily extended to support scaffolding in the near future. | - |
dc.language | eng | - |
dc.publisher | The University of Hong Kong (Pokfulam, Hong Kong) | - |
dc.relation.ispartof | HKU Theses Online (HKUTO) | - |
dc.rights | The author retains all proprietary rights, (such as patent rights) and the right to use in future works. | - |
dc.rights | This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License. | - |
dc.subject.lcsh | Nucleotide sequence - Data processing | - |
dc.subject.lcsh | Data compression (Computer science) | - |
dc.title | Large genome de novo assembly with bi-directional BWT | - |
dc.type | PG_Thesis | - |
dc.identifier.hkul | b5736698 | - |
dc.description.thesisname | Master of Philosophy | - |
dc.description.thesislevel | Master | - |
dc.description.thesisdiscipline | Computer Science | - |
dc.description.nature | published_or_final_version | - |
dc.identifier.doi | 10.5353/th_b5736698 | - |
dc.identifier.mmsid | 991019348549703414 | - |