File Download
Supplementary
-
Citations:
- Appears in Collections:
postgraduate thesis: Iterative de Bruijn graph assemblers for second-generation sequencing reads
Title | Iterative de Bruijn graph assemblers for second-generation sequencing reads |
---|---|
Authors | |
Advisors | Advisor(s):Chin, FYL |
Issue Date | 2012 |
Publisher | The University of Hong Kong (Pokfulam, Hong Kong) |
Citation | Peng, Y. [彭煜]. (2012). Iterative de Bruijn graph assemblers for second-generation sequencing reads. (Thesis). University of Hong Kong, Pokfulam, Hong Kong SAR. Retrieved from http://dx.doi.org/10.5353/th_b5053405 |
Abstract | The recent advance of second-generation sequencing technologies has made it possible to generate a vast amount of short read sequences from a DNA (cDNA) sample. Current short read assemblers make use of the de Bruijn graph, in which each vertex is a k-mer and each edge connecting vertex u and vertex v represents u and v appearing in a read consecutively, to produce contigs. There are three major problems for de Bruijn graph assemblers: (1) branch problem, due to errors and repeats; (2) gap problem, due to low or uneven sequencing depth; and (3) error problem, due to sequencing errors. A proper choice of k value is a crucial tradeoff in de Bruijn graph assemblers: a low k value leads to fewer gaps but more branches; a high k value leads to fewer branches but more gaps.
In this thesis, I first analyze the fundamental genome assembly problem and then propose an iterative de Bruijn graph assembler (IDBA), which iterates from low to high k values, to construct a de Bruijn graph with fewer branches and fewer gaps than any other de Bruijn graph assembler using a fixed k value. Then, the second-generation sequencing data from metagenomic, single-cell and transcriptome samples is investigated. IDBA is then tailored with special treatments to handle the specific issues for each kind of data.
For metagenomic sequencing data, a graph partition algorithm is proposed to separate de Bruijn graph into dense components, which represent similar regions in subspecies from the same species, and multiple sequence alignment is used to produce consensus of each component. For sequencing data with highly uneven depth such as single-cell and metagenomic sequencing data, a method called local assembly is designed to reconstruct missing k-mers in low-depth regions. Then, based on the observation that short and relatively low-depth contigs are more likely erroneous, progressive depth on contigs is used to remove errors in both low-depth and high-depth regions iteratively. For transcriptome sequencing data, a variant of the progressive depth method is adopted to decompose the de Bruijn graph into components corresponding to transcripts from the same gene, and then the transcripts are found in each component by considering the reads and paired-end reads support.
Plenty of experiments on both simulated and real data show that IDBA assemblers outperform the existing assemblers by constructing longer contigs with higher completeness and similar or better accuracy. The running time of IDBA assemblers is comparable to existing algorithms, while the memory cost is usually less than the others. |
Degree | Doctor of Philosophy |
Subject | Nucleotide sequence - Data processing. Sequence alignment (Bioinformatics) |
Dept/Program | Computer Science |
Persistent Identifier | http://hdl.handle.net/10722/188286 |
HKU Library Item ID | b5053405 |
DC Field | Value | Language |
---|---|---|
dc.contributor.advisor | Chin, FYL | - |
dc.contributor.author | Peng, Yu | - |
dc.contributor.author | 彭煜 | - |
dc.date.accessioned | 2013-08-27T08:03:15Z | - |
dc.date.available | 2013-08-27T08:03:15Z | - |
dc.date.issued | 2012 | - |
dc.identifier.citation | Peng, Y. [彭煜]. (2012). Iterative de Bruijn graph assemblers for second-generation sequencing reads. (Thesis). University of Hong Kong, Pokfulam, Hong Kong SAR. Retrieved from http://dx.doi.org/10.5353/th_b5053405 | - |
dc.identifier.uri | http://hdl.handle.net/10722/188286 | - |
dc.description.abstract | The recent advance of second-generation sequencing technologies has made it possible to generate a vast amount of short read sequences from a DNA (cDNA) sample. Current short read assemblers make use of the de Bruijn graph, in which each vertex is a k-mer and each edge connecting vertex u and vertex v represents u and v appearing in a read consecutively, to produce contigs. There are three major problems for de Bruijn graph assemblers: (1) branch problem, due to errors and repeats; (2) gap problem, due to low or uneven sequencing depth; and (3) error problem, due to sequencing errors. A proper choice of k value is a crucial tradeoff in de Bruijn graph assemblers: a low k value leads to fewer gaps but more branches; a high k value leads to fewer branches but more gaps. In this thesis, I first analyze the fundamental genome assembly problem and then propose an iterative de Bruijn graph assembler (IDBA), which iterates from low to high k values, to construct a de Bruijn graph with fewer branches and fewer gaps than any other de Bruijn graph assembler using a fixed k value. Then, the second-generation sequencing data from metagenomic, single-cell and transcriptome samples is investigated. IDBA is then tailored with special treatments to handle the specific issues for each kind of data. For metagenomic sequencing data, a graph partition algorithm is proposed to separate de Bruijn graph into dense components, which represent similar regions in subspecies from the same species, and multiple sequence alignment is used to produce consensus of each component. For sequencing data with highly uneven depth such as single-cell and metagenomic sequencing data, a method called local assembly is designed to reconstruct missing k-mers in low-depth regions. Then, based on the observation that short and relatively low-depth contigs are more likely erroneous, progressive depth on contigs is used to remove errors in both low-depth and high-depth regions iteratively. For transcriptome sequencing data, a variant of the progressive depth method is adopted to decompose the de Bruijn graph into components corresponding to transcripts from the same gene, and then the transcripts are found in each component by considering the reads and paired-end reads support. Plenty of experiments on both simulated and real data show that IDBA assemblers outperform the existing assemblers by constructing longer contigs with higher completeness and similar or better accuracy. The running time of IDBA assemblers is comparable to existing algorithms, while the memory cost is usually less than the others. | - |
dc.language | eng | - |
dc.publisher | The University of Hong Kong (Pokfulam, Hong Kong) | - |
dc.relation.ispartof | HKU Theses Online (HKUTO) | - |
dc.rights | The author retains all proprietary rights, (such as patent rights) and the right to use in future works. | - |
dc.rights | This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License. | - |
dc.source.uri | http://hub.hku.hk/bib/B50534051 | - |
dc.subject.lcsh | Nucleotide sequence - Data processing. | - |
dc.subject.lcsh | Sequence alignment (Bioinformatics) | - |
dc.title | Iterative de Bruijn graph assemblers for second-generation sequencing reads | - |
dc.type | PG_Thesis | - |
dc.identifier.hkul | b5053405 | - |
dc.description.thesisname | Doctor of Philosophy | - |
dc.description.thesislevel | Doctoral | - |
dc.description.thesisdiscipline | Computer Science | - |
dc.description.nature | published_or_final_version | - |
dc.identifier.doi | 10.5353/th_b5053405 | - |
dc.date.hkucongregation | 2013 | - |
dc.identifier.mmsid | 991035480719703414 | - |