Iterative de Bruijn graph assemblers for second-generation sequencing reads

Peng, Yu; 彭煜

File Download

FullText.pdf

Links for fulltext

(May Require Subscription)

DOI: 10.5353/th_b5053405

Supplementary

Citations:
Appears in Collections:
- Computer Science & Information Systems: Theses
- HKU Theses Online

postgraduate thesis: Iterative de Bruijn graph assemblers for second-generation sequencing reads

Title	Iterative de Bruijn graph assemblers for second-generation sequencing reads
Authors	Peng, Yu 彭煜
Advisors	Advisor(s):Chin, FYL
Issue Date	2012
Publisher	The University of Hong Kong (Pokfulam, Hong Kong)
Citation	Peng, Y. [彭煜]. (2012). Iterative de Bruijn graph assemblers for second-generation sequencing reads. (Thesis). University of Hong Kong, Pokfulam, Hong Kong SAR. Retrieved from http://dx.doi.org/10.5353/th_b5053405
Abstract	The recent advance of second-generation sequencing technologies has made it possible to generate a vast amount of short read sequences from a DNA (cDNA) sample. Current short read assemblers make use of the de Bruijn graph, in which each vertex is a k-mer and each edge connecting vertex u and vertex v represents u and v appearing in a read consecutively, to produce contigs. There are three major problems for de Bruijn graph assemblers: (1) branch problem, due to errors and repeats; (2) gap problem, due to low or uneven sequencing depth; and (3) error problem, due to sequencing errors. A proper choice of k value is a crucial tradeoff in de Bruijn graph assemblers: a low k value leads to fewer gaps but more branches; a high k value leads to fewer branches but more gaps. In this thesis, I first analyze the fundamental genome assembly problem and then propose an iterative de Bruijn graph assembler (IDBA), which iterates from low to high k values, to construct a de Bruijn graph with fewer branches and fewer gaps than any other de Bruijn graph assembler using a fixed k value. Then, the second-generation sequencing data from metagenomic, single-cell and transcriptome samples is investigated. IDBA is then tailored with special treatments to handle the specific issues for each kind of data. For metagenomic sequencing data, a graph partition algorithm is proposed to separate de Bruijn graph into dense components, which represent similar regions in subspecies from the same species, and multiple sequence alignment is used to produce consensus of each component. For sequencing data with highly uneven depth such as single-cell and metagenomic sequencing data, a method called local assembly is designed to reconstruct missing k-mers in low-depth regions. Then, based on the observation that short and relatively low-depth contigs are more likely erroneous, progressive depth on contigs is used to remove errors in both low-depth and high-depth regions iteratively. For transcriptome sequencing data, a variant of the progressive depth method is adopted to decompose the de Bruijn graph into components corresponding to transcripts from the same gene, and then the transcripts are found in each component by considering the reads and paired-end reads support. Plenty of experiments on both simulated and real data show that IDBA assemblers outperform the existing assemblers by constructing longer contigs with higher completeness and similar or better accuracy. The running time of IDBA assemblers is comparable to existing algorithms, while the memory cost is usually less than the others.
Degree	Doctor of Philosophy
Subject	Nucleotide sequence - Data processing. Sequence alignment (Bioinformatics)
Dept/Program	Computer Science
Persistent Identifier	http://hdl.handle.net/10722/188286
HKU Library Item ID	b5053405

DC Field	Value	Language
dc.contributor.advisor	Chin, FYL	-
dc.contributor.author	Peng, Yu	-
dc.contributor.author	彭煜	-
dc.date.accessioned	2013-08-27T08:03:15Z	-
dc.date.available	2013-08-27T08:03:15Z	-
dc.date.issued	2012	-
dc.identifier.citation	Peng, Y. [彭煜]. (2012). Iterative de Bruijn graph assemblers for second-generation sequencing reads. (Thesis). University of Hong Kong, Pokfulam, Hong Kong SAR. Retrieved from http://dx.doi.org/10.5353/th_b5053405	-
dc.identifier.uri	http://hdl.handle.net/10722/188286	-
dc.description.abstract	The recent advance of second-generation sequencing technologies has made it possible to generate a vast amount of short read sequences from a DNA (cDNA) sample. Current short read assemblers make use of the de Bruijn graph, in which each vertex is a k-mer and each edge connecting vertex u and vertex v represents u and v appearing in a read consecutively, to produce contigs. There are three major problems for de Bruijn graph assemblers: (1) branch problem, due to errors and repeats; (2) gap problem, due to low or uneven sequencing depth; and (3) error problem, due to sequencing errors. A proper choice of k value is a crucial tradeoff in de Bruijn graph assemblers: a low k value leads to fewer gaps but more branches; a high k value leads to fewer branches but more gaps. In this thesis, I first analyze the fundamental genome assembly problem and then propose an iterative de Bruijn graph assembler (IDBA), which iterates from low to high k values, to construct a de Bruijn graph with fewer branches and fewer gaps than any other de Bruijn graph assembler using a fixed k value. Then, the second-generation sequencing data from metagenomic, single-cell and transcriptome samples is investigated. IDBA is then tailored with special treatments to handle the specific issues for each kind of data. For metagenomic sequencing data, a graph partition algorithm is proposed to separate de Bruijn graph into dense components, which represent similar regions in subspecies from the same species, and multiple sequence alignment is used to produce consensus of each component. For sequencing data with highly uneven depth such as single-cell and metagenomic sequencing data, a method called local assembly is designed to reconstruct missing k-mers in low-depth regions. Then, based on the observation that short and relatively low-depth contigs are more likely erroneous, progressive depth on contigs is used to remove errors in both low-depth and high-depth regions iteratively. For transcriptome sequencing data, a variant of the progressive depth method is adopted to decompose the de Bruijn graph into components corresponding to transcripts from the same gene, and then the transcripts are found in each component by considering the reads and paired-end reads support. Plenty of experiments on both simulated and real data show that IDBA assemblers outperform the existing assemblers by constructing longer contigs with higher completeness and similar or better accuracy. The running time of IDBA assemblers is comparable to existing algorithms, while the memory cost is usually less than the others.	-
dc.language	eng	-
dc.publisher	The University of Hong Kong (Pokfulam, Hong Kong)	-
dc.relation.ispartof	HKU Theses Online (HKUTO)	-
dc.rights	The author retains all proprietary rights, (such as patent rights) and the right to use in future works.	-
dc.rights	This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.	-
dc.source.uri	http://hub.hku.hk/bib/B50534051	-
dc.subject.lcsh	Nucleotide sequence - Data processing.	-
dc.subject.lcsh	Sequence alignment (Bioinformatics)	-
dc.title	Iterative de Bruijn graph assemblers for second-generation sequencing reads	-
dc.type	PG_Thesis	-
dc.identifier.hkul	b5053405	-
dc.description.thesisname	Doctor of Philosophy	-
dc.description.thesislevel	Doctoral	-
dc.description.thesisdiscipline	Computer Science	-
dc.description.nature	published_or_final_version	-
dc.identifier.doi	10.5353/th_b5053405	-
dc.date.hkucongregation	2013	-
dc.identifier.mmsid	991035480719703414	-

File Download

Links for fulltext

(May Require Subscription)

Supplementary

postgraduate thesis: Iterative de Bruijn graph assemblers for second-generation sequencing reads

Export via OAI-PMH Interface in XML Formats

OR

Export to Other Non-XML Formats