File Download
  Links for fulltext
     (May Require Subscription)
Supplementary

postgraduate thesis: Iterative de Bruijn graph assemblers for second-generation sequencing reads

TitleIterative de Bruijn graph assemblers for second-generation sequencing reads
Authors
Advisors
Advisor(s):Chin, FYL
Issue Date2012
PublisherThe University of Hong Kong (Pokfulam, Hong Kong)
Citation
Peng, Y. [彭煜]. (2012). Iterative de Bruijn graph assemblers for second-generation sequencing reads. (Thesis). University of Hong Kong, Pokfulam, Hong Kong SAR. Retrieved from http://dx.doi.org/10.5353/th_b5053405
AbstractThe recent advance of second-generation sequencing technologies has made it possible to generate a vast amount of short read sequences from a DNA (cDNA) sample. Current short read assemblers make use of the de Bruijn graph, in which each vertex is a k-mer and each edge connecting vertex u and vertex v represents u and v appearing in a read consecutively, to produce contigs. There are three major problems for de Bruijn graph assemblers: (1) branch problem, due to errors and repeats; (2) gap problem, due to low or uneven sequencing depth; and (3) error problem, due to sequencing errors. A proper choice of k value is a crucial tradeoff in de Bruijn graph assemblers: a low k value leads to fewer gaps but more branches; a high k value leads to fewer branches but more gaps. In this thesis, I first analyze the fundamental genome assembly problem and then propose an iterative de Bruijn graph assembler (IDBA), which iterates from low to high k values, to construct a de Bruijn graph with fewer branches and fewer gaps than any other de Bruijn graph assembler using a fixed k value. Then, the second-generation sequencing data from metagenomic, single-cell and transcriptome samples is investigated. IDBA is then tailored with special treatments to handle the specific issues for each kind of data. For metagenomic sequencing data, a graph partition algorithm is proposed to separate de Bruijn graph into dense components, which represent similar regions in subspecies from the same species, and multiple sequence alignment is used to produce consensus of each component. For sequencing data with highly uneven depth such as single-cell and metagenomic sequencing data, a method called local assembly is designed to reconstruct missing k-mers in low-depth regions. Then, based on the observation that short and relatively low-depth contigs are more likely erroneous, progressive depth on contigs is used to remove errors in both low-depth and high-depth regions iteratively. For transcriptome sequencing data, a variant of the progressive depth method is adopted to decompose the de Bruijn graph into components corresponding to transcripts from the same gene, and then the transcripts are found in each component by considering the reads and paired-end reads support. Plenty of experiments on both simulated and real data show that IDBA assemblers outperform the existing assemblers by constructing longer contigs with higher completeness and similar or better accuracy. The running time of IDBA assemblers is comparable to existing algorithms, while the memory cost is usually less than the others.
DegreeDoctor of Philosophy
SubjectNucleotide sequence - Data processing.
Sequence alignment (Bioinformatics)
Dept/ProgramComputer Science
Persistent Identifierhttp://hdl.handle.net/10722/188286
HKU Library Item IDb5053405

 

DC FieldValueLanguage
dc.contributor.advisorChin, FYL-
dc.contributor.authorPeng, Yu-
dc.contributor.author彭煜-
dc.date.accessioned2013-08-27T08:03:15Z-
dc.date.available2013-08-27T08:03:15Z-
dc.date.issued2012-
dc.identifier.citationPeng, Y. [彭煜]. (2012). Iterative de Bruijn graph assemblers for second-generation sequencing reads. (Thesis). University of Hong Kong, Pokfulam, Hong Kong SAR. Retrieved from http://dx.doi.org/10.5353/th_b5053405-
dc.identifier.urihttp://hdl.handle.net/10722/188286-
dc.description.abstractThe recent advance of second-generation sequencing technologies has made it possible to generate a vast amount of short read sequences from a DNA (cDNA) sample. Current short read assemblers make use of the de Bruijn graph, in which each vertex is a k-mer and each edge connecting vertex u and vertex v represents u and v appearing in a read consecutively, to produce contigs. There are three major problems for de Bruijn graph assemblers: (1) branch problem, due to errors and repeats; (2) gap problem, due to low or uneven sequencing depth; and (3) error problem, due to sequencing errors. A proper choice of k value is a crucial tradeoff in de Bruijn graph assemblers: a low k value leads to fewer gaps but more branches; a high k value leads to fewer branches but more gaps. In this thesis, I first analyze the fundamental genome assembly problem and then propose an iterative de Bruijn graph assembler (IDBA), which iterates from low to high k values, to construct a de Bruijn graph with fewer branches and fewer gaps than any other de Bruijn graph assembler using a fixed k value. Then, the second-generation sequencing data from metagenomic, single-cell and transcriptome samples is investigated. IDBA is then tailored with special treatments to handle the specific issues for each kind of data. For metagenomic sequencing data, a graph partition algorithm is proposed to separate de Bruijn graph into dense components, which represent similar regions in subspecies from the same species, and multiple sequence alignment is used to produce consensus of each component. For sequencing data with highly uneven depth such as single-cell and metagenomic sequencing data, a method called local assembly is designed to reconstruct missing k-mers in low-depth regions. Then, based on the observation that short and relatively low-depth contigs are more likely erroneous, progressive depth on contigs is used to remove errors in both low-depth and high-depth regions iteratively. For transcriptome sequencing data, a variant of the progressive depth method is adopted to decompose the de Bruijn graph into components corresponding to transcripts from the same gene, and then the transcripts are found in each component by considering the reads and paired-end reads support. Plenty of experiments on both simulated and real data show that IDBA assemblers outperform the existing assemblers by constructing longer contigs with higher completeness and similar or better accuracy. The running time of IDBA assemblers is comparable to existing algorithms, while the memory cost is usually less than the others.-
dc.languageeng-
dc.publisherThe University of Hong Kong (Pokfulam, Hong Kong)-
dc.relation.ispartofHKU Theses Online (HKUTO)-
dc.rightsThe author retains all proprietary rights, (such as patent rights) and the right to use in future works.-
dc.rightsThis work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.-
dc.source.urihttp://hub.hku.hk/bib/B50534051-
dc.subject.lcshNucleotide sequence - Data processing.-
dc.subject.lcshSequence alignment (Bioinformatics)-
dc.titleIterative de Bruijn graph assemblers for second-generation sequencing reads-
dc.typePG_Thesis-
dc.identifier.hkulb5053405-
dc.description.thesisnameDoctor of Philosophy-
dc.description.thesislevelDoctoral-
dc.description.thesisdisciplineComputer Science-
dc.description.naturepublished_or_final_version-
dc.identifier.doi10.5353/th_b5053405-
dc.date.hkucongregation2013-
dc.identifier.mmsid991035480719703414-

Export via OAI-PMH Interface in XML Formats


OR


Export to Other Non-XML Formats