
postgraduate thesis: Genome analyses based on RNA sequences and DNA optical maps

Title: Genome analyses based on RNA sequences and DNA optical maps
Authors: Li, Menglu [李夢璐]
Advisor(s): Yiu, SM
Issue Date: 2018
Publisher: The University of Hong Kong (Pokfulam, Hong Kong)
Citation: Li, M. [李夢璐]. (2018). Genome analyses based on RNA sequences and DNA optical maps. (Thesis). University of Hong Kong, Pokfulam, Hong Kong SAR.
Abstract

DNA, RNA and proteins are three major macromolecules essential for all known forms of life. The genetic information within a biological system naturally transfers from DNA to RNA and from RNA to protein by transcription and translation. Investigations of DNA optical maps and RNA secondary structures are challenging tasks that expand our understanding of genomics.

An RNA molecule exists in nature as a single strand of nucleotides folding back on itself. This is because some bases are paired up by hydrogen bonds, forming base-pair interactions termed RNA secondary structures. For many RNA molecules, their secondary structures are often more important for correct functionality (gene expression, regulation, catalysis and cellular signal communication) than their plain sequences. Although the majority of RNAs fold into simple secondary structures, pseudoknots are found in almost all classes of RNAs. Their existence makes secondary structure prediction NP-hard. In order to predict complicated secondary structures, we devised a grammar-based machine learning method to predict secondary structures for all RNA sequences in Rfam. Treating every structure as the product of a unique operation path that generates it, we are able to train a rule transition probability matrix and a base emission probability matrix. These matrices determine the operation path to generate a secondary structure for a given RNA sequence. Experimental results show that our approach performs well with a high PPV and sensitivity, particularly for highly pseudoknotted RNAs.

DNA molecules are inherently fully paired double helices that store biological information and encode genetic instructions. In recent years, next-generation sequencing technologies have enabled researchers to make critical findings in genomics at low cost and with high efficiency. On the other hand, the short read length remains a major obstacle for thorough structural analyses such as de novo assembly and structural variation detection. As a complement, optical mapping is a high-throughput technique that produces long, high-resolution restriction maps. To lay a good basis for optical map studies, we first conducted a probabilistic error study on the alignment results of BioNano RefAligner on CEU trio maps. Sizing error, false cuts, missing cuts and unknown molecule orientation are carefully modeled using maximum likelihood estimation. Using the trio of samples and simulated datasets, this error model exhibits a better fit to BioNano optical maps than the previous model. In predicting the difficult regions that are prone to higher error rates, our error model performs more accurately than other popular error models.

Taking the optical map investigation one step further, an iterative framework is proposed to assemble optical maps into contigs. Each iteration begins with pairwise alignments among all input optical maps. Confident alignments compose an overlap graph. By careful graph correction and path search, each connected graph component yields a contig. The assembly process iterates by taking the resulting contigs as new inputs. The algorithm stops when contigs no longer extend or merge. Experiments on E. coli simulated and real datasets show that our assembler is capable of constructing long and accurate consensus maps without misconnections.
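
The overlap-graph step mentioned in the abstract (confident alignments form a graph, and each connected component yields one contig) can be illustrated with a minimal Python sketch. This is not the thesis's implementation: the function name, the integer map indices and the pre-computed list of confident alignment pairs are all assumptions made for illustration, and the alignment scoring, graph correction and path search described in the abstract are omitted.

```python
from collections import defaultdict


def connected_components(num_maps, confident_alignments):
    """Group optical-map indices into connected components of an overlap graph.

    num_maps: number of input maps, indexed 0..num_maps-1 (hypothetical encoding).
    confident_alignments: iterable of (i, j) pairs judged confident by some
        upstream aligner; how confidence is decided is outside this sketch.
    """
    parent = list(range(num_maps))  # union-find forest, one node per map

    def find(x):
        # Follow parent links to the root, halving the path as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in confident_alignments:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a  # merge the two components

    groups = defaultdict(list)
    for i in range(num_maps):
        groups[find(i)].append(i)
    # Each group would seed one contig in an assembly round.
    return sorted(groups.values())


if __name__ == "__main__":
    print(connected_components(6, [(0, 1), (4, 5), (1, 2)]))
```

In an iterative scheme like the one the abstract outlines, the contigs built from each component would be fed back in as the next round's inputs until no component contains more than one map.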
Degree: Doctor of Philosophy
Subject: Nucleotide sequence - Data processing; Gene mapping - Data processing
Dept/Program: Computer Science
Persistent Identifier: http://hdl.handle.net/10722/255439

 

DC Field | Value | Language
dc.contributor.advisor | Yiu, SM | -
dc.contributor.author | Li, Menglu | -
dc.contributor.author | 李夢璐 | -
dc.date.accessioned | 2018-07-05T07:43:33Z | -
dc.date.available | 2018-07-05T07:43:33Z | -
dc.date.issued | 2018 | -
dc.identifier.citation | Li, M. [李夢璐]. (2018). Genome analyses based on RNA sequences and DNA optical maps. (Thesis). University of Hong Kong, Pokfulam, Hong Kong SAR. | -
dc.identifier.uri | http://hdl.handle.net/10722/255439 | -
dc.description.abstract | DNA, RNA and proteins are three major macromolecules essential for all known forms of life. The genetic information within a biological system naturally transfers from DNA to RNA and from RNA to protein by transcription and translation. Investigations of DNA optical maps and RNA secondary structures are challenging tasks that expand our understanding of genomics. An RNA molecule exists in nature as a single strand of nucleotides folding back on itself. This is because some bases are paired up by hydrogen bonds, forming base-pair interactions termed RNA secondary structures. For many RNA molecules, their secondary structures are often more important for correct functionality (gene expression, regulation, catalysis and cellular signal communication) than their plain sequences. Although the majority of RNAs fold into simple secondary structures, pseudoknots are found in almost all classes of RNAs. Their existence makes secondary structure prediction NP-hard. In order to predict complicated secondary structures, we devised a grammar-based machine learning method to predict secondary structures for all RNA sequences in Rfam. Treating every structure as the product of a unique operation path that generates it, we are able to train a rule transition probability matrix and a base emission probability matrix. These matrices determine the operation path to generate a secondary structure for a given RNA sequence. Experimental results show that our approach performs well with a high PPV and sensitivity, particularly for highly pseudoknotted RNAs. DNA molecules are inherently fully paired double helices that store biological information and encode genetic instructions. In recent years, next-generation sequencing technologies have enabled researchers to make critical findings in genomics at low cost and with high efficiency. On the other hand, the short read length remains a major obstacle for thorough structural analyses such as de novo assembly and structural variation detection. As a complement, optical mapping is a high-throughput technique that produces long, high-resolution restriction maps. To lay a good basis for optical map studies, we first conducted a probabilistic error study on the alignment results of BioNano RefAligner on CEU trio maps. Sizing error, false cuts, missing cuts and unknown molecule orientation are carefully modeled using maximum likelihood estimation. Using the trio of samples and simulated datasets, this error model exhibits a better fit to BioNano optical maps than the previous model. In predicting the difficult regions that are prone to higher error rates, our error model performs more accurately than other popular error models. Taking the optical map investigation one step further, an iterative framework is proposed to assemble optical maps into contigs. Each iteration begins with pairwise alignments among all input optical maps. Confident alignments compose an overlap graph. By careful graph correction and path search, each connected graph component yields a contig. The assembly process iterates by taking the resulting contigs as new inputs. The algorithm stops when contigs no longer extend or merge. Experiments on E. coli simulated and real datasets show that our assembler is capable of constructing long and accurate consensus maps without misconnections. | -
dc.language | eng | -
dc.publisher | The University of Hong Kong (Pokfulam, Hong Kong) | -
dc.relation.ispartof | HKU Theses Online (HKUTO) | -
dc.rights | The author retains all proprietary rights, (such as patent rights) and the right to use in future works. | -
dc.rights | This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License. | -
dc.subject.lcsh | Nucleotide sequence - Data processing | -
dc.subject.lcsh | Gene mapping - Data processing | -
dc.title | Genome analyses based on RNA sequences and DNA optical maps | -
dc.type | PG_Thesis | -
dc.description.thesisname | Doctor of Philosophy | -
dc.description.thesislevel | Doctoral | -
dc.description.thesisdiscipline | Computer Science | -
dc.description.nature | published_or_final_version | -
dc.identifier.doi | 10.5353/th_991044019484603414 | -
dc.date.hkucongregation | 2018 | -
dc.identifier.mmsid | 991044019484603414 | -
