
postgraduate thesis: Discovery of novel genomic sequences and DNA methylated transcriptomic analysis

Title: Discovery of novel genomic sequences and DNA methylated transcriptomic analysis
Authors: Li, Qiuhui (李秋慧)
Advisors: Luo, R; Lam, TW
Issue Date: 2022
Publisher: The University of Hong Kong (Pokfulam, Hong Kong)
Citation: Li, Q. [李秋慧]. (2022). Discovery of novel genomic sequences and DNA methylated transcriptomic analysis. (Thesis). University of Hong Kong, Pokfulam, Hong Kong SAR.
Abstract: In genomics, a complete reference genome is critical for decoding genetic variation and its association with human disease. However, current reference genomes are built from only a few individuals and so under-represent human populations. This thesis focuses on undiscovered sequences, aiming to increase the diversity of reference genomes and to uncover the biological significance of the detected sequences.

DNA sequences absent from the human reference genome are classified as novel sequences. Discovering these missed sequences is important for exploring the genomic diversity of populations and determining the genetic basis of human diseases. I extracted novel sequences from the short-read sequencing data of 486 Han Chinese genomes and used them to construct a Chinese pan-genome totaling 276 Mbp. I then classified these sequences into individual-specific and common components and proposed that the size of the common component is uncapped as the population grows. The 46.646 Mbp of common sequences obtained from the 486 individuals improved variant-calling accuracy and mapping rate when added to the reference genome. I also analyzed the placements of these common sequences and found that they came from genomic regions characterized by high mutation rates and low pathogenicity. This study authenticates the Chinese pan-genome as representative of DNA sequences specific to the Han Chinese population that are missing from the GRCh38 reference genome, and suggests the newly defined common sequences as candidates to supplement the current human reference.

However, short-read sequencing still misses a considerable amount of genomic information. To build a more complete picture of novel sequences, I designed an Assembly-Free Novel Sequence (AF-NS) approach to discover novel sequences from Oxford Nanopore Technologies (ONT) long reads. Among the sequences newly detected with AF-NS, more than 95% are missed by approaches using long-read assemblers, and 85% are not present in Illumina short reads. I then identified the novel sequences common to all the samples and revealed their association with the binding motifs of transcription factors. Regarding the placements of the novel sequences, I found that about 70% are enriched in repeat regions, and generated 430 for one specific subpopulation that might be related to their evolution. My analysis demonstrates the advantage of the Assembly-Free approach over assembler-based methods in capturing more novel sequences. Combining long-read data with powerful analytical methods is a robust way to improve the completeness of novel DNA sequences.

It is well known that epigenetic modification of genomic sequences can substantially affect gene transcription. Therefore, I extended my analysis to the transcriptome, aiming to unravel the effect of DNA methylation on pre-mRNA processing. I devised TC-RENO, a genome-wide de novo isoform detection method for ONT direct RNA-sequencing data. This approach eliminates the dependency on reference transcripts, enabling the profiling and characterization of isoforms in unannotated regions. Using TC-RENO, I identified 93,513 isoforms from Arabidopsis wild-type and three DNA methylation-related mutants, approximately twice the number of reference transcripts. Combining these isoforms with DNA methylation data, I revealed that DNA methylation plays a key role in selecting transcription initiation and termination sites and splicing sites, and in determining poly(A) tail length in Arabidopsis.
Degree: Doctor of Philosophy
Subject: Genomes; DNA - Methylation
Dept/Program: Computer Science
Persistent Identifier: http://hdl.handle.net/10722/332073

 

DC Field | Value | Language
dc.contributor.advisor | Luo, R | -
dc.contributor.advisor | Lam, TW | -
dc.contributor.author | Li, Qiuhui | -
dc.contributor.author | 李秋慧 | -
dc.date.accessioned | 2023-09-29T04:40:19Z | -
dc.date.available | 2023-09-29T04:40:19Z | -
dc.date.issued | 2022 | -
dc.identifier.citation | Li, Q. [李秋慧]. (2022). Discovery of novel genomic sequences and DNA methylated transcriptomic analysis. (Thesis). University of Hong Kong, Pokfulam, Hong Kong SAR. | -
dc.identifier.uri | http://hdl.handle.net/10722/332073 | -
dc.language | eng | -
dc.publisher | The University of Hong Kong (Pokfulam, Hong Kong) | -
dc.relation.ispartof | HKU Theses Online (HKUTO) | -
dc.rights | The author retains all proprietary rights, (such as patent rights) and the right to use in future works. | -
dc.rights | This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License. | -
dc.subject.lcsh | Genomes | -
dc.subject.lcsh | DNA - Methylation | -
dc.title | Discovery of novel genomic sequences and DNA methylated transcriptomic analysis | -
dc.type | PG_Thesis | -
dc.description.thesisname | Doctor of Philosophy | -
dc.description.thesislevel | Doctoral | -
dc.description.thesisdiscipline | Computer Science | -
dc.description.nature | published_or_final_version | -
dc.date.hkucongregation | 2022 | -
dc.identifier.mmsid | 991044609109403414 | -
