
postgraduate thesis: Computational methods to bridge genomics data with medical applications

Title: Computational methods to bridge genomics data with medical applications
Authors: Su, Junhao (苏俊豪)
Issue Date: 2024
Publisher: The University of Hong Kong (Pokfulam, Hong Kong)
Citation: Su, J. [苏俊豪]. (2024). Computational methods to bridge genomics data with medical applications. (Thesis). University of Hong Kong, Pokfulam, Hong Kong SAR.
Abstract: Recent advances in third-generation sequencing (TGS) technologies have significantly improved the detection of a broad spectrum of genetic variants. However, the application of TGS data in medicine is largely hindered by the high error rate of Oxford Nanopore Technologies (ONT) sequencing, which also limits the development of tools for sequencing data analysis. This thesis addresses these challenges and provides solutions for medical applications. It focuses on developing methods for 1) accurately identifying variants, 2) applying variant calling to different medical questions, and 3) identifying relationships between genotype and phenotype. Accurate identification of genetic variants is crucial in genome-based genetic studies. Existing approaches to calling variants from ONT family trio data suffer from low detection accuracy because they treat trio variant calling as three independent tasks. To address this problem, this thesis develops Clair3-Trio, the first variant caller specifically tailored to family trio data from ONT long reads. Clair3-Trio employs a Trio-to-Trio deep neural network model that takes trio sequencing information as input and outputs all predicted variants for the trio from a single model. To further improve accuracy, this thesis introduces the MCVLoss function, which leverages an explicit encoding of Mendelian inheritance. In benchmarks, Clair3-Trio shows large improvements over existing methods, predicting 85% fewer Mendelian inheritance violations. This thesis also establishes the Trio-to-Trio model as the optimal solution to trio variant calling. To facilitate the translation of variant calling into medical applications, this thesis studies two of the most common infectious disease-causing microbes worldwide: Mycobacterium tuberculosis (TB) and Human Immunodeficiency Virus (HIV). The amount of microbial DNA in metagenomic samples can be low. To address this problem, this thesis applies ONT MinION adaptive sequencing and builds a pipeline named ONT-TB-NF for clinical TB detection and drug resistance prediction. The results indicate that ONT-TB-NF efficiently enriches TB, with a more than 2.2-fold increase in detection rate. Microbes such as HIV, on the other hand, are present as many quasispecies in sequenced samples because of their high mutation rate. To analyze HIV data, this thesis introduces ClusterV, which employs iterative variant calling and hierarchical clustering to identify quasispecies and to create comprehensive drug resistance reports. ClusterV is validated on real datasets, achieving an F1-score of 0.964 for variant calling. Together, these computational methods accurately analyze the genomes of infectious disease-causing microbes and show potential for clinical application. To enhance the clinical interpretation of genetic variants, this thesis introduces RENET2, a deep learning-based relation extraction method. RENET2 uses section filtering and model ensembling to extract gene-disease associations from full-text articles. Compared to existing methods, RENET2 achieves a 20% increase in F1-score for extracting gene-disease associations from an annotated full-text dataset. This thesis applies RENET2 to 1) build a dataset containing approximately 3.72M gene-disease associations from around 1.89M full-text articles in PubMed Central and 2) explore all genes associated with COVID-19 by mining gene-disease relations from the LitCovid dataset. With RENET2, the interpretation of genetic variants has been significantly expanded.
Degree: Doctor of Philosophy
Subject: Nucleotide sequence - Data processing
Dept/Program: Computer Science
Persistent Identifier: http://hdl.handle.net/10722/335159

 

| DC Field | Value | Language |
| --- | --- | --- |
| dc.contributor.author | Su, Junhao | - |
| dc.contributor.author | 苏俊豪 | - |
| dc.date.accessioned | 2023-11-13T07:45:03Z | - |
| dc.date.available | 2023-11-13T07:45:03Z | - |
| dc.date.issued | 2024 | - |
| dc.identifier.citation | Su, J. [苏俊豪]. (2024). Computational methods to bridge genomics data with medical applications. (Thesis). University of Hong Kong, Pokfulam, Hong Kong SAR. | - |
| dc.identifier.uri | http://hdl.handle.net/10722/335159 | - |
| dc.description.abstract | Identical to the Abstract field above | - |
| dc.language | eng | - |
| dc.publisher | The University of Hong Kong (Pokfulam, Hong Kong) | - |
| dc.relation.ispartof | HKU Theses Online (HKUTO) | - |
| dc.rights | The author retains all proprietary rights, (such as patent rights) and the right to use in future works. | - |
| dc.rights | This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License. | - |
| dc.subject.lcsh | Nucleotide sequence - Data processing | - |
| dc.title | Computational methods to bridge genomics data with medical applications | - |
| dc.type | PG_Thesis | - |
| dc.description.thesisname | Doctor of Philosophy | - |
| dc.description.thesislevel | Doctoral | - |
| dc.description.thesisdiscipline | Computer Science | - |
| dc.description.nature | published_or_final_version | - |
| dc.date.hkucongregation | 2024 | - |
| dc.identifier.mmsid | 991044736607603414 | - |
