postgraduate thesis: Computational methods to bridge genomics data with medical applications
Title | Computational methods to bridge genomics data with medical applications |
---|---|
Authors | Su, Junhao (苏俊豪) |
Issue Date | 2024 |
Publisher | The University of Hong Kong (Pokfulam, Hong Kong) |
Citation | Su, J. [苏俊豪]. (2024). Computational methods to bridge genomics data with medical applications. (Thesis). University of Hong Kong, Pokfulam, Hong Kong SAR. |
Abstract | Recent advances in third-generation sequencing (TGS) technologies have significantly improved the detection of a broad spectrum of genetic variants. However, the application of TGS data in medical fields is largely hindered by the high error rate of Oxford Nanopore Technologies (ONT) sequencing, which also limits the development of tools for sequencing data analysis. This thesis addresses these challenges by developing methods for 1) accurate identification of variants, 2) application to different medical questions, and 3) identification of relationships between genotype and phenotype.
Accurate identification of genetic variants is crucial in genome-based genetic studies. Existing approaches to variant calling from ONT family trio data suffer from low detection accuracy because they treat trio variant calling as three independent tasks. To address this problem, this thesis develops Clair3-Trio, the first variant caller specifically tailored for family trio data from ONT long reads. Clair3-Trio employs a Trio-to-Trio deep neural network that takes the trio's sequencing information as input and outputs all predicted variants for the trio from a single model. To further improve accuracy, this thesis introduces the MCVLoss function, which leverages an explicit encoding of Mendelian inheritance. In benchmarks, Clair3-Trio shows large improvements over existing methods, predicting 85% fewer Mendelian inheritance violations. This thesis also establishes the Trio-to-Trio model as the optimal solution to trio variant calling.
To facilitate the translation of variant calling into medical applications, this thesis studies two of the most common infectious disease-causing microbes worldwide: Mycobacterium tuberculosis (TB) and Human Immunodeficiency Virus (HIV). The amount of microbial DNA in metagenomic samples can be low. To address this problem, this thesis applies ONT MinION adaptive sequencing and builds a pipeline named ONT-TB-NF for clinical TB detection and drug resistance prediction. The results indicate that ONT-TB-NF efficiently enriches TB, with an over 2.2-fold increase in detection rate. Microbes such as HIV, on the other hand, have many quasispecies in sequenced samples because of their high mutation rate. To analyze HIV data, this thesis introduces ClusterV, which employs iterative variant calling and hierarchical clustering to identify quasispecies and to create comprehensive drug resistance reports. The accuracy of ClusterV is validated on real datasets, where it achieves an F1-score of 0.964 for variant calling. Together, these computational methods accurately analyze the genomes of infectious disease-causing microbes and show potential for clinical application.
To enhance the clinical interpretation of genetic variants, this thesis introduces RENET2, a deep learning-based relation extraction method. RENET2 implements section filtering and model ensembling to extract gene-disease associations from full-text articles. Compared to existing methods, RENET2 improves the F1-score by 20% when extracting gene-disease associations from an annotated full-text dataset. This thesis applies RENET2 to 1) build a dataset containing approximately 3.72M gene-disease associations from around 1.89M full-text articles in PubMed Central and 2) explore all genes associated with COVID-19 by mining gene-disease relations from the LitCovid dataset. With RENET2, the interpretation of genetic variants has been significantly expanded. |
Degree | Doctor of Philosophy |
Subject | Nucleotide sequence - Data processing |
Dept/Program | Computer Science |
Persistent Identifier | http://hdl.handle.net/10722/335159 |
DC Field | Value | Language |
---|---|---|
dc.contributor.author | Su, Junhao | - |
dc.contributor.author | 苏俊豪 | - |
dc.date.accessioned | 2023-11-13T07:45:03Z | - |
dc.date.available | 2023-11-13T07:45:03Z | - |
dc.date.issued | 2024 | - |
dc.identifier.citation | Su, J. [苏俊豪]. (2024). Computational methods to bridge genomics data with medical applications. (Thesis). University of Hong Kong, Pokfulam, Hong Kong SAR. | - |
dc.identifier.uri | http://hdl.handle.net/10722/335159 | - |
dc.language | eng | - |
dc.publisher | The University of Hong Kong (Pokfulam, Hong Kong) | - |
dc.relation.ispartof | HKU Theses Online (HKUTO) | - |
dc.rights | The author retains all proprietary rights, (such as patent rights) and the right to use in future works. | - |
dc.rights | This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License. | - |
dc.subject.lcsh | Nucleotide sequence - Data processing | - |
dc.title | Computational methods to bridge genomics data with medical applications | - |
dc.type | PG_Thesis | - |
dc.description.thesisname | Doctor of Philosophy | - |
dc.description.thesislevel | Doctoral | - |
dc.description.thesisdiscipline | Computer Science | - |
dc.description.nature | published_or_final_version | - |
dc.date.hkucongregation | 2024 | - |
dc.identifier.mmsid | 991044736607603414 | - |