
postgraduate thesis: Time-efficient and highly sensitive solutions for large-scale sequence alignments

Title: Time-efficient and highly sensitive solutions for large-scale sequence alignments
Authors: Mai, Huijun [麥慧君]
Advisors: Ting, HF
Issue Date: 2018
Publisher: The University of Hong Kong (Pokfulam, Hong Kong)
Citation
Mai, H. [麥慧君]. (2018). Time-efficient and highly sensitive solutions for large-scale sequence alignments. (Thesis). University of Hong Kong, Pokfulam, Hong Kong SAR.
Abstract: Sequence alignment is a widely used and effective method for exploring the functional characteristics of sequences. As rapid advances in next-generation sequencing (NGS) technologies make more and more biological sequences available for study, analysing newly generated genomic data efficiently has become a challenge. This thesis introduces software solutions for aligning large amounts of DNA sequences efficiently and sensitively. The first software we designed and implemented is LASTM, which addresses the problem of whole-genome alignment. This problem often involves comparing two long genomes with billions of base pairs. Before LASTM, existing tools made the comparison of two large genomes possible and efficient at the cost of sensitivity, but they become very slow when extra sensitivity is needed. LASTM addresses this problem with a simple but effective method that improves the sensitivity of existing whole-genome alignment software without adding much running time. Our second software is AC-DIAMOND, which performs rapid and sensitive DNA-protein alignment. The computational bottlenecks of previous DNA-protein aligners limited their application to aligning large-scale datasets against a protein database. We implemented a time-efficient aligner, AC-DIAMOND, to tackle these bottlenecks. The first version, AC-DIAMOND v0, compresses the reference indexes to reduce the time spent reloading the same reference sequences and reconstructing the same reference indexes. Moreover, AC-DIAMOND v0 exploits SIMD technologies to accelerate the time-consuming dynamic programming process. When aligning large amounts of long reads or assembled contigs to protein databases, AC-DIAMOND v0 gained a 4-fold speed-up. Recently, AC-DIAMOND v0 has been applied in the pathogen detection pipeline MegaPath to solve real clinical problems in a sensitive and fast manner.
To further improve AC-DIAMOND v0, we designed and implemented AC-DIAMOND v1. By using an even more compressed reference index and adopting an adaptive seed-length search, AC-DIAMOND v1 provides a more effective method for locating seeds between the dataset and the protein database. In addition, AC-DIAMOND v1 uses a better SIMD implementation and packing strategy to parallelize the dynamic programming process. With these improvements, AC-DIAMOND v1 cut nearly 40% of the running time of v0 and achieved a 7-fold speed-up over DIAMOND. Most importantly, AC-DIAMOND did not sacrifice sensitivity: it provided sensitivity similar to that of the earlier pioneering aligner DIAMOND.
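The speed-up figures quoted in the abstract can be related with a short calculation. The sketch below is only an illustration: it assumes that the 4-fold speed-up of v0 is measured against DIAMOND (the abstract does not name the baseline for that figure), and the function name `combined_speedup` is hypothetical, not part of any AC-DIAMOND tooling.

```python
def combined_speedup(base_speedup: float, time_saved_fraction: float) -> float:
    """Speed-up over the baseline after a further fractional cut in running time.

    If a tool already runs `base_speedup` times faster than the baseline and a
    new version removes `time_saved_fraction` of that tool's running time, the
    new running time is (1 - time_saved_fraction) / base_speedup of the baseline.
    """
    if not 0.0 <= time_saved_fraction < 1.0:
        raise ValueError("time_saved_fraction must be in [0, 1)")
    if base_speedup <= 0.0:
        raise ValueError("base_speedup must be positive")
    return base_speedup / (1.0 - time_saved_fraction)


# Figures taken from the abstract; the baseline for v0 is an assumption.
v0_speedup = 4.0      # v0 speed-up, assumed to be relative to DIAMOND
v1_time_saved = 0.40  # v1 saves "nearly 40%" of the running time of v0

v1_speedup = combined_speedup(v0_speedup, v1_time_saved)
print(f"{v1_speedup:.2f}")  # 6.67
```

Under that assumption, the two figures give roughly a 6.7-fold speed-up, which is in line with the reported 7-fold speed-up over DIAMOND.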
Degree: Doctor of Philosophy
Subject: Sequence alignment (Bioinformatics)
Dept/Program: Computer Science
Persistent Identifier: http://hdl.handle.net/10722/261462

 

DC Field | Value | Language
dc.contributor.advisor | Ting, HF | -
dc.contributor.author | Mai, Huijun | -
dc.contributor.author | 麥慧君 | -
dc.date.accessioned | 2018-09-20T06:43:46Z | -
dc.date.available | 2018-09-20T06:43:46Z | -
dc.date.issued | 2018 | -
dc.identifier.citation | Mai, H. [麥慧君]. (2018). Time-efficient and highly sensitive solutions for large-scale sequence alignments. (Thesis). University of Hong Kong, Pokfulam, Hong Kong SAR. | -
dc.identifier.uri | http://hdl.handle.net/10722/261462 | -
dc.language | eng | -
dc.publisher | The University of Hong Kong (Pokfulam, Hong Kong) | -
dc.relation.ispartof | HKU Theses Online (HKUTO) | -
dc.rights | The author retains all proprietary rights, (such as patent rights) and the right to use in future works. | -
dc.rights | This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License. | -
dc.subject.lcsh | Sequence alignment (Bioinformatics) | -
dc.title | Time-efficient and highly sensitive solutions for large-scale sequence alignments | -
dc.type | PG_Thesis | -
dc.description.thesisname | Doctor of Philosophy | -
dc.description.thesislevel | Doctoral | -
dc.description.thesisdiscipline | Computer Science | -
dc.description.nature | published_or_final_version | -
dc.identifier.doi | 10.5353/th_991044040580903414 | -
dc.date.hkucongregation | 2018 | -
dc.identifier.mmsid | 991044040580903414 | -
