postgraduate thesis: Improving the quality of large scale multiple sequence alignment

Title: Improving the quality of large scale multiple sequence alignment
Authors: Xie, Fangping (谢芳萍)
Issue Date: 2016
Publisher: The University of Hong Kong (Pokfulam, Hong Kong)
Citation: Xie, F. [谢芳萍]. (2016). Improving the quality of large scale multiple sequence alignment. (Thesis). University of Hong Kong, Pokfulam, Hong Kong SAR.
Abstract: Biological sequences have accumulated rapidly with the development of sequencing technologies. A fundamental question in evolutionary biology is how to reconstruct the phylogenetic relationships of organisms from a set of sequences. Multiple sequence alignment (MSA) is necessary for phylogenetic reconstruction and for many other kinds of evolutionary analysis, such as motif detection and protein structure prediction. In this thesis, we address the challenge of accurately and efficiently constructing multiple sequence alignments for a large number of sequences. We introduce a novel approach, Guide Tree Decomposition and Outliers Detection (DAO), which builds on the decomposition technique because it makes alignment of ultra-large datasets possible. Decomposition-based methods divide sequences into subsets using a guide tree, obtain an alignment (sub-MSA) for each subset, and then merge these alignments. Decomposition is recursive and does not stop until the current tree is small enough, e.g. smaller than 1,800 sequences. The main improvement of DAO is that sequences are divided into subsets following the centroid path of the guide tree, which we call centroid path decomposition. Sub-MSAs are merged hierarchically in the reverse order of the edge removals made during decomposition. Unlike the methods implemented in some existing tools, centroid path decomposition guarantees that merging follows the order of the original guide tree. For families with tens of thousands of sequences, outlier sequences may be present and disrupt the construction of the multiple sequence alignment. Outliers in a family may come from a variety of sources. They differ substantially from the majority of sequences and can be distinguished by their positions in the guide tree. We propose a simple method to deal with outliers in large-scale datasets.
In DAO, sequences in small branches of the guide tree, e.g. subsets with fewer than 300 sequences, are identified as outliers during the decomposition step. DAO combines centroid path decomposition with outlier detection, and it is evaluated on standard protein benchmark databases, including HomFam and OXFam. The results show that DAO has a clear advantage in accuracy on these benchmarks. All-to-all pairwise dynamic programming (DP) can be used to improve the accuracy of MSA by building a relatively rigorous guide tree. Unfortunately, this accurate algorithm does not scale, because its computational resource requirements are impractical for datasets with tens of thousands of sequences. We extend DAO by constructing a more accurate guide tree with the external phylogenetic tree software Fasttree-2; the extended method is named DAO_ml. Fasttree-2 uses approximate maximum likelihood to generate a phylogenetic tree from an alignment produced by existing tools. The remaining steps of DAO_ml are the same as in DAO. Experimental results show that DAO_ml significantly improves on the accuracy of DAO. Although the alignment quality of DAO_ml is not as good as that of the accurate algorithm mentioned above, DAO_ml has a large advantage in computational resource requirements.
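
The decomposition and outlier rules described in the abstract can be illustrated with a minimal Python sketch. It is not the thesis's implementation: the Node class, the decompose function, and the definition of the centroid path used here (from the root, repeatedly step into the child with the most leaves) are assumptions made for illustration. The default thresholds 1,800 and 300 are the example values quoted in the abstract.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class Node:
    """Hypothetical guide-tree node; a leaf carries a sequence name."""
    name: Optional[str] = None
    children: List["Node"] = field(default_factory=list)


def leaves(node: Node) -> List[str]:
    """Collect the leaf names under `node`, left to right (iterative)."""
    out, stack = [], [node]
    while stack:
        cur = stack.pop()
        if cur.children:
            stack.extend(reversed(cur.children))
        else:
            out.append(cur.name)
    return out


def decompose(root: Node, max_size: int = 1800,
              min_size: int = 300) -> Tuple[List[List[str]], List[str]]:
    """Split a guide tree into alignment subsets and outlier sequences.

    Assumed rule: walk from the current subtree root into its largest
    child, cutting off every other child along the way. A cut-off branch
    with fewer than `min_size` leaves is treated as a set of outliers;
    a larger one is queued and decomposed in the same way.
    """
    subsets: List[List[str]] = []
    outliers: List[str] = []
    pending = [root]
    while pending:
        cur = pending.pop()
        # Follow the path until the remaining subtree is small enough.
        # Leaf counts are recomputed for clarity; a real tool would cache them.
        while cur.children and len(leaves(cur)) >= max_size:
            ranked = sorted(cur.children,
                            key=lambda c: len(leaves(c)), reverse=True)
            for branch in ranked[1:]:
                names = leaves(branch)
                if len(names) < min_size:
                    outliers.extend(names)
                else:
                    pending.append(branch)
            cur = ranked[0]
        subsets.append(leaves(cur))
    return subsets, outliers
```

Each returned subset would then be aligned separately to give a sub-MSA. The sketch does not cover the merging step; recording the order in which branches are cut would be one way to replay the merges in reverse, as the abstract describes.
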
Degree: Master of Philosophy
Subject: Sequence alignment (Bioinformatics)
Dept/Program: Computer Science
Persistent Identifier: http://hdl.handle.net/10722/241433
HKU Library Item ID: b5864198

 

DC Field | Value | Language
dc.contributor.author | Xie, Fangping | -
dc.contributor.author | 谢芳萍 | -
dc.date.accessioned | 2017-06-13T02:07:52Z | -
dc.date.available | 2017-06-13T02:07:52Z | -
dc.date.issued | 2016 | -
dc.identifier.citation | Xie, F. [谢芳萍]. (2016). Improving the quality of large scale multiple sequence alignment. (Thesis). University of Hong Kong, Pokfulam, Hong Kong SAR. | -
dc.identifier.uri | http://hdl.handle.net/10722/241433 | -
dc.language | eng | -
dc.publisher | The University of Hong Kong (Pokfulam, Hong Kong) | -
dc.relation.ispartof | HKU Theses Online (HKUTO) | -
dc.rights | This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License. | -
dc.rights | The author retains all proprietary rights, (such as patent rights) and the right to use in future works. | -
dc.subject.lcsh | Sequence alignment (Bioinformatics) | -
dc.title | Improving the quality of large scale multiple sequence alignment | -
dc.type | PG_Thesis | -
dc.identifier.hkul | b5864198 | -
dc.description.thesisname | Master of Philosophy | -
dc.description.thesislevel | Master | -
dc.description.thesisdiscipline | Computer Science | -
dc.description.nature | published_or_final_version | -
dc.identifier.mmsid | 991026391209703414 | -
