Improving the quality of large scale multiple sequence alignment

Xie, Fangping; 谢芳萍

Appears in Collections:
- HKU Theses Online
- Computer Science: Theses

postgraduate thesis: Improving the quality of large scale multiple sequence alignment

Title	Improving the quality of large scale multiple sequence alignment
Authors	Xie, Fangping 谢芳萍
Issue Date	2016
Publisher	The University of Hong Kong (Pokfulam, Hong Kong)
Citation	Xie, F. [谢芳萍]. (2016). Improving the quality of large scale multiple sequence alignment. (Thesis). University of Hong Kong, Pokfulam, Hong Kong SAR.
Abstract	Biological sequences have accumulated rapidly with the development of sequencing technologies. In evolutionary biology, a fundamental problem is to reconstruct the phylogenetic relationships of organisms from a set of sequences. Multiple sequence alignment (MSA) is necessary not only for phylogenetic reconstruction but also for many other kinds of evolutionary analysis, such as motif detection and protein structure prediction. In this thesis, we address the challenge of accurately and efficiently constructing multiple sequence alignments for a large number of sequences. We introduce a novel approach, Guide Tree Decomposition and Outliers Detection (DAO), which adapts the decomposition technique because it makes alignment of ultra-large datasets possible. A decomposition-based method divides the sequences into subsets using a guide tree, obtains an alignment (sub-MSA) for each subset, and then merges these alignments. Decomposition is recursive and does not stop until the current tree is small enough, e.g. smaller than 1,800 sequences. The main improvement of DAO is that sequences are divided into subsets following the centroid path of the guide tree, which we call centroid path decomposition. Sub-MSAs are merged hierarchically in the reverse order of the edge removals performed during decomposition. Unlike the methods implemented in some existing tools, centroid path decomposition guarantees that the merging follows the order of the original guide tree. In families with tens of thousands of sequences, outlier sequences may be present and disrupt the construction of the multiple sequence alignment. Outliers in a family may come from a variety of sources. They differ markedly from the majority of sequences and can be distinguished by their positions in the guide tree. We propose a simple method to deal with outliers in large-scale datasets. In DAO, sequences in small branches of the guide tree, e.g. subsets with fewer than 300 sequences, are identified as outliers during the decomposition step. DAO combines centroid path decomposition with outlier detection, and is evaluated on standard protein benchmark databases, including HomFam and OXFam. The results show that DAO gains a clear advantage in accuracy on these benchmarks. All-to-all pairwise dynamic programming (DP) can be used to improve the accuracy of MSA by building a relatively rigorous guide tree. Unfortunately, this accurate algorithm does not scale, because its computational resource requirements are impractical for datasets with tens of thousands of sequences. We extend DAO by constructing a more accurate guide tree with an external phylogenetic tree program, Fasttree-2; the extended method is named DAO_ml. Fasttree-2 uses approximate maximum likelihood to generate a phylogenetic tree from an input alignment produced by existing tools. The remaining steps of DAO_ml are the same as in DAO. Experimental results show that DAO_ml significantly improves on the accuracy of DAO. Although the alignment quality of DAO_ml is not as good as that of the accurate algorithm mentioned above, DAO_ml has a large advantage in computational resource requirements.
Degree	Master of Philosophy
Subject	Sequence alignment (Bioinformatics)
Dept/Program	Computer Science
Persistent Identifier	http://hdl.handle.net/10722/241433
HKU Library Item ID	b5864198
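
Illustrative note (not part of the thesis record): the decomposition step described in the abstract can be sketched roughly as follows. This is a minimal sketch under stated assumptions, not the DAO implementation. It assumes that the centroid path is followed by repeatedly descending into the child with the most leaves, that every branch cut off the path is either recursively decomposed or, if it holds fewer than `min_size` sequences, treated as a set of outliers, and that decomposition stops once the current subtree holds at most `max_size` sequences. The names `Node`, `leaf_names`, `decompose`, `max_size` and `min_size` are hypothetical; the abstract gives 1,800 and 300 as example thresholds, while the toy call below uses much smaller values.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Guide tree node; a node without children is a leaf (one sequence)."""
    name: str = ""
    children: list = field(default_factory=list)


def leaf_names(node):
    """Return the sequence names below `node`, left to right."""
    if not node.children:
        return [node.name]
    names = []
    for child in node.children:
        names.extend(leaf_names(child))
    return names


def decompose(root, max_size, min_size):
    """Split a guide tree into alignable subsets and outliers.

    Assumed rule (hypothetical, see lead-in): walk the centroid path,
    cut off every side branch, flag side branches with fewer than
    `min_size` leaves as outliers, and recurse into the others.
    Leaf counts are recomputed on each step for clarity, not speed.
    """
    subsets, outliers = [], []

    def visit(node):
        current = node
        while True:
            names = leaf_names(current)
            # Small enough (or a single leaf): keep as one subset.
            if len(names) <= max_size or not current.children:
                subsets.append(names)
                return
            # Next centroid-path node: the child with the most leaves.
            heavy = max(current.children, key=lambda c: len(leaf_names(c)))
            for child in current.children:
                if child is heavy:
                    continue
                side = leaf_names(child)
                if len(side) < min_size:
                    outliers.extend(side)  # small branch: outliers
                else:
                    visit(child)           # large branch: decompose again
            current = heavy

    visit(root)
    return subsets, outliers


# Toy guide tree: (((a, b, c), (d, e)), o1)
def leaf(name):
    return Node(name)


tree = Node(children=[
    Node(children=[
        Node(children=[leaf("a"), leaf("b"), leaf("c")]),
        Node(children=[leaf("d"), leaf("e")]),
    ]),
    leaf("o1"),
])

subsets, outliers = decompose(tree, max_size=3, min_size=2)
print(subsets, outliers)
```

In this toy tree the single sequence hanging directly off the root is flagged as an outlier, and the two remaining branches become separate subsets. Merging the sub-MSAs in reverse order of the cuts, as the abstract describes, is not shown.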

DC Field	Value	Language
dc.contributor.author	Xie, Fangping	-
dc.contributor.author	谢芳萍	-
dc.date.accessioned	2017-06-13T02:07:52Z	-
dc.date.available	2017-06-13T02:07:52Z	-
dc.date.issued	2016	-
dc.identifier.citation	Xie, F. [谢芳萍]. (2016). Improving the quality of large scale multiple sequence alignment. (Thesis). University of Hong Kong, Pokfulam, Hong Kong SAR.	-
dc.identifier.uri	http://hdl.handle.net/10722/241433	-
dc.description.abstract	Biological sequences have accumulated rapidly with the development of sequencing technologies. In evolutionary biology, a fundamental problem is to reconstruct the phylogenetic relationships of organisms from a set of sequences. Multiple sequence alignment (MSA) is necessary not only for phylogenetic reconstruction but also for many other kinds of evolutionary analysis, such as motif detection and protein structure prediction. In this thesis, we address the challenge of accurately and efficiently constructing multiple sequence alignments for a large number of sequences. We introduce a novel approach, Guide Tree Decomposition and Outliers Detection (DAO), which adapts the decomposition technique because it makes alignment of ultra-large datasets possible. A decomposition-based method divides the sequences into subsets using a guide tree, obtains an alignment (sub-MSA) for each subset, and then merges these alignments. Decomposition is recursive and does not stop until the current tree is small enough, e.g. smaller than 1,800 sequences. The main improvement of DAO is that sequences are divided into subsets following the centroid path of the guide tree, which we call centroid path decomposition. Sub-MSAs are merged hierarchically in the reverse order of the edge removals performed during decomposition. Unlike the methods implemented in some existing tools, centroid path decomposition guarantees that the merging follows the order of the original guide tree. In families with tens of thousands of sequences, outlier sequences may be present and disrupt the construction of the multiple sequence alignment. Outliers in a family may come from a variety of sources. They differ markedly from the majority of sequences and can be distinguished by their positions in the guide tree. We propose a simple method to deal with outliers in large-scale datasets. In DAO, sequences in small branches of the guide tree, e.g. subsets with fewer than 300 sequences, are identified as outliers during the decomposition step. DAO combines centroid path decomposition with outlier detection, and is evaluated on standard protein benchmark databases, including HomFam and OXFam. The results show that DAO gains a clear advantage in accuracy on these benchmarks. All-to-all pairwise dynamic programming (DP) can be used to improve the accuracy of MSA by building a relatively rigorous guide tree. Unfortunately, this accurate algorithm does not scale, because its computational resource requirements are impractical for datasets with tens of thousands of sequences. We extend DAO by constructing a more accurate guide tree with an external phylogenetic tree program, Fasttree-2; the extended method is named DAO_ml. Fasttree-2 uses approximate maximum likelihood to generate a phylogenetic tree from an input alignment produced by existing tools. The remaining steps of DAO_ml are the same as in DAO. Experimental results show that DAO_ml significantly improves on the accuracy of DAO. Although the alignment quality of DAO_ml is not as good as that of the accurate algorithm mentioned above, DAO_ml has a large advantage in computational resource requirements.	-
dc.language	eng	-
dc.publisher	The University of Hong Kong (Pokfulam, Hong Kong)	-
dc.relation.ispartof	HKU Theses Online (HKUTO)	-
dc.rights	This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.	-
dc.rights	The author retains all proprietary rights, (such as patent rights) and the right to use in future works.	-
dc.subject.lcsh	Sequence alignment (Bioinformatics)	-
dc.title	Improving the quality of large scale multiple sequence alignment	-
dc.type	PG_Thesis	-
dc.identifier.hkul	b5864198	-
dc.description.thesisname	Master of Philosophy	-
dc.description.thesislevel	Master	-
dc.description.thesisdiscipline	Computer Science	-
dc.description.nature	published_or_final_version	-
dc.identifier.mmsid	991026391209703414	-
