Predicting functional impact of nonsynonymous mutations by quantifying conservation information and detect indels using split-read approach

Zeng, Shuai; 曾帥

File Download

FullText.pdf

Links for fulltext

(May Require Subscription)

DOI: 10.5353/th_b5204922

Supplementary

Citations:
Appears in Collections:
- Paediatrics & Adolescent Medicine: Theses
- HKU Theses Online

postgraduate thesis: Predicting functional impact of nonsynonymous mutations by quantifying conservation information and detect indels using split-read approach

Title	Predicting functional impact of nonsynonymous mutations by quantifying conservation information and detect indels using split-read approach
Authors	Zeng, Shuai 曾帥
Advisors	Advisor(s):Lau, YL Yang, W
Issue Date	2014
Publisher	The University of Hong Kong (Pokfulam, Hong Kong)
Citation	Zeng, S. [曾帥]. (2014). Predicting functional impact of nonsynonymous mutations by quantifying conservation information and detect indels using split-read approach. (Thesis). University of Hong Kong, Pokfulam, Hong Kong SAR. Retrieved from http://dx.doi.org/10.5353/th_b5204922
Abstract	The rapidly developing sequencing technology has brought up an opportunity to scientists to look into the detailed genotype information in human genome. Computational programs have played important roles in identifying disease related genomic variants from huge amount of sequencing data. In the past years, a number of computational algorithms have been developed, solving many crucial problems in sequencing data analysis, such as mapping sequencing reads to genome and identifying SNPs. However, many difficult and important issues are still expecting satisfactory solutions. A key challenge is identifying disease related mutations in the background of non-pathogenic polymorphisms. Another crucial problem is detecting INDELs especially the long deletions under the technical limitations of second generation sequencing technology. To predict disease related mutations, we developed a machine learning-based (Random forests) prediction tool, EFIN (Evaluation of Functional Impact of Nonsynonymous mutations). We build A Multiple Sequence Alignment (MSA) for a querying protein with its homologous sequences. MSA is later divided into different blocks according to taxonomic information of the sequences. After that, we quantified the conservation in each block using a number of selected features, for example, entropy, a concept borrowed from information theory. EFIN was trained by Swiss-Prot and HumDiv datasets. By a series of fair comparisons, EFIN showed better results than the widely-used algorithms in terms of AUC (Area under ROC curve), accuracy, specificity and sensitivity. The web-based database is provided to worldwide user at paed.hku.hk/efin. To solve the second problem, we developed Linux-based software, SPLindel that detects deletions (especially long deletions) and insertions using second generation sequencing data. For each sample, SPLindel uses split-read method to detect the candidate INDELs by building alternative references to go along with the reference sequences. And then we remap all the relevant reads using both original references and alternative allele references. A Bayesian model integrating paired-end information was used to assign the reads to the most likely locations on either the original reference allele or the alternative allele. Finally we count the number of reads that support the alternative allele (with insertion or deletions comparing to the original reference allele) and the original allele, and fit a beta-binomial mixture model. Based on this model, the likelihood for each INDEL is calculated and the genotype is predicted. SPLindel runs about the same speed as GATK and DINDEL, but much faster than DINDEL. SPLindel obtained very similar results as GATK and DINDEL for the INDELs of size 1-15 bps, but is much more effective in detecting INDELs of larger size. Using machine learning method and statistical modeling technology, we proposed the tools to solve these two important problems in sequencing data analysis. This study will help identify novel damaging nsSNPs more accurately and efficiently, and equip researcher with more powerful tool in identifying INDELs, especially long deletions. As more and more sequencing data are generated, methods and tools introduced in this thesis may help us extract useful information to facilitate identification of causal mutations to human diseases.
Degree	Doctor of Philosophy
Subject	Medical genetics Mutation (Biology) Nucleotide sequence - Data processing
Dept/Program	Paediatrics and Adolescent Medicine
Persistent Identifier	http://hdl.handle.net/10722/198818
HKU Library Item ID	b5204922

DC Field	Value	Language
dc.contributor.advisor	Lau, YL	-
dc.contributor.advisor	Yang, W	-
dc.contributor.author	Zeng, Shuai	-
dc.contributor.author	曾帥	-
dc.date.accessioned	2014-07-10T04:10:17Z	-
dc.date.available	2014-07-10T04:10:17Z	-
dc.date.issued	2014	-
dc.identifier.citation	Zeng, S. [曾帥]. (2014). Predicting functional impact of nonsynonymous mutations by quantifying conservation information and detect indels using split-read approach. (Thesis). University of Hong Kong, Pokfulam, Hong Kong SAR. Retrieved from http://dx.doi.org/10.5353/th_b5204922	-
dc.identifier.uri	http://hdl.handle.net/10722/198818	-
dc.description.abstract	The rapidly developing sequencing technology has brought up an opportunity to scientists to look into the detailed genotype information in human genome. Computational programs have played important roles in identifying disease related genomic variants from huge amount of sequencing data. In the past years, a number of computational algorithms have been developed, solving many crucial problems in sequencing data analysis, such as mapping sequencing reads to genome and identifying SNPs. However, many difficult and important issues are still expecting satisfactory solutions. A key challenge is identifying disease related mutations in the background of non-pathogenic polymorphisms. Another crucial problem is detecting INDELs especially the long deletions under the technical limitations of second generation sequencing technology. To predict disease related mutations, we developed a machine learning-based (Random forests) prediction tool, EFIN (Evaluation of Functional Impact of Nonsynonymous mutations). We build A Multiple Sequence Alignment (MSA) for a querying protein with its homologous sequences. MSA is later divided into different blocks according to taxonomic information of the sequences. After that, we quantified the conservation in each block using a number of selected features, for example, entropy, a concept borrowed from information theory. EFIN was trained by Swiss-Prot and HumDiv datasets. By a series of fair comparisons, EFIN showed better results than the widely-used algorithms in terms of AUC (Area under ROC curve), accuracy, specificity and sensitivity. The web-based database is provided to worldwide user at paed.hku.hk/efin. To solve the second problem, we developed Linux-based software, SPLindel that detects deletions (especially long deletions) and insertions using second generation sequencing data. For each sample, SPLindel uses split-read method to detect the candidate INDELs by building alternative references to go along with the reference sequences. And then we remap all the relevant reads using both original references and alternative allele references. A Bayesian model integrating paired-end information was used to assign the reads to the most likely locations on either the original reference allele or the alternative allele. Finally we count the number of reads that support the alternative allele (with insertion or deletions comparing to the original reference allele) and the original allele, and fit a beta-binomial mixture model. Based on this model, the likelihood for each INDEL is calculated and the genotype is predicted. SPLindel runs about the same speed as GATK and DINDEL, but much faster than DINDEL. SPLindel obtained very similar results as GATK and DINDEL for the INDELs of size 1-15 bps, but is much more effective in detecting INDELs of larger size. Using machine learning method and statistical modeling technology, we proposed the tools to solve these two important problems in sequencing data analysis. This study will help identify novel damaging nsSNPs more accurately and efficiently, and equip researcher with more powerful tool in identifying INDELs, especially long deletions. As more and more sequencing data are generated, methods and tools introduced in this thesis may help us extract useful information to facilitate identification of causal mutations to human diseases.	-
dc.language	eng	-
dc.publisher	The University of Hong Kong (Pokfulam, Hong Kong)	-
dc.relation.ispartof	HKU Theses Online (HKUTO)	-
dc.rights	This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.	-
dc.rights	The author retains all proprietary rights, (such as patent rights) and the right to use in future works.	-
dc.subject.lcsh	Medical genetics	-
dc.subject.lcsh	Mutation (Biology)	-
dc.subject.lcsh	Nucleotide sequence - Data processing	-
dc.title	Predicting functional impact of nonsynonymous mutations by quantifying conservation information and detect indels using split-read approach	-
dc.type	PG_Thesis	-
dc.identifier.hkul	b5204922	-
dc.description.thesisname	Doctor of Philosophy	-
dc.description.thesislevel	Doctoral	-
dc.description.thesisdiscipline	Paediatrics and Adolescent Medicine	-
dc.description.nature	published_or_final_version	-
dc.identifier.doi	10.5353/th_b5204922	-
dc.identifier.mmsid	991036905439703414	-

File Download

Links for fulltext

(May Require Subscription)

Supplementary

postgraduate thesis: Predicting functional impact of nonsynonymous mutations by quantifying conservation information and detect indels using split-read approach

Export via OAI-PMH Interface in XML Formats

OR

Export to Other Non-XML Formats