Using a multi-staged strategy based on machine learning and mathematical modeling to predict genotype-phenotype risk patterns in diabetic kidney disease: A prospective case-control cohort analysis

Leung, Ross K K; Wang, Ying; Ma, Ronald C W; Luk, Andrea O Y; Lam, Vincent; Ng, Maggie; So, Wing Yee; Tsui, Stephen K W; Chan, Juliana C N

Publisher Website (DOI): 10.1186/1471-2369-14-162
Scopus: eid_2-s2.0-84880335873
PMID: 23879411
WOS: WOS:000322376400001

Appears in Collections:
- Public Health: Journal/Magazine Articles

Title	Using a multi-staged strategy based on machine learning and mathematical modeling to predict genotype-phenotype risk patterns in diabetic kidney disease: A prospective case-control cohort analysis
Authors	Leung, Ross K K; Wang, Ying; Ma, Ronald C W; Luk, Andrea O Y; Lam, Vincent; Ng, Maggie; So, Wing Yee; Tsui, Stephen K W; Chan, Juliana C N
Keywords	Random forest; Support vector machine; Diabetic kidney disease; Machine learning; Genotypes; Phenotypes; Prediction
Issue Date	2013
Citation	BMC Nephrology, 2013, v. 14, n. 1. DOI: http://dx.doi.org/10.1186/1471-2369-14-162
Abstract	Background: Multi-causality and heterogeneity of phenotypes and genotypes characterize complex diseases. In a database with a comprehensive collection of phenotypes and genotypes, we compared the performance of common machine learning methods to generate mathematical models to predict diabetic kidney disease (DKD). Methods: In a prospective cohort of type 2 diabetic patients, we selected 119 subjects with DKD and 554 without DKD at enrolment and after a median follow-up period of 7.8 years for model training, testing and validation using seven machine learning methods (partial least squares regression, the classification and regression tree, the C5.0 decision tree, random forest, naïve Bayes classification, neural network and support vector machine). We used 17 clinical attributes and 70 single nucleotide polymorphisms (SNPs) of 54 candidate genes to build different models. The top attributes selected by the best-performing models were then used to build models with performance comparable to those using the entire dataset. Results: Age, age of diagnosis, systolic blood pressure and genetic polymorphisms of uteroglobin and lipid metabolism were selected by most methods. Models generated by support vector machine (svmRadial) and random forest (cforest) had the best prediction accuracy, whereas models derived from the naïve Bayes classifier and partial least squares regression had the poorest performance. Using 10 clinical attributes (systolic and diastolic blood pressure, age, age of diagnosis, triglyceride, white blood cell count, total cholesterol, waist to hip ratio, LDL cholesterol, and alcohol intake) and 5 genetic attributes (UGB G38A, LIPC -514C > T, APOB Thr71Ile, APOC3 3206T > G and APOC3 1100C > T), selected most often by SVM and cforest, we were able to build high-performance models. Conclusions: Amongst different machine learning methods, svmRadial and cforest had the best performance. Genetic polymorphisms related to inflammation and lipid metabolism warrant further investigation for their associations with DKD. © 2013 Leung et al.; licensee BioMed Central Ltd.
Persistent Identifier	http://hdl.handle.net/10722/222143
ISI Accession Number ID	WOS:000322376400001
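
The abstract describes keeping the attributes that were selected most often by the best-performing models. A minimal sketch of that kind of tally is given here; it is not the authors' code. The function name `most_often_selected` and the threshold `min_models` are introduced only for illustration, and the per-model lists are hypothetical placeholders rather than study results.

```python
from collections import Counter


def most_often_selected(selections, min_models):
    """Return (attribute, count) pairs for attributes chosen by at least
    `min_models` models, most frequently selected first."""
    counts = Counter()
    for attrs in selections.values():
        # Count each attribute at most once per model.
        counts.update(set(attrs))
    kept = [(attr, n) for attr, n in counts.items() if n >= min_models]
    # Sort by descending count, then alphabetically for a stable order.
    return sorted(kept, key=lambda item: (-item[1], item[0]))


# Hypothetical top-attribute lists, for illustration only (not study results).
example_selections = {
    "svmRadial": ["age", "systolic_bp", "UGB_G38A"],
    "cforest": ["age", "systolic_bp", "triglyceride"],
    "C5.0": ["age", "waist_hip_ratio"],
}

if __name__ == "__main__":
    for attr, n in most_often_selected(example_selections, min_models=2):
        print(f"{attr}: selected by {n} models")
```

In practice the input lists would come from each fitted model's own variable-importance ranking, and the retained attributes would then be used to refit a smaller model for comparison against the full one.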

DC Field	Value	Language
dc.contributor.author	Leung, Ross K K	-
dc.contributor.author	Wang, Ying	-
dc.contributor.author	Ma, Ronald C W	-
dc.contributor.author	Luk, Andrea O Y	-
dc.contributor.author	Lam, Vincent	-
dc.contributor.author	Ng, Maggie	-
dc.contributor.author	So, Wing Yee	-
dc.contributor.author	Tsui, Stephen K W	-
dc.contributor.author	Chan, Juliana C N	-
dc.date.accessioned	2015-12-21T06:48:52Z	-
dc.date.available	2015-12-21T06:48:52Z	-
dc.date.issued	2013	-
dc.identifier.citation	BMC Nephrology, 2013, v. 14, n. 1	-
dc.identifier.uri	http://hdl.handle.net/10722/222143	-
dc.language	eng	-
dc.relation.ispartof	BMC Nephrology	-
dc.rights	This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.	-
dc.subject	Random forest	-
dc.subject	Support vector machine	-
dc.subject	Diabetic kidney disease	-
dc.subject	Machine learning	-
dc.subject	Genotypes	-
dc.subject	Phenotypes	-
dc.subject	Prediction	-
dc.title	Using a multi-staged strategy based on machine learning and mathematical modeling to predict genotype-phenotype risk patterns in diabetic kidney disease: A prospective case-control cohort analysis	-
dc.type	Article	-
dc.description.nature	published_or_final_version	-
dc.identifier.doi	10.1186/1471-2369-14-162	-
dc.identifier.pmid	23879411	-
dc.identifier.scopus	eid_2-s2.0-84880335873	-
dc.identifier.volume	14	-
dc.identifier.issue	1	-
dc.identifier.eissn	1471-2369	-
dc.identifier.isi	WOS:000322376400001	-
dc.identifier.issnl	1471-2369	-
