Dissecting Gene Expression Heterogeneity: Generalized Pearson Correlation Squares and the K-Lines Clustering Algorithm

Li, Jingyi Jessica; Zhou, Heather J.; Bickel, Peter J.; Tong, Xin


Publisher Website: 10.1080/01621459.2024.2342639
Scopus: eid_2-s2.0-85193986933
WOS: WOS:001230673400001

Citations:
- Scopus: 0
- Web of Science: 0
Appears in Collections:
- Faculty of Business & Economics: Journal/Magazine Articles


Title	Dissecting Gene Expression Heterogeneity: Generalized Pearson Correlation Squares and the K-Lines Clustering Algorithm
Authors	Li, Jingyi Jessica; Zhou, Heather J.; Bickel, Peter J.; Tong, Xin
Keywords	Asymptotic distribution; Mixture of linear dependences; Specified and unspecified generalized Pearson correlation squares
Issue Date	2024
Citation	Journal of the American Statistical Association, 2024, v. 119, n. 548, p. 2450-2463. DOI: http://dx.doi.org/10.1080/01621459.2024.2342639
Abstract	Motivated by the pressing needs for dissecting heterogeneous relationships in gene expression data, here we generalize the squared Pearson correlation to capture a mixture of linear dependences between two real-valued variables, with or without an index variable that specifies the line memberships. We construct the generalized Pearson correlation squares by focusing on three aspects: variable exchangeability, no parametric model assumptions, and inference of population-level parameters. To compute the generalized Pearson correlation square from a sample without a line-membership specification, we develop a K-lines clustering algorithm to find K clusters that exhibit distinct linear dependences, where K can be chosen in a data-adaptive way. To infer the population-level generalized Pearson correlation squares, we derive the asymptotic distributions of the sample-level statistics to enable efficient statistical inference. Simulation studies verify the theoretical results and show the power advantage of the generalized Pearson correlation squares in capturing mixtures of linear dependences. Gene expression data analyses demonstrate the effectiveness of the generalized Pearson correlation squares and the K-lines clustering algorithm in dissecting complex but interpretable relationships. The estimation and inference procedures are implemented in the R package gR2 (https://github.com/lijy03/gR2). Supplementary materials for this article are available online, including a standardized description of the materials available for reproducing the work.
Persistent Identifier	http://hdl.handle.net/10722/354335
ISSN	0162-1459 (2023 Impact Factor: 3.0; 2023 SCImago Journal Rankings: 3.922)
ISI Accession Number ID	WOS:001230673400001
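
The abstract describes two computations: clustering points into K groups with distinct linear dependences, and summarizing them with a generalized Pearson correlation square. The Python sketch below is only an illustration of that idea, not the gR2 package or the paper's exact procedure. It assumes that lines are fit by orthogonal regression, that points are assigned to the line with the smallest perpendicular distance, and that the sample statistic is the proportion-weighted average of within-group squared Pearson correlations. The function names (fit_line, perp_dist, k_lines, generalized_r2) are hypothetical.

```python
import numpy as np


def fit_line(points):
    """Orthogonal-regression line: returns (a point on the line, unit direction)."""
    center = points.mean(axis=0)
    # The leading right singular vector of the centered data is the
    # direction of largest variance.
    _, _, vt = np.linalg.svd(points - center, full_matrices=False)
    return center, vt[0]


def perp_dist(points, center, direction):
    """Perpendicular distance from each point to the line (center, direction)."""
    diff = points - center
    along = diff @ direction
    # Clip tiny negative values caused by floating-point round-off.
    return np.sqrt(np.maximum((diff ** 2).sum(axis=1) - along ** 2, 0.0))


def k_lines(points, k, n_iter=50, seed=0):
    """Assumed K-lines loop: alternate line fitting and nearest-line assignment."""
    rng = np.random.default_rng(seed)
    labels = rng.integers(k, size=len(points))  # random initial memberships
    for _ in range(n_iter):
        lines = []
        for j in range(k):
            members = points[labels == j]
            if len(members) < 2:
                # Re-seed a degenerate cluster from two random points.
                members = points[rng.choice(len(points), size=2, replace=False)]
            lines.append(fit_line(members))
        dists = np.column_stack([perp_dist(points, c, d) for c, d in lines])
        new_labels = dists.argmin(axis=1)
        if np.array_equal(new_labels, labels):
            break  # memberships stopped changing
        labels = new_labels
    return labels


def generalized_r2(x, y, labels):
    """Assumed statistic: sum over groups of (group proportion) * (Pearson r)^2."""
    total = 0.0
    for j in np.unique(labels):
        mask = labels == j
        if mask.sum() < 3:
            continue  # too few points for a meaningful correlation
        r = np.corrcoef(x[mask], y[mask])[0, 1]
        total += mask.mean() * r ** 2
    return total


# Two crossing lines, y = x and y = -x, with known line memberships.
x = np.linspace(-1.0, 1.0, 50)
x_all = np.concatenate([x, x])
y_all = np.concatenate([x, -x])
true_labels = np.repeat([0, 1], 50)

plain_r2 = np.corrcoef(x_all, y_all)[0, 1] ** 2
specified_r2 = generalized_r2(x_all, y_all, true_labels)
est_labels = k_lines(np.column_stack([x_all, y_all]), k=2)
```

With the known memberships, the ordinary squared correlation is essentially zero because the two slopes cancel, while the weighted statistic equals 1 up to floating-point error. The labels returned by k_lines depend on the random initialization and can settle in a local optimum, so several restarts would be a sensible precaution in practice.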

DC Field	Value	Language
dc.contributor.author	Li, Jingyi Jessica	-
dc.contributor.author	Zhou, Heather J.	-
dc.contributor.author	Bickel, Peter J.	-
dc.contributor.author	Tong, Xin	-
dc.date.accessioned	2025-02-07T08:47:58Z	-
dc.date.available	2025-02-07T08:47:58Z	-
dc.date.issued	2024	-
dc.identifier.citation	Journal of the American Statistical Association, 2024, v. 119, n. 548, p. 2450-2463	-
dc.identifier.issn	0162-1459	-
dc.identifier.uri	http://hdl.handle.net/10722/354335	-
dc.description.abstract	Motivated by the pressing needs for dissecting heterogeneous relationships in gene expression data, here we generalize the squared Pearson correlation to capture a mixture of linear dependences between two real-valued variables, with or without an index variable that specifies the line memberships. We construct the generalized Pearson correlation squares by focusing on three aspects: variable exchangeability, no parametric model assumptions, and inference of population-level parameters. To compute the generalized Pearson correlation square from a sample without a line-membership specification, we develop a K-lines clustering algorithm to find K clusters that exhibit distinct linear dependences, where K can be chosen in a data-adaptive way. To infer the population-level generalized Pearson correlation squares, we derive the asymptotic distributions of the sample-level statistics to enable efficient statistical inference. Simulation studies verify the theoretical results and show the power advantage of the generalized Pearson correlation squares in capturing mixtures of linear dependences. Gene expression data analyses demonstrate the effectiveness of the generalized Pearson correlation squares and the K-lines clustering algorithm in dissecting complex but interpretable relationships. The estimation and inference procedures are implemented in the R package gR2 (https://github.com/lijy03/gR2). Supplementary materials for this article are available online, including a standardized description of the materials available for reproducing the work.	-
dc.language	eng	-
dc.relation.ispartof	Journal of the American Statistical Association	-
dc.subject	Asymptotic distribution	-
dc.subject	Mixture of linear dependences	-
dc.subject	Specified and unspecified generalized Pearson correlation squares	-
dc.title	Dissecting Gene Expression Heterogeneity: Generalized Pearson Correlation Squares and the K-Lines Clustering Algorithm	-
dc.type	Article	-
dc.description.nature	link_to_subscribed_fulltext	-
dc.identifier.doi	10.1080/01621459.2024.2342639	-
dc.identifier.scopus	eid_2-s2.0-85193986933	-
dc.identifier.volume	119	-
dc.identifier.issue	548	-
dc.identifier.spage	2450	-
dc.identifier.epage	2463	-
dc.identifier.eissn	1537-274X	-
dc.identifier.isi	WOS:001230673400001	-
