Links for fulltext (may require subscription):
- Publisher Website: 10.1145/2939672.2939796
- Scopus: eid_2-s2.0-84985032082
- WOS: WOS:000485529800086
Conference Paper: Communication efficient distributed kernel principal component analysis
| Field | Value |
|---|---|
| Title | Communication efficient distributed kernel principal component analysis |
| Authors | Balcan, Maria Florina; Liang, Yingyu; Song, Le; Woodruff, David; Xie, Bo |
| Keywords | Distributed; Kernel method; Principal component analysis |
| Issue Date | 2016 |
| Citation | Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2016, v. 13-17-August-2016, p. 725-734 |
| Abstract | Kernel Principal Component Analysis (KPCA) is a key machine learning algorithm for extracting nonlinear features from data. In the presence of a large volume of high dimensional data collected in a distributed fashion, it becomes very costly to communicate all of this data to a single data center and then perform kernel PCA. Can we perform kernel PCA on the entire dataset in a distributed and communication efficient fashion while maintaining provable and strong guarantees in solution quality? In this paper, we give an affirmative answer to the question by developing a communication efficient algorithm to perform kernel PCA in the distributed setting. The algorithm is a clever combination of subspace embedding and adaptive sampling techniques, and we show that the algorithm can take as input an arbitrary configuration of distributed datasets, and compute a set of global kernel principal components with relative error guarantees independent of the dimension of the feature space or the total number of data points. In particular, computing k principal components with relative error ϵ over s workers has communication cost O(sρk/ϵ + sk²/ϵ³) words, where ρ is the average number of nonzero entries in each data point. Furthermore, we evaluated the algorithm on large-scale real world datasets. The experimental results showed that the algorithm produces a high quality kernel PCA solution while using significantly less communication than alternative approaches. |
| Persistent Identifier | http://hdl.handle.net/10722/341189 |
| ISI Accession Number ID | WOS:000485529800086 |
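The abstract centers on kernel PCA, the primitive that the paper's distributed algorithm computes with low communication. As background, the single-machine version can be sketched in a few lines: build a kernel matrix, center it in feature space, and take its top-k eigenvectors. This is a minimal NumPy illustration of plain KPCA with an RBF kernel (the `gamma` parameter and the toy data are assumptions for the example), not the paper's subspace-embedding/adaptive-sampling algorithm.

```python
import numpy as np

def kernel_pca(X, k, gamma=1.0):
    """Project n points onto the top-k kernel principal components (RBF kernel)."""
    # Pairwise squared distances, then the RBF kernel matrix K (n x n).
    sq = np.sum(X**2, axis=1)
    K = np.exp(-gamma * (sq[:, None] + sq[None, :] - 2.0 * X @ X.T))
    # Center the kernel matrix in feature space: Kc = (I - 1/n) K (I - 1/n).
    n = K.shape[0]
    one = np.ones((n, n)) / n
    Kc = K - one @ K - K @ one + one @ K @ one
    # Top-k eigenpairs of the symmetric centered kernel matrix.
    vals, vecs = np.linalg.eigh(Kc)
    idx = np.argsort(vals)[::-1][:k]
    # Scale eigenvectors by sqrt(eigenvalue) to get component scores.
    return vecs[:, idx] * np.sqrt(np.clip(vals[idx], 0.0, None))

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
Z = kernel_pca(X, k=2)
print(Z.shape)  # (100, 2)
```

The distributed challenge the paper addresses is that this n × n kernel matrix couples every pair of points, so naively centralizing data from s workers is expensive; their algorithm instead communicates only O(sρk/ϵ + sk²/ϵ³) words.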
DC Field | Value | Language |
---|---|---|
dc.contributor.author | Balcan, Maria Florina | - |
dc.contributor.author | Liang, Yingyu | - |
dc.contributor.author | Song, Le | - |
dc.contributor.author | Woodruff, David | - |
dc.contributor.author | Xie, Bo | - |
dc.date.accessioned | 2024-03-13T08:40:52Z | - |
dc.date.available | 2024-03-13T08:40:52Z | - |
dc.date.issued | 2016 | - |
dc.identifier.citation | Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2016, v. 13-17-August-2016, p. 725-734 | - |
dc.identifier.uri | http://hdl.handle.net/10722/341189 | - |
dc.description.abstract | Kernel Principal Component Analysis (KPCA) is a key machine learning algorithm for extracting nonlinear features from data. In the presence of a large volume of high dimensional data collected in a distributed fashion, it becomes very costly to communicate all of this data to a single data center and then perform kernel PCA. Can we perform kernel PCA on the entire dataset in a distributed and communication efficient fashion while maintaining provable and strong guarantees in solution quality? In this paper, we give an affirmative answer to the question by developing a communication efficient algorithm to perform kernel PCA in the distributed setting. The algorithm is a clever combination of subspace embedding and adaptive sampling techniques, and we show that the algorithm can take as input an arbitrary configuration of distributed datasets, and compute a set of global kernel principal components with relative error guarantees independent of the dimension of the feature space or the total number of data points. In particular, computing k principal components with relative error ϵ over s workers has communication cost O(sρk/ϵ + sk²/ϵ³) words, where ρ is the average number of nonzero entries in each data point. Furthermore, we evaluated the algorithm on large-scale real world datasets. The experimental results showed that the algorithm produces a high quality kernel PCA solution while using significantly less communication than alternative approaches. | -
dc.language | eng | - |
dc.relation.ispartof | Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining | - |
dc.subject | Distributed | - |
dc.subject | Kernel method | - |
dc.subject | Principal component analysis | - |
dc.title | Communication efficient distributed kernel principal component analysis | - |
dc.type | Conference_Paper | - |
dc.description.nature | link_to_subscribed_fulltext | - |
dc.identifier.doi | 10.1145/2939672.2939796 | - |
dc.identifier.scopus | eid_2-s2.0-84985032082 | - |
dc.identifier.volume | 13-17-August-2016 | - |
dc.identifier.spage | 725 | - |
dc.identifier.epage | 734 | - |
dc.identifier.isi | WOS:000485529800086 | - |