postgraduate thesis: Statistical learning by embedding data into computational graphs
Title | Statistical learning by embedding data into computational graphs |
---|---|
Authors | Gu, Jiaqi |
Issue Date | 2022 |
Publisher | The University of Hong Kong (Pokfulam, Hong Kong) |
Citation | Gu, J. (2022). Statistical learning by embedding data into computational graphs. (Thesis). University of Hong Kong, Pokfulam, Hong Kong SAR. |
Abstract | As the dimension of real-world data continues to grow in the era of big data, various techniques have been proposed to obtain embeddings that retain the important information in the data. By mapping observations to embeddings with either lower dimension or less noise, the signal-to-noise ratio of the data is increased, leading to statistical learning approaches with higher statistical and computational efficiency. This thesis introduces a new framework of data embedding techniques. By embedding observations as records or nodes of a computational graph, we develop several statistical learning methods that carry out statistical inference using the network structures of different types of data.
In the first part, a nonparametric node clustering approach with node covariates is developed for relational data. By defining the triangular concordance index between links and latent positions of nodes, we propose triangular concordance learning, which estimates the latent positions by maximizing the penalized triangular concordance function. Without a prespecified number of clusters, the fused penalty shrinks together the node-specific centers of nodes with similar link patterns and provides an estimated community structure of the nodes. In addition, an individualized criterion for the linkage of nodes is obtained to predict unobserved or future links in a nonparametric way.
In the second part, we discuss the efficient computation of maximum likelihood estimation (MLE) under generalized multinomial models. From the necessary condition that the gradient of the log-likelihood function equals $0$ at the MLE, we show theoretically that the MLE corresponds to the stationary distribution of an inhomogeneous Markov chain indexed by the MLE itself. Therefore, observations under generalized multinomial models are interpreted as win-loss records of a tournament network, and a Markov chain-based algorithm is developed to compute the MLE efficiently.
In the third part, we suggest that the Delaunay triangulation implies a geometry-based network structure of datasets with the highest level of smoothness. Based on this interpretation, we incorporate the Delaunay triangulation into nonparametric regression and develop crystallization learning to estimate the conditional expectation function with computational efficiency. Compared to existing approaches, crystallization learning and its variants select neighboring data points uniformly in all directions and are thus robust to the local geometric structure of the data, leading to better estimation performance on both synthetic and real data.
In the final part, a greedy algorithm is developed to compress 3D point cloud data with a triangular network structure. Based on a local retriangulation method, which uses the network structure to fill in the resulting hole and compute the information loss caused by the removal of each point, the proposed algorithm progressively removes the least informative point so that local features of the network structure are polished earlier than global features. In addition, a rank-based procedure is proposed to detect the change point of information loss across iterations and is used to select the optimal compression rate while maintaining the approximation quality of the network structure. |
Degree | Doctor of Philosophy |
Subject | Mathematical statistics - Data processing; Computer graphics |
Dept/Program | Statistics and Actuarial Science |
Persistent Identifier | http://hdl.handle.net/10722/325767 |
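
The stationary-distribution characterization described in the second part of the abstract can be illustrated on the Bradley-Terry model for pairwise comparisons, a simple model for win-loss records. The sketch below is not the thesis's algorithm and does not cover generalized multinomial models. It assumes the Bradley-Terry win probability `pi[i] / (pi[i] + pi[j])`; the function names and the transition rates `wins[j][i] / (pi[i] + pi[j])` are choices made here for illustration. With these rates, setting the gradient of the log-likelihood to zero coincides with the balance equations of the chain, so the code alternates between building the chain from the current estimate and replacing the estimate with that chain's stationary distribution. It also assumes every item has at least one win and one loss, and that the comparison graph is connected.

```python
def stationary_distribution(rates, n_steps=10_000, tol=1e-14):
    """Stationary distribution of a continuous-time Markov chain.

    rates[i][j] is the rate of moving from state i to state j (i != j).
    Uses power iteration on the uniformized discrete-time chain.
    """
    n = len(rates)
    out = [sum(rates[i][j] for j in range(n) if j != i) for i in range(n)]
    scale = 2.0 * max(out)  # self-loop probability >= 1/2, so the chain is aperiodic
    p = [1.0 / n] * n
    for _ in range(n_steps):
        new_p = [
            p[i] * (1.0 - out[i] / scale)
            + sum(p[j] * rates[j][i] for j in range(n) if j != i) / scale
            for i in range(n)
        ]
        if max(abs(a - b) for a, b in zip(new_p, p)) < tol:
            return new_p
        p = new_p
    return p


def bradley_terry_mle(wins, n_iter=500, tol=1e-10):
    """Fixed-point iteration for Bradley-Terry strengths (illustrative sketch).

    wins[i][j] is the number of times item i beat item j.
    Assumption: the chain moves from i to j at rate wins[j][i] / (pi[i] + pi[j]),
    so probability mass flows toward items that win more often.
    """
    n = len(wins)
    pi = [1.0 / n] * n
    for _ in range(n_iter):
        rates = [
            [0.0 if i == j else wins[j][i] / (pi[i] + pi[j]) for j in range(n)]
            for i in range(n)
        ]
        new_pi = stationary_distribution(rates)
        delta = max(abs(a - b) for a, b in zip(new_pi, pi))
        pi = new_pi
        if delta < tol:
            break
    return pi


# Hypothetical win-loss records for three items.
wins = [[0, 3, 2],
        [1, 0, 4],
        [2, 1, 0]]
strengths = bradley_terry_mle(wins)
print(strengths)
```

At the returned estimate, each item's expected number of wins under the model should match its observed number of wins, which is the zero-gradient condition for this special case.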
DC Field | Value | Language |
---|---|---|
dc.contributor.author | Gu, Jiaqi | - |
dc.date.accessioned | 2023-03-02T16:32:40Z | - |
dc.date.available | 2023-03-02T16:32:40Z | - |
dc.date.issued | 2022 | - |
dc.identifier.citation | Gu, J. (2022). Statistical learning by embedding data into computational graphs. (Thesis). University of Hong Kong, Pokfulam, Hong Kong SAR. | - |
dc.identifier.uri | http://hdl.handle.net/10722/325767 | - |
dc.language | eng | - |
dc.publisher | The University of Hong Kong (Pokfulam, Hong Kong) | - |
dc.relation.ispartof | HKU Theses Online (HKUTO) | - |
dc.rights | The author retains all proprietary rights (such as patent rights) and the right to use in future works. | - |
dc.rights | This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License. | - |
dc.subject.lcsh | Mathematical statistics - Data processing | - |
dc.subject.lcsh | Computer graphics | - |
dc.title | Statistical learning by embedding data into computational graphs | - |
dc.type | PG_Thesis | - |
dc.description.thesisname | Doctor of Philosophy | - |
dc.description.thesislevel | Doctoral | - |
dc.description.thesisdiscipline | Statistics and Actuarial Science | - |
dc.description.nature | published_or_final_version | - |
dc.date.hkucongregation | 2022 | - |
dc.identifier.mmsid | 991044649996203414 | - |