postgraduate thesis: Some non-convex statistical methods for complex data analysis
Title | Some non-convex statistical methods for complex data analysis |
---|---|
Authors | Zhang, Xiaoyu (张晓昱) |
Issue Date | 2021 |
Publisher | The University of Hong Kong (Pokfulam, Hong Kong) |
Citation | Zhang, X. [张晓昱]. (2021). Some non-convex statistical methods for complex data analysis. (Thesis). University of Hong Kong, Pokfulam, Hong Kong SAR. |
Abstract | A broad spectrum of scientific research, such as economics, genomics, neuroscience, and climatology, involves discovering low-dimensional structures in high-dimensional datasets to provide accurate prediction and reliable inference. This motivates extraordinary advances in various sparse and low-rank modeling methods based on convex relaxation. However, many existing convex methods are not scalable to high-dimensional massive datasets. Inspired by the remarkable empirical success of many non-convex methods, this thesis aims to develop novel non-convex statistical methods with rigorous theoretical guarantees and computationally efficient algorithms for complex data analysis.
The first part of this thesis focuses on identifying and modeling heterogeneity in panel data. Many panel datasets exhibit latent subgroup effects among individuals, and identifying these groups is essential for improving estimation efficiency. However, the widely used parametric and semiparametric models relating the response to the predictors may be misspecified, whereas a nonparametric approach avoids this risk. Moreover, the response may depend on the predictors in different ways at different quantile levels, and the corresponding grouping structure may vary accordingly. To tackle these problems, this thesis proposes a nonparametric quantile regression method for homogeneity pursuit in panel data models with fixed individual effects, in which a pairwise fused penalty automatically selects the number of groups. Asymptotic properties are established, and a scalable algorithm is developed.
The second part of this thesis proposes a new reduced-rank autoregressive model for high-dimensional time series analysis by introducing a common structure between the subspaces of responses and predictors. Compared with the original reduced-rank models, the proposed model achieves further dimension reduction and provides appealing factor interpretations of the dynamic structure. Under high-dimensional scaling, this thesis proposes a non-convex estimation framework built upon spectral initialization and iterative gradient descent refinements, and develops an integrated analysis with computational and statistical convergence guarantees. A data-driven selection procedure is proposed to consistently estimate the rank and common subspace dimension.
Third, in many biomedical studies, such as high-throughput microarray or RNA-sequencing gene expression analyses, it is of practical interest to link gene expression profiles to censored survival phenotypes. Because the number of genes greatly exceeds the sample size and survival data bring complications such as right censoring, regularized methods that combine a rank-based loss function with a penalty are often used to identify relevant prognostic biomarkers and to yield parsimonious prediction models for event times. Existing penalization methods for survival data use ℓ1 regularization to induce sparsity, leading to inflated model sizes. This thesis considers sparsity-restricted estimation in the accelerated failure time model for censored survival data. For its numerical implementation, a fast two-stage procedure is proposed that combines a regularized Gehan rank regression with a simple hard-thresholding step.
Extensive simulation experiments verify the statistical and computational properties of all proposed methodologies, and real applications demonstrate the promising performance of the proposed methods over various existing competitors. |
Degree | Doctor of Philosophy |
Subject | Data mining |
Dept/Program | Statistics and Actuarial Science |
Persistent Identifier | http://hdl.handle.net/10722/325759 |
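
The two ingredients named in the third part of the abstract, a Gehan rank loss for the accelerated failure time model and a hard-thresholding step, can be illustrated with a minimal sketch. This is not the thesis's implementation: the function names `gehan_loss` and `hard_threshold`, the argument layout, and the choice of a fixed sparsity level `k` are assumptions made for illustration, and the regularized first-stage solver is deliberately left out.

```python
import numpy as np


def gehan_loss(beta, X, log_time, delta):
    """Textbook Gehan rank loss for the accelerated failure time model.

    Residuals are e_i = log_time_i - x_i' beta. The loss averages
    delta_i * max(e_j - e_i, 0) over all ordered pairs (i, j), where
    delta_i = 1 marks an observed event and 0 a right-censored one.
    """
    e = log_time - X @ beta
    diff = e[None, :] - e[:, None]  # diff[i, j] = e_j - e_i
    return np.sum(delta[:, None] * np.maximum(diff, 0.0)) / len(e) ** 2


def hard_threshold(beta, k):
    """Keep the k largest-magnitude entries of beta and zero out the rest.

    The sparsity level k is a hypothetical tuning parameter here.
    """
    out = np.zeros_like(beta)
    if k > 0:
        keep = np.argsort(np.abs(beta))[-k:]
        out[keep] = beta[keep]
    return out
```

In a two-stage scheme of this kind, an initial estimate from a regularized fit of `gehan_loss` would be passed through `hard_threshold` to enforce an exact sparsity level, rather than relying on the penalty alone to control model size.
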
DC Field | Value | Language |
---|---|---|
dc.contributor.author | Zhang, Xiaoyu | - |
dc.contributor.author | 张晓昱 | - |
dc.date.accessioned | 2023-03-02T16:32:36Z | - |
dc.date.available | 2023-03-02T16:32:36Z | - |
dc.date.issued | 2021 | - |
dc.identifier.citation | Zhang, X. [张晓昱]. (2021). Some non-convex statistical methods for complex data analysis. (Thesis). University of Hong Kong, Pokfulam, Hong Kong SAR. | - |
dc.identifier.uri | http://hdl.handle.net/10722/325759 | - |
dc.language | eng | - |
dc.publisher | The University of Hong Kong (Pokfulam, Hong Kong) | - |
dc.relation.ispartof | HKU Theses Online (HKUTO) | - |
dc.rights | The author retains all proprietary rights (such as patent rights) and the right to use in future works. | - |
dc.rights | This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License. | - |
dc.subject.lcsh | Data mining | - |
dc.title | Some non-convex statistical methods for complex data analysis | - |
dc.type | PG_Thesis | - |
dc.description.thesisname | Doctor of Philosophy | - |
dc.description.thesislevel | Doctoral | - |
dc.description.thesisdiscipline | Statistics and Actuarial Science | - |
dc.description.nature | published_or_final_version | - |
dc.date.hkucongregation | 2022 | - |
dc.identifier.mmsid | 991044649903903414 | - |