postgraduate thesis: Statistical analysis of RNA-seq and scRNA-seq expression data
Title | Statistical analysis of RNA-seq and scRNA-seq expression data |
---|---|
Authors | Yip, Shun-hang (葉信恆) |
Advisors | Sham, PC; Wang, JJ |
Issue Date | 2018 |
Publisher | The University of Hong Kong (Pokfulam, Hong Kong) |
Citation | Yip, S. [葉信恆]. (2018). Statistical analysis of RNA-seq and scRNA-seq expression data. (Thesis). University of Hong Kong, Pokfulam, Hong Kong SAR. |
Abstract | RNA-seq is a popular technique that uses next-generation sequencing to obtain transcriptome information from a cell population. It generates short sequence reads from the transcriptome, which can be used for gene annotation, expression quantification, fusion gene detection, differentially expressed gene (DEG) analysis, etc. This technology can also be applied to single cells, which enables more in-depth study of the transcriptome. Single-cell RNA-seq (scRNA-seq) enables highly variable transcript discovery, cell subpopulation analysis, etc., in addition to common RNA-seq data analyses. The analysis of RNA-seq data can be separated into two main categories. The first category focuses on the read sequences. These data allow the analysis of alternative splicing, gene annotation, post-transcriptional modifications, gene fusion, etc. The second category focuses on the expression data, which are obtained by counting the number of reads generated from each gene or transcript. Analysis of the expression data includes DEG analysis, highly variable transcript discovery, cell subpopulation analysis, etc. This thesis begins by briefly describing the background of RNA-seq analysis and the commonly used pipelines from the first category. The main focus of this thesis is the statistical analysis of the expression data. RNA-seq tools that analyze expression data can often perform DEG analysis, and they have previously been shown to have advantages over each other in different aspects. For example, voom controls false positive rates well, DESeq2 focuses on precision, and edgeR has an advantage in overall accuracy. This prompts the development of a new method that can perform optimally in all of these aspects. On the other hand, scRNA-seq is a newer technology, and many tools have been developed recently. Compared to RNA-seq data, scRNA-seq expression matrices contain a higher proportion of zero counts, and their expression estimates are often less accurate. Hence, scRNA-seq analysis methods often emphasize technical noise reduction. Using DEG analysis, which is a basic statistical test, existing scRNA-seq tools are shown to be inferior to existing RNA-seq methods in controlling false positive rates with real scRNA-seq data. To improve current analysis pipelines, the issue is pinpointed to the normalization and transformation step, which is crucial for the reduction of technical noise. The linear model and normality based normalization and transformation method (Linnorm) is developed to normalize and transform scRNA-seq data for statistical analyses. Using real RNA-seq and scRNA-seq data, Linnorm is compared with existing normalization methods and shows improvements in multiple aspects. |
Degree | Doctor of Philosophy |
Subject | Nucleotide sequence - Statistical methods |
Dept/Program | Biomedical Sciences |
Persistent Identifier | http://hdl.handle.net/10722/266318 |
DC Field | Value | Language |
---|---|---|
dc.contributor.advisor | Sham, PC | - |
dc.contributor.advisor | Wang, JJ | - |
dc.contributor.author | Yip, Shun-hang | - |
dc.contributor.author | 葉信恆 | - |
dc.date.accessioned | 2019-01-18T01:52:02Z | - |
dc.date.available | 2019-01-18T01:52:02Z | - |
dc.date.issued | 2018 | - |
dc.identifier.citation | Yip, S. [葉信恆]. (2018). Statistical analysis of RNA-seq and scRNA-seq expression data. (Thesis). University of Hong Kong, Pokfulam, Hong Kong SAR. | - |
dc.identifier.uri | http://hdl.handle.net/10722/266318 | - |
dc.language | eng | - |
dc.publisher | The University of Hong Kong (Pokfulam, Hong Kong) | - |
dc.relation.ispartof | HKU Theses Online (HKUTO) | - |
dc.rights | The author retains all proprietary rights (such as patent rights) and the right to use in future works. | - |
dc.rights | This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License. | - |
dc.subject.lcsh | Nucleotide sequence - Statistical methods | - |
dc.title | Statistical analysis of RNA-seq and scRNA-seq expression data | - |
dc.type | PG_Thesis | - |
dc.description.thesisname | Doctor of Philosophy | - |
dc.description.thesislevel | Doctoral | - |
dc.description.thesisdiscipline | Biomedical Sciences | - |
dc.description.nature | published_or_final_version | - |
dc.identifier.doi | 10.5353/th_991044069403703414 | - |
dc.date.hkucongregation | 2018 | - |
dc.identifier.mmsid | 991044069403703414 | - |