Statistical analysis of RNA-seq and scRNA-seq expression data

Yip, Shun-hang; 葉信恆

DOI: 10.5353/th_991044069403703414

Appears in Collections:
- HKU Theses Online
- Biomedical Sciences: Theses

postgraduate thesis: Statistical analysis of RNA-seq and scRNA-seq expression data

Title	Statistical analysis of RNA-seq and scRNA-seq expression data
Authors	Yip, Shun-hang 葉信恆
Advisors	Sham, PC; Wang, JJ
Issue Date	2018
Publisher	The University of Hong Kong (Pokfulam, Hong Kong)
Citation	Yip, S. [葉信恆]. (2018). Statistical analysis of RNA-seq and scRNA-seq expression data. (Thesis). University of Hong Kong, Pokfulam, Hong Kong SAR.
Abstract	RNA-seq is a popular technique that uses next-generation sequencing to obtain transcriptome information from a cell population. It generates short sequence reads from the transcriptome, which can be used for gene annotation, expression quantification, fusion gene detection, differentially expressed gene (DEG) analysis, etc. The technology can also be applied to single cells, which enables more in-depth study of the transcriptome. Single-cell RNA-seq (scRNA-seq) enables highly variable transcript discovery, cell subpopulation analysis, etc., in addition to the common RNA-seq data analyses. The analysis of RNA-seq data can be separated into two main categories. The first category focuses on the read sequences. These data allow the analysis of alternative splicing, gene annotation, post-transcriptional modifications, gene fusion, etc. The second category focuses on the expression data, which are obtained by counting the number of reads generated from each gene or transcript. Analysis of the expression data includes DEG analysis, highly variable transcript discovery, cell subpopulation analysis, etc. This thesis begins by briefly describing the background of RNA-seq analysis and the commonly used pipelines from the first category. The main focus of this thesis is the statistical analysis of the expression data. RNA-seq tools that analyze expression data can often perform DEG analysis, and they have previously been shown to have advantages over each other in different respects. For example, voom controls false positive rates well, DESeq2 emphasizes precision, and edgeR has an advantage in overall accuracy. This prompts the development of a new method that can perform optimally in all of these respects. scRNA-seq, on the other hand, is a newer technology, and many tools for it have been developed recently. Compared to RNA-seq data, scRNA-seq expression matrices contain a higher proportion of zero counts, and their expression estimates are often less accurate. Hence, scRNA-seq analysis methods often emphasize technical noise reduction. Using DEG analysis, which is a basic statistical test, existing scRNA-seq tools are shown to be inferior to existing RNA-seq methods in controlling false positive rates on real scRNA-seq data. To improve current analysis pipelines, the issue is traced to the normalization and transformation step, which is crucial for reducing technical noise. The linear model and normality based normalization and transformation method (Linnorm) is developed to normalize and transform scRNA-seq data for statistical analyses. Using real RNA-seq and scRNA-seq data, Linnorm is compared with existing normalization methods and shows improvements in multiple respects.
Degree	Doctor of Philosophy
Subject	Nucleotide sequence - Statistical methods
Dept/Program	Biomedical Sciences
Persistent Identifier	http://hdl.handle.net/10722/266318

DC Field	Value	Language
dc.contributor.advisor	Sham, PC	-
dc.contributor.advisor	Wang, JJ	-
dc.contributor.author	Yip, Shun-hang	-
dc.contributor.author	葉信恆	-
dc.date.accessioned	2019-01-18T01:52:02Z	-
dc.date.available	2019-01-18T01:52:02Z	-
dc.date.issued	2018	-
dc.identifier.citation	Yip, S. [葉信恆]. (2018). Statistical analysis of RNA-seq and scRNA-seq expression data. (Thesis). University of Hong Kong, Pokfulam, Hong Kong SAR.	-
dc.identifier.uri	http://hdl.handle.net/10722/266318	-
dc.language	eng	-
dc.publisher	The University of Hong Kong (Pokfulam, Hong Kong)	-
dc.relation.ispartof	HKU Theses Online (HKUTO)	-
dc.rights	The author retains all proprietary rights (such as patent rights) and the right to use in future works.	-
dc.rights	This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.	-
dc.subject.lcsh	Nucleotide sequence - Statistical methods	-
dc.title	Statistical analysis of RNA-seq and scRNA-seq expression data	-
dc.type	PG_Thesis	-
dc.description.thesisname	Doctor of Philosophy	-
dc.description.thesislevel	Doctoral	-
dc.description.thesisdiscipline	Biomedical Sciences	-
dc.description.nature	published_or_final_version	-
dc.identifier.doi	10.5353/th_991044069403703414	-
dc.date.hkucongregation	2018	-
dc.identifier.mmsid	991044069403703414	-
