A general view of Python-based scRNA-seq data analysis

Standard procedure of analyzing scRNA data

Pre-processing (10x; cellranger)
Quality control (filter; doublets)
Normalization
Data correction and integration
Feature selection
Dimensionality reduction and visualization
Cluster analysis and annotation
Compositional analysis
Trajectory analysis
Gene-level analysis (DEGs, Enrichment, GRN)

Basic concepts

箱型图 Box plot

Annotated data -- AnnData

数据结构
常用函数

Standard procedure of analyzing scRNA data

Pre-processing (10x; cellranger)

Collect data from 10x. Then use cellranger, a commercial software for data pre-processing, to get cell count matrix.

Quality control (filter; doublets)

After getting count matrix, quality control (QC) is needed to be implemented first. The purpose of QC has two sides. Firstly, we have to remove data of low quality, which may reflect the results of analysis. Then, we can examine the distribution of the collected data. In the following figure, every point denotes a cell. Plots in this figure respectively describes the genes detected of each single cell, the number of counts (?), and the proportion of mitochondrial genes.

After the general steps of quality control are implemented, it is also of great importance to deal with "doublets". Normally, one droplet only contains a single cell. However, restricted to the current technology, it is possible to obtain two cells in a droplet. The existing of double cells may lead to errors in the following analyzing steps. Currently, common ways to do doublets deletion are python-based (DoubletDetection) and R-based (DoubletFinder).

Normalization

For 10x scRNA-seq data, mainstream method is to normalize the counts to 10000, then take log function.

Data correction and integration

Due to the limitation of current technologies, some genes with low expression levels may not be detected. The aim of data correction is to do imputation, recover the true expression data. Novel method to do imputation is MAGIC, kNN-smoothing and SAVER. However, we have to clarify that the imputation may not make a difference to the downstream analysis, the main idea of imputation is to try to exhibit the whole gene profile.

At the same time, RNA data may be collected from different batches. To greatly analyze them, we need to do integration. Top3 methods to do integration according to prior study are Harmony, LIGER, and Seurat3.

Feature selection

The sparsity of scRAN-seq data is very large. Thus, we have to select features of high value and representative to do the following analysis.

Dimensionality reduction and visualization

Current ways to reduce dimensionality are UMAP, TSNE, LSI, PCA.

Cluster analysis and annotation

Compositional analysis

Trajectory analysis

Gene-level analysis (DEGs, Enrichment, GRN)

Basic concepts

箱型图 Box plot

In descriptive statistics, a box plot or boxplot is a method for graphically demonstrating the locality, spread and skewness groups of numerical data through their quartiles. Elements of a boxplot contain minimum ($Q_0$ or 0th percentile), maximum ($Q_4$ or 100th percentile), median ($Q_2$ or 50th percentile), first quartile ($Q_1$ or 25th percentile), third quartile ($Q_3$ or 75th percentile).

Annotated data -- AnnData

数据结构

结构	功能	数据类型
adata.X	矩阵数据	numpy数组、scipy sparse矩阵
adata.obs	观测值数据	pandas DataFrame
adata.var	特征和高可变基因数据	pandas DataFrame
adata.uns	非结构化数据	字典dict

adata.X本身不包含行名和列名信息，其行名和adata.obs共享，列名和adata.var共享

常用函数

# 数据数目统计
adata.n_obs # 返回细胞数
adata.n_vars # 返回基因数
adata.shape # 返回矩阵大小

# 数据键值提取
adata.obs_keys() # 获取细胞注释信息的Keys
adata.obs_names # 返回细胞ID，类型为Object
adata.var_index # 返回基因，类型为Object
adata.var_names.to_list() # 以List类型返回基因
adata.obs.head() # 查看Obs的前5行数据