
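About
Scanpy is a scalable Python toolkit for analyzing single-cell RNA-seq data, built on AnnData. It covers the complete single-cell workflow: quality control, normalization, dimensionality reduction, clustering, marker gene identification, visualization, and trajectory analysis.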
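name: scanpy
description: "Scanpy is a scalable Python toolkit for analyzing single-cell RNA-seq data, built on AnnData. Use it for complete single-cell workflows, including quality control, normalization, dimensionality reduction, clustering, marker gene identification, visualization, and trajectory analysis."
license: BSD-3-Clause license
metadata:
  skill-author: K-Dense Inc.
risk: unknown
source: community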
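Scanpy: Single-Cell Analysis
Overview
Scanpy is a scalable Python toolkit for analyzing single-cell RNA-seq data, built on AnnData. Use it for complete single-cell workflows, including quality control, normalization, dimensionality reduction, clustering, marker gene identification, visualization, and trajectory analysis.
When to Use
This skill applies when:
- Analyzing single-cell RNA-seq data (.h5ad, 10X, CSV formats)
- Performing quality control on scRNA-seq datasets
- Creating UMAP, t-SNE, or PCA visualizations
- Identifying cell clusters and finding marker genes
- Annotating cell types based on gene expression
- Performing trajectory inference or pseudotime analysis
- Generating publication-quality single-cell figures
Quick Start
Basic Imports and Setup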
import scanpy as sc
import pandas as pd
import numpy as np
# Configure settings
sc.settings.verbosity = 3
sc.settings.set_figure_params(dpi=80, facecolor='white')
sc.settings.figdir = './figures/'
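Loading Data
# From 10X Genomics (use whichever call matches your files)
adata = sc.read_10x_mtx('path/to/data/')    # Directory containing the matrix, barcodes, and features/genes files
adata = sc.read_10x_h5('path/to/data.h5')   # Cell Ranger HDF5 file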
# From h5ad (AnnData format)
adata = sc.read_h5ad('path/to/data.h5ad')
# From CSV (rows are read as cells, columns as genes; append .T if the file is genes × cells)
adata = sc.read_csv('path/to/data.csv')
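Expression tables exported as CSV are often stored genes × cells, whereas scanpy treats rows as cells. The following is a minimal sketch, using pandas and an invented three-gene table (the inline CSV text and all variable names are illustrative assumptions), of checking and fixing the orientation before the table is wrapped in an AnnData object:

```python
import io
import pandas as pd

# Hypothetical CSV with genes as rows and cell barcodes as columns.
csv_text = """gene,AAAC-1,AAAG-1
CD3D,5,0
MS4A1,0,7
NKG7,2,1
"""

genes_by_cells = pd.read_csv(io.StringIO(csv_text), index_col=0)

# Scanpy expects cells as rows (observations) and genes as columns (variables),
# so a genes-by-cells table has to be transposed first.
cells_by_genes = genes_by_cells.T

print(cells_by_genes.shape)          # (n_cells, n_genes)
print(list(cells_by_genes.columns))  # gene names
```

When the file is read with `sc.read_csv` directly, the equivalent step is to take `.T` of the returned AnnData object.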
Understanding the AnnData Structure
The AnnData object is scanpy's core data structure:
adata.X # Expression matrix (cells × genes)
adata.obs # Cell metadata (DataFrame)
adata.var # Gene metadata (DataFrame)
adata.uns # Unstructured annotations (dict)
adata.obsm # Multi-dimensional cell data (PCA, UMAP)
adata.raw # Raw data backup
# Access cell and gene names
adata.obs_names # Cell barcodes
adata.var_names # Gene names
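Standard Analysis Workflow
1. Quality Control
Identify and filter low-quality cells and genes:
# Identify mitochondrial genes (human symbols start with 'MT-'; mouse symbols use 'mt-')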
adata.var['mt'] = adata.var_names.str.startswith('MT-')
# Calculate QC metrics
sc.pp.calculate_qc_metrics(adata, qc_vars=['mt'], inplace=True)
# Visualize QC metrics
sc.pl.violin(adata, ['n_genes_by_counts', 'total_counts', 'pct_counts_mt'],
             jitter=0.4, multi_panel=True)
# Filter cells and genes
sc.pp.filter_cells(adata, min_genes=200)
sc.pp.filter_genes(adata, min_cells=3)
adata = adata[adata.obs.pct_counts_mt < 5, :].copy()  # Remove high MT% cells; copy so adata is not a view
Use the QC script for automated analysis:
python scripts/qc_analysis.py input_file.h5ad --output filtered.h5ad
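The three QC columns used in this step are simple per-cell summaries. The NumPy sketch below reproduces the arithmetic on an invented 3 × 4 count matrix so the thresholds are easier to reason about; it illustrates the definitions and is not scanpy's implementation. The gene names and counts are assumptions.

```python
import numpy as np

# Toy counts: 3 cells x 4 genes; the last gene stands in for a mitochondrial gene.
counts = np.array([[50,  0, 48,  2],
                   [ 0,  0, 10, 90],
                   [30, 20, 46,  4]])
gene_names = np.array(['CD3D', 'MS4A1', 'NKG7', 'MT-CO1'])
is_mt = np.char.startswith(gene_names, 'MT-')

n_genes_by_counts = (counts > 0).sum(axis=1)   # genes detected per cell
total_counts = counts.sum(axis=1)              # total counts per cell
pct_counts_mt = 100 * counts[:, is_mt].sum(axis=1) / total_counts

# Apply the same 5% mitochondrial cutoff as the filtering step.
keep = pct_counts_mt < 5

print(n_genes_by_counts, total_counts, pct_counts_mt, keep)
```

A fixed 5% cutoff is a common starting point rather than a rule; inspecting the violin plots for the dataset at hand is the safer way to pick thresholds.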
2. Normalization and Preprocessing
# Normalize to 10,000 counts per cell
sc.pp.normalize_total(adata, target_sum=1e4)
# Log-transform
sc.pp.log1p(adata)
# Freeze the normalized, log-transformed data in .raw (used later for plotting and marker tests)
adata.raw = adata
# Identify highly variable genes
sc.pp.highly_variable_genes(adata, n_top_genes=2000)
sc.pl.highly_variable_genes(adata)
# Subset to highly variable genes
adata = adata[:, adata.var.highly_variable].copy()  # Copy so later in-place steps do not operate on a view
# Regress out unwanted variation
sc.pp.regress_out(adata, ['total_counts', 'pct_counts_mt'])
# Scale data
sc.pp.scale(adata, max_value=10)
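As a rough illustration of what the first two calls in this step compute, the NumPy sketch below applies library-size normalization and a log1p transform to an invented 2 × 3 matrix. It shows the arithmetic only, under the assumption of a dense matrix with no empty cells, and is not scanpy's implementation.

```python
import numpy as np

# Toy counts: 2 cells x 3 genes; values are invented.
counts = np.array([[1., 3., 0.],
                   [10., 0., 10.]])
target_sum = 1e4

# Library-size normalization: rescale each cell so its counts sum to target_sum.
size_factors = counts.sum(axis=1, keepdims=True) / target_sum
normalized = counts / size_factors

# log1p = log(1 + x): zeros stay at zero and large values are compressed.
logged = np.log1p(normalized)

print(normalized.sum(axis=1))  # every cell now sums to target_sum
print(logged)
```

After this transform, cells with very different sequencing depth become comparable, which is why highly variable gene selection and PCA are run on the transformed values.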
3. Dimensionality Reduction
# PCA
sc.tl.pca(adata, svd_solver='arpack')
sc.pl.pca_variance_ratio(adata, log=True) # Check elbow plot
# Compute neighborhood graph
sc.pp.neighbors(adata, n_neighbors=10, n_pcs=40)
# UMAP for visualization
sc.tl.umap(adata)
sc.pl.umap(adata)  # Color by 'leiden' only after clustering (step 4) has been run
# Alternative: t-SNE
sc.tl.tsne(adata)
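The elbow plot is read by eye. As a complementary heuristic, the sketch below picks the smallest number of leading PCs whose cumulative variance ratio reaches a threshold. The helper `n_pcs_for_variance` is hypothetical (it is not part of scanpy), and the array is a made-up stand-in for `adata.uns['pca']['variance_ratio']`.

```python
import numpy as np

def n_pcs_for_variance(variance_ratio, threshold=0.85):
    """Smallest number of leading PCs whose cumulative variance ratio reaches threshold.

    Falls back to all available PCs if the threshold is never reached.
    """
    cumulative = np.cumsum(variance_ratio)
    # searchsorted returns the first index where cumulative >= threshold.
    idx = int(np.searchsorted(cumulative, threshold))
    return min(idx + 1, len(cumulative))

# Stand-in for adata.uns['pca']['variance_ratio']; values are invented.
variance_ratio = np.array([0.40, 0.25, 0.15, 0.10, 0.05, 0.03, 0.02])
print(n_pcs_for_variance(variance_ratio, threshold=0.85))
```

The result could then be passed as `n_pcs` to `sc.pp.neighbors`. On real data the ratios are fractions of the total variance, so the computed PCs may never reach a high threshold; in that case the helper simply returns all of them.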
4. Clustering
# Leiden clustering (recommended)
sc.tl.leiden(adata, resolution=0.5)
sc.pl.umap(adata, color='leiden', legend_loc='on data')
# Try multiple resolutions to find optimal granularity
for res in [0.3, 0.5, 0.8, 1.0]:
    sc.tl.leiden(adata, resolution=res, key_added=f'leiden_{res}')
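Once several resolutions have been stored, a quick tabulation helps compare them. This sketch uses a small pandas DataFrame with invented labels as a stand-in for `adata.obs`; the `leiden_*` column names follow the `key_added` pattern from the loop.

```python
import pandas as pd

# Stand-in for adata.obs after running several resolutions; labels are invented.
obs = pd.DataFrame({
    'leiden_0.3': ['0', '0', '1', '1', '1', '0'],
    'leiden_0.5': ['0', '0', '1', '1', '2', '0'],
    'leiden_1.0': ['0', '3', '1', '1', '2', '0'],
})

resolution_cols = [c for c in obs.columns if c.startswith('leiden_')]

# One row per resolution: how many clusters, and how small the smallest one is.
summary = pd.DataFrame({
    'n_clusters': obs[resolution_cols].nunique(),
    'smallest_cluster': [obs[c].value_counts().min() for c in resolution_cols],
})
print(summary)
```

A resolution that produces many very small clusters is often a sign of over-clustering, though the right granularity depends on the biology being studied.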
5. Marker Gene Identification
# Find marker genes for each cluster
sc.tl.rank_genes_groups(adata, 'leiden', method='wilcoxon')
# Visualize results
sc.pl.rank_genes_groups(adata, n_genes=25, sharey=False)
sc.pl.rank_genes_groups_heatmap(adata, n_genes=10)
sc.pl.rank_genes_groups_dotplot(adata, n_genes=5)
# Get results as DataFrame
markers = sc.get.rank_genes_groups_df(adata, group='0')
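The returned DataFrame usually needs filtering before it is useful as a marker list. The sketch below keeps significantly up-regulated genes and orders them by score. The DataFrame is an invented stand-in with the same column names as the real result, and `top_markers` and its default cutoffs are hypothetical choices for illustration.

```python
import pandas as pd

# Stand-in with the columns of a rank_genes_groups_df result; values are invented.
markers = pd.DataFrame({
    'names': ['CD3D', 'IL7R', 'ACTB', 'LYZ'],
    'scores': [12.1, 9.4, 3.0, -8.2],
    'logfoldchanges': [3.2, 2.1, 0.2, -4.0],
    'pvals': [1e-30, 1e-20, 1e-3, 1e-15],
    'pvals_adj': [1e-27, 1e-18, 4e-2, 1e-13],
})

def top_markers(df, min_logfc=1.0, max_padj=0.05, n=10):
    """Keep significantly up-regulated genes, strongest score first."""
    hits = df[(df['logfoldchanges'] >= min_logfc) & (df['pvals_adj'] <= max_padj)]
    return hits.sort_values('scores', ascending=False).head(n)

print(top_markers(markers)['names'].tolist())
```

Filtering on both effect size and adjusted p-value avoids genes that are statistically significant only because of the large number of cells.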
6. Cell Type Annotation
# Define marker genes for known cell types
marker_genes = ['CD3D', 'CD14', 'MS4A1', 'NKG7', 'FCGR3A']
# Visualize markers
sc.pl.umap(adata, color=marker_genes, use_raw=True)
sc.pl.dotplot(adata, var_names=marker_genes, groupby='leiden')
# Manual annotation
cluster_to_celltype = {
    '0': 'CD4 T cells',
    '1': 'CD14+ Monocytes',
    '2': 'B cells',
    '3': 'CD8 T cells',
}
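A mapping like this is typically applied to the cluster labels to create a new annotation column. The sketch below shows one way to do that with pandas, using an invented Series as a stand-in for `adata.obs['leiden']`. The `cell_type` name and the `'Unknown'` fallback are assumptions, not part of the original workflow.

```python
import pandas as pd

cluster_to_celltype = {
    '0': 'CD4 T cells',
    '1': 'CD14+ Monocytes',
    '2': 'B cells',
    '3': 'CD8 T cells',
}

# Stand-in for adata.obs['leiden']; cluster '4' has no entry in the mapping.
leiden = pd.Series(['0', '1', '2', '3', '4', '0'], dtype='category', name='leiden')

cell_type = (
    leiden.astype(str)          # plain strings, so unmapped labels become NaN
    .map(cluster_to_celltype)
    .fillna('Unknown')          # make unmapped clusters explicit
    .astype('category')
)
print(cell_type.value_counts())
```

On a real object the result would be assigned to something like `adata.obs['cell_type']` and then passed as `color` to `sc.pl.umap`. Labeling unmapped clusters explicitly makes gaps in the annotation visible instead of silently dropping them.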
Compatible Tools
Claude Code, Cursor
Tags
Data Engineering
