
🧬 DNBelab C Series HT scRNA Analysis Output Documentation

A Complete Guide to Single-Cell RNA Sequencing Analysis Output Files

πŸ“ Directory Structure β€’ πŸ“‹ File Details β€’ 🧬 Data Matrix β€’ πŸ“Š Analysis Results β€’ πŸ“Š Report Interpretation


📖 Overview

After the single-cell RNA analysis is complete, a standardized file and subdirectory structure is generated in the specified output directory, specifically for gene expression profile analysis and cell type identification. This document details the content, format, and purpose of each output file to help users fully understand and efficiently utilize the single-cell RNA analysis results.

💡 Tip: All output files use standard formats compatible with mainstream single-cell analysis tools (such as Scanpy, Seurat, etc.) and follow internationally recognized data format specifications.


πŸ“ Output Directory Structure

.
β”œβ”€β”€ analysis/                      # Downstream analysis results directory
β”‚   β”œβ”€β”€ cluster.csv                # Cell clustering results file
β”‚   β”œβ”€β”€ marker.csv                 # Differentially expressed gene marker file
β”‚   └── QC_Cluster.h5ad            # AnnData object after quality control and clustering
β”œβ”€β”€ anno_decon_sorted.bam          # Aligned, annotated, and sorted BAM file
β”œβ”€β”€ anno_decon_sorted.bam.bai      # BAM index file
β”œβ”€β”€ filter_feature.h5ad            # Filtered feature matrix (AnnData format)
β”œβ”€β”€ filter_matrix/                 # Filtered gene expression matrix directory
β”‚   β”œβ”€β”€ barcodes.tsv.gz            # Cell barcode file
β”‚   β”œβ”€β”€ features.tsv.gz            # Gene/feature information file
β”‚   └── matrix.mtx.gz              # Sparse matrix file (Market Matrix format)
β”œβ”€β”€ metrics_summary.xls            # Analysis metrics summary table
β”œβ”€β”€ raw_matrix/                    # Raw gene expression matrix directory
β”‚   β”œβ”€β”€ barcodes.tsv.gz            # Raw cell barcode file
β”‚   β”œβ”€β”€ features.tsv.gz            # Raw gene/feature information file
β”‚   └── matrix.mtx.gz              # Raw sparse matrix file
β”œβ”€β”€ singlecell.csv                 # Single-cell metadata information table
└── *_scRNA_report.html            # Analysis report in HTML format
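
As a quick sanity check after a run, the layout above can be verified programmatically. The following is a minimal sketch using only the Python standard library; the function name `missing_outputs` is hypothetical and is not part of the analysis software.

```python
from pathlib import Path

# Expected outputs, copied from the directory tree above.
EXPECTED_FILES = [
    "analysis/cluster.csv",
    "analysis/marker.csv",
    "analysis/QC_Cluster.h5ad",
    "anno_decon_sorted.bam",
    "anno_decon_sorted.bam.bai",
    "filter_feature.h5ad",
    "filter_matrix/barcodes.tsv.gz",
    "filter_matrix/features.tsv.gz",
    "filter_matrix/matrix.mtx.gz",
    "metrics_summary.xls",
    "raw_matrix/barcodes.tsv.gz",
    "raw_matrix/features.tsv.gz",
    "raw_matrix/matrix.mtx.gz",
    "singlecell.csv",
]
# The report name carries a sample-specific prefix, so it is matched by pattern.
REPORT_PATTERN = "*_scRNA_report.html"


def missing_outputs(outdir):
    """Return the expected output paths that are absent from outdir."""
    root = Path(outdir)
    missing = [rel for rel in EXPECTED_FILES if not (root / rel).is_file()]
    if not any(root.glob(REPORT_PATTERN)):
        missing.append(REPORT_PATTERN)
    return missing
```

An empty list means every file in the tree was found; otherwise the list names what is missing.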

📋 Detailed File Description

🧬 Alignment and Annotation Files

🎯 Core Content: Result files from aligning raw sequencing data to the reference genome, containing complete alignment information and cell barcode tags.

-----------

📄 anno_decon_sorted.bam

This is the scRNA-seq alignment result file. It contains the aligned, annotated, and sorted reads for the entire raw dataset.

-----------

📄 anno_decon_sorted.bam.bai

The index file for anno_decon_sorted.bam.


📈 Feature Matrix Files

🎯 Core Content: Single-cell gene expression count matrices, divided into raw and quality-controlled filtered data, using standard sparse matrix or AnnData format.

πŸ“ Filtered Gene Expression Matrix (filter_matrix/)

Contains the gene expression count matrix after filtering for high-quality cells, which is the core data for downstream quantitative analysis.

-----------

πŸ“ Raw Gene Expression Matrix (raw_matrix/)

Contains the raw gene expression count matrix for all detected cell barcodes (unfiltered).

-----------

📄 filter_feature.h5ad

The feature matrix after cell identification and filtering, stored in AnnData (.h5ad) format. It can be used instead of, or alongside, the contents of the filter_matrix/ directory.


📊 Analysis Results Directory (analysis/)

🎯 Core Content: Results of downstream bioinformatics analysis, including cell clustering, differential genes, and post-QC data.

-----------

📄 cluster.csv

The cell clustering analysis result file in CSV format. It contains each cell's ID, its assigned cluster, dimensionality reduction coordinates, and key QC metrics.
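
A table like this can be summarized with the standard library alone. The sketch below counts cells per cluster. The exact column headers of cluster.csv are not listed in this document, so the column name is passed in as a parameter, and the name `cells_per_cluster` is hypothetical.

```python
import csv
from collections import Counter


def cells_per_cluster(csv_path, cluster_column):
    """Count rows (cells) per cluster label in a cluster table.

    cluster_column is the header of the cluster-assignment column; check the
    first line of the file for the real name, since it is assumed here.
    """
    with open(csv_path, newline="") as handle:
        reader = csv.DictReader(handle)
        if cluster_column not in (reader.fieldnames or []):
            raise KeyError(
                f"column {cluster_column!r} not found; available: {reader.fieldnames}"
            )
        return Counter(row[cluster_column] for row in reader)
```

Passing the column name explicitly keeps the sketch independent of any particular header convention.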

-----------

📄 marker.csv

A list of differentially expressed genes (marker genes) for each cluster, in CSV format. For each gene it records statistics such as the significance of its differential expression in the cluster and the change in its expression level.

-----------

📄 QC_Cluster.h5ad

A single-cell data object that has undergone complete quality control, dimensionality reduction, and clustering analysis, in AnnData (.h5ad) format. It integrates the upstream expression matrix with downstream analysis results.


πŸ“ Analysis Metrics Summary

🎯 Core Content: A summary of experimental quality assessment and statistical metrics, providing comprehensive data quality control information.

📄 metrics_summary.xls

A summary table of key analysis metrics in Excel format, providing a comprehensive assessment of the overall quality of the experiment.

-----------

📄 singlecell.csv

A single-cell level quality control information table in CSV format, recording detailed statistical data for each cell barcode.

-----------

📄 *_scRNA_report.html

An interactive comprehensive analysis report in HTML web format.


📄 File Format Description

Technical Specifications: Detailed descriptions of the standard formats used for output files.

📊 Matrix Market Format (.mtx.gz)

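The Market Exchange Format (MEX) is a standard format used in single-cell analysis for storing sparse count matrices. The counts are stored in a Matrix Market file (matrix.mtx.gz) alongside the barcode and feature lists (barcodes.tsv.gz and features.tsv.gz). Only non-zero entries are written, which keeps the files small, and the format is widely supported by single-cell analysis tools.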
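
To make the sparse layout concrete, here is a minimal standard-library sketch that parses a gzipped Matrix Market coordinate file. It assumes the coordinate (triplet) variant of the format with 1-based indices. Which axis holds genes and which holds barcodes is not stated in this document, so the sketch returns plain (row, column) indices. The name `read_mtx_gz` is hypothetical.

```python
import gzip


def read_mtx_gz(path):
    """Parse a gzipped Matrix Market coordinate file.

    Returns (shape, entries), where shape is (n_rows, n_cols) and entries maps
    0-based (row, col) index pairs to the stored non-zero value.
    """
    shape = None
    expected = None
    entries = {}
    with gzip.open(path, "rt") as handle:
        for line in handle:
            line = line.strip()
            if not line or line.startswith("%"):
                continue  # banner ("%%MatrixMarket ...") and comment lines
            fields = line.split()
            if shape is None:
                # First data line: number of rows, columns, and stored entries.
                n_rows, n_cols, expected = (int(x) for x in fields)
                shape = (n_rows, n_cols)
                continue
            # Indices in the file are 1-based; convert to 0-based.
            row, col = int(fields[0]) - 1, int(fields[1]) - 1
            entries[(row, col)] = float(fields[2])
    if shape is None:
        raise ValueError("no size line found")
    if len(entries) != expected:
        raise ValueError(f"expected {expected} entries, found {len(entries)}")
    return shape, entries
```

For real datasets a dedicated reader from a numerical library will be much faster; this version only shows what the file contains.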


πŸ—ƒοΈ AnnData Format (.h5ad)

Format Overview: AnnData ("Annotated Data") is a data structure designed for matrix-like data, particularly suitable for single-cell RNA sequencing data analysis. Based on the HDF5 format, it provides efficient data storage and access capabilities.

πŸ—οΈ Data Structure

AnnData Format Structure Diagram
πŸ“ Component 🎯 Function πŸ“ Dimensions
X Main expression matrix n_cells Γ— n_genes
obs Cell metadata n_cells Γ— n_obs_features
var Gene metadata n_genes Γ— n_var_features
obsm Cell multidimensional data n_cells Γ— n_components
varm Gene multidimensional data n_genes Γ— n_components
layers Multi-layer data n_cells Γ— n_genes
uns Unstructured data Any object
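
The dimension rules in the table can be read as a small contract: cell-indexed components share the cell axis and gene-indexed components share the gene axis. The sketch below encodes that contract in plain Python for illustration. It does not use the anndata package, and the name `shape_violations` is hypothetical.

```python
def shape_violations(n_cells, n_genes, shapes):
    """List components whose (rows, cols) break the layout rules in the table.

    shapes maps a component name to its (rows, cols) tuple, for example
    {"X": (n_cells, n_genes), "obsm": (n_cells, 2)}.
    """
    expected_rows = {
        "X": n_cells, "obs": n_cells, "obsm": n_cells, "layers": n_cells,
        "var": n_genes, "varm": n_genes,
    }
    # Only X and layers have a fixed width; other components are free.
    expected_cols = {"X": n_genes, "layers": n_genes}
    problems = []
    for name, (rows, cols) in shapes.items():
        if name not in expected_rows:
            continue  # e.g. "uns" holds arbitrary objects, nothing to check
        if rows != expected_rows[name]:
            problems.append(f"{name}: expected {expected_rows[name]} rows, got {rows}")
        elif name in expected_cols and cols != expected_cols[name]:
            problems.append(f"{name}: expected {expected_cols[name]} columns, got {cols}")
    return problems
```

For example, a 2D embedding stored under obsm passes as long as it has one row per cell, regardless of how many components it has.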

📊 Web Report Interpretation

🎯 Overview: The HTML web report provides a comprehensive visual display and detailed interpretation of single-cell RNA sequencing analysis results, including the evaluation of key performance indicators, to help users quickly understand the experimental quality and analysis results.

The HTML web report is a comprehensive display platform for single-cell RNA sequencing analysis, integrating complete results from data quality control to downstream biological analysis. The report uses an interactive visual design to help users quickly assess experimental quality, understand analysis results, and guide future research directions.

💡 Usage Suggestion: It is recommended to review the metrics in the order they are presented in the report.

⚠️ Quality Standards: Each metric is provided with recommended thresholds and quality levels. Please conduct a comprehensive evaluation based on specific experimental goals.

📊 Main Content and Structure of the Report

[Figure: scRNA Web Report]

🧬 Detailed Explanation of Core Analysis Metrics

🧬 Cell Metrics

🎯 Core Function: Cell identification, quality assessment, and gene expression statistics, providing key indicators of the overall effectiveness of the experiment.

📊 Quality Control Standards:

Note: The following standards are for reference only. Actual quality assessment should consider multiple factors such as tissue type, cell state, and experimental goals. Significant differences may exist between different samples, and it is recommended to make judgments based on the specific experimental context.

| Metric Name | Recommended | Acceptable | Needs Improvement |
|---|---|---|---|
| Mean reads per cell | ≥ 30,000 | 15,000–30,000 | < 15,000 |
| Median genes per cell | ≥ 1,000 | 500–1,000 | < 500 |
| Fraction reads in cells | ≥ 60% | 30–60% | < 30% |
| Sequencing saturation | ≥ 40% | 20–40% | < 20% |
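
The three quality levels can be applied mechanically once the thresholds are written down. The following sketch grades a value against the table above. The function name `grade_metric` and the convention of passing percentages as plain numbers (60 for 60%) are assumptions made for this example, and the result is a reference label only, as the note above explains.

```python
# (recommended_min, acceptable_min) per metric, copied from the table above.
# Percent-based metrics are expressed as numbers between 0 and 100.
CELL_THRESHOLDS = {
    "Mean reads per cell": (30_000, 15_000),
    "Median genes per cell": (1_000, 500),
    "Fraction reads in cells": (60.0, 30.0),
    "Sequencing saturation": (40.0, 20.0),
}


def grade_metric(name, value):
    """Return the quality level of a higher-is-better cell metric."""
    recommended_min, acceptable_min = CELL_THRESHOLDS[name]
    if value >= recommended_min:
        return "Recommended"
    if value >= acceptable_min:
        return "Acceptable"
    return "Needs Improvement"
```

For instance, `grade_metric("Median genes per cell", 750)` falls in the 500–1,000 band and is labelled "Acceptable".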

πŸ” Detailed Metric Explanations:

Metric Name Detailed Explanation and Technical Requirements
Estimated number of cells
Estimated Cell Count
  • Definition: The total number of valid cells (as opposed to background noise or empty droplets) identified from the sequencing data.
  • Calculation Process: After merging cell barcodes from the same droplet, real cells are predicted based on an empty-droplet model (EmptyDrops).
  • Quality Interpretation:
    • Abnormal Causes: Inaccurate cell counting, cell lysis, poor sample or library quality, low sequencing depth.
Species
Species Information
  • Definition: The species or reference genome version used for the analysis.
  • Description: This information comes from the reference genome supplied when the reference database was built and is used to ensure the accuracy of alignment and annotation.
Mean reads per cell
Mean Reads per Cell
  • Definition: The average number of raw sequencing reads allocated to each cell.
  • Calculation: `Total number of raw sequencing reads / Estimated number of cells`
  • Quality Interpretation: A value ≥ 30,000 is recommended to ensure sufficient transcript coverage.
Median/Mean UMI per cell
Median/Mean UMIs per Cell
  • Definition: The median/mean number of unique molecular identifiers (UMIs) detected in each cell.
  • Biological Significance: Measures the gene expression captured in each cell. Because each UMI marks one original molecule, UMI counts reflect the abundance of original mRNA molecules more accurately than read counts.
  • Quality Interpretation: This metric is affected by cell type, sequencing depth, and library quality. A low value may indicate insufficient sequencing depth or poor sample quality.
Median/Mean genes per cell
Median/Mean Genes per Cell
  • Definition: The median/mean number of genes detected within a single cell.
  • Biological Significance: This metric directly reflects the complexity of the single-cell transcriptome and the sequencing depth. A higher value indicates better single-cell data quality.
  • Quality Interpretation:
    • Note: This value is highly dependent on cell type and sequencing depth. Cell types with low transcript content (such as blood cells) may have a lower value.
Total genes detected
Total Genes Detected
  • Definition: The total number of genes detected in the entire sample, requiring each gene to have at least one UMI count in at least one cell.
  • Biological Significance: Reflects the overall transcriptome complexity of the sample and whether the sequencing was comprehensive.
  • Quality Interpretation: A low value may indicate insufficient sequencing depth or a uniform cell type in the sample.
Fraction reads in cells
Fraction of Reads in Cells
  • Definition: The proportion of reads successfully assigned to high-quality cell IDs among all validly aligned reads (with valid barcodes/UMIs and confidently mapped to the transcriptome).
  • Biological Significance: Reflects the efficiency of cell capture and the signal-to-noise ratio.
  • Quality Interpretation:
    • Quality Issues: A low proportion may indicate poor sample quality (e.g., extensive cell fragmentation releasing free-floating RNA) or abnormalities in library construction.
Sequencing saturation
Sequencing Saturation
  • Definition: A metric to assess whether sequencing depth is sufficient, calculated as `1 - (number of deduplicated UMIs / total number of reads)`.
  • Biological Significance: Reflects library complexity and the cost-effectiveness of sequencing. High saturation means that increasing sequencing depth yields diminishing returns in discovering new genes.
  • Typical Range: A range of 40% – 85% is considered ideal.
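
The two formulas quoted in this table, mean reads per cell and sequencing saturation, are simple enough to check by hand. A minimal sketch follows. The function names are hypothetical, and the inputs are the quantities named in the definitions above.

```python
def mean_reads_per_cell(total_raw_reads, estimated_cells):
    """Total number of raw sequencing reads / Estimated number of cells."""
    if estimated_cells <= 0:
        raise ValueError("estimated_cells must be positive")
    return total_raw_reads / estimated_cells


def sequencing_saturation(deduplicated_umis, total_reads):
    """1 - (number of deduplicated UMIs / total number of reads)."""
    if total_reads <= 0:
        raise ValueError("total_reads must be positive")
    return 1.0 - deduplicated_umis / total_reads
```

As a worked case, 300,000,000 raw reads over 10,000 estimated cells gives 30,000 reads per cell, and 40 million deduplicated UMIs out of 100 million reads gives a saturation of 0.6 (60%).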
-----------

🔬 Sequencing Metrics

🎯 Core Function: Basic quality assessment of sequencing data, including barcode identification rate, UMI quality, and sequencing accuracy.

📊 Quality Control Standards:

Note: The following standards are for reference only. Actual quality assessment should consider multiple factors such as tissue type, cell state, and experimental goals. Significant differences may exist between different samples, and it is recommended to make judgments based on the specific experimental context.

| Metric Category | Recommended | Acceptable | Needs Improvement |
|---|---|---|---|
| Valid barcodes | ≥ 80% | 70–80% | < 70% |
| Valid UMIs | ≥ 80% | 70–80% | < 70% |
| Q30 Base Quality | ≥ 85% | 75–85% | < 75% |

🔍 Detailed Metric Explanations:

Metric Name Detailed Explanation and Technical Requirements
Number of reads
Total Number of Reads
  • Definition: The total number of raw sequencing read pairs assigned to this sample.
  • Significance: Represents the overall data volume of this sequencing run. Theoretically, a higher number of reads provides more comprehensive coverage of the cell's transcriptome.
Valid barcodes
Valid Barcode Fraction
  • Definition: The proportion of all reads whose Cell Barcode can be matched to a preset whitelist (after error correction).
  • Biological Significance: Reflects the effectiveness of cell labeling.
  • Quality Interpretation: A very low proportion usually indicates sample quality issues leading to barcode degradation and adapter contamination, or a high error rate during the sequencing process.
Valid UMIs
Valid UMI Fraction
  • Definition: The proportion of all reads whose Unique Molecular Identifier (UMI) sequence does not contain 'N' bases and is not a homopolymer (e.g., AAAAAA).
  • Biological Significance: Reflects the sequencing quality of the UMI sequence, which is key for accurate molecular counting.
Q30 bases in barcode/UMI/read
Q30 Base Fraction
  • Definition: The proportion of bases with a sequencing quality score of Q30 or higher in the cell barcode, UMI, and RNA read sequences.
  • Significance: A Q30 score corresponds to a base-call error probability of 0.1% (1 in 1,000), so bases at Q30 or higher have an error rate of at most 0.1%. This metric directly affects the accuracy of cell identification, molecular counting, and gene alignment.
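
The Valid UMI rule and the Q30 definition above can be illustrated with a few lines of Python. This is a sketch under stated assumptions, not the pipeline's own code: the function names are hypothetical, quality strings are assumed to use the common Phred+33 FASTQ encoding, and the Q30 fraction is computed for a single quality string rather than across a whole run.

```python
def is_valid_umi(umi):
    """True when the UMI has no 'N' base and is not a homopolymer."""
    umi = umi.upper()
    return bool(umi) and "N" not in umi and len(set(umi)) > 1


def phred_to_error_probability(q):
    """Convert a Phred quality score to a base-call error probability."""
    return 10 ** (-q / 10)


def q30_fraction(quality_string, offset=33):
    """Fraction of bases with Phred score >= 30 in one quality string.

    offset=33 assumes Phred+33 encoding, which is an assumption here.
    """
    if not quality_string:
        return 0.0
    scores = [ord(ch) - offset for ch in quality_string]
    return sum(score >= 30 for score in scores) / len(scores)
```

A UMI such as `AAAAAAAAAA` is rejected as a homopolymer, and a Phred score of 30 maps to an error probability of 0.001.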

Note: All proportions above are calculated based on the total number of raw sequencing reads (Number of Reads), ensuring comparability and consistency across metrics.

-----------

πŸ—ΊοΈ Mapping Metrics

🎯 Core Function: To assess the quality of read alignment to the reference genome, including alignment rate, specificity, and genomic region distribution.

πŸ“Š Quality Control Standards:

Note: The following standards are for reference only. Actual quality assessment should consider multiple factors such as tissue type, cell state, and experimental goals. Significant differences may exist between different samples, and it is recommended to make judgments based on the specific experimental context.

Metric Name Recommended Acceptable Needs Improvement
Reads mapped to genome β‰₯ 80% 50–80% < 50%
Reads mapped confidently to transcriptome β‰₯ 50% 30-50% < 30%
Reads mapped antisense to gene < 10% 10-30% > 30%

πŸ” Detailed Metric Explanations:

Metric Name Detailed Explanation and Technical Requirements
Reads mapped to genome
Genome Alignment Rate
  • Definition: The proportion of all reads that successfully align to any location on the reference genome (including unique and multiple alignments).
  • Quality Interpretation:
    • Needs Attention: A rate below 50% may indicate sample contamination (e.g., bacteria) or species mismatch.
Reads mapped confidently to genome
Confident Genome Alignment Rate
  • Definition: The proportion of all reads that align with high quality (STAR MAPQ value of 255) to a unique location on the genome.
  • Technical Detail: A multi-mapping read is reclassified as a confident read in one specific case: when it aligns to both an exonic region and one or more non-exonic regions, the pipeline keeps the exonic alignment.
  • Biological Significance: This forms the basis of valid data for gene expression quantification and regional analysis. A low proportion may be caused by repetitive sequences, poor sequence quality, or a mismatched reference genome.
Reads mapped confidently to transcriptome
Confident Transcriptome Alignment Rate
  • Definition: The proportion of all reads that can be uniquely aligned with high confidence to a single gene (including exons and introns by default).
  • Technical Detail: To ensure quantification accuracy, if a read's alignment region overlaps with multiple different genes, the read is considered of ambiguous origin and filtered out.
  • Biological Significance: This is a core metric for assessing library quality and data reliability. A higher proportion means more valid data for downstream quantitative analysis and more reliable results.
Reads mapped confidently to exonic regions
Exonic Region Alignment Rate
  • Definition: The proportion of reads confidently mapped to the genome that fall into annotated exonic regions.
  • Technical Detail: A read is considered confidently mapped to an exonic region only if at least 50% of it falls within an exonic region.
  • Biological Significance: This is the main source of mature mRNA and a core metric for assessing library quality. In standard whole-cell scRNA-seq, this proportion should be high.
Reads mapped confidently to intronic regions
Intronic Region Alignment Rate
  • Definition: The proportion of reads confidently mapped to the genome that fall into annotated intronic regions.
  • Technical Detail: A read is considered confidently mapped to an intronic region only if it does not meet the criteria for exonic region classification and intersects with an intronic region.
  • Biological Significance: A high proportion usually indicates the capture of a large amount of unspliced pre-mRNA. This is expected in nuclear sequencing (snRNA-seq).
Reads mapped confidently to intergenic regions
Intergenic Region Alignment Rate
  • Definition: The proportion of reads confidently mapped to the genome that do not fall into any annotated gene (including exons and introns).
  • Quality Interpretation: An excessively high proportion may suggest incomplete gene annotation or non-specific amplification in the library.
Reads mapped antisense to gene
Antisense Alignment Rate
  • Definition: The proportion of reads that successfully align to a gene region but in the opposite direction to the annotated gene.
  • Quality Interpretation: An excessively high proportion may indicate directionality issues during library construction or the presence of unknown antisense transcripts.
Include introns
Include Introns
  • Definition: Controls whether reads aligned to intronic regions are included in gene expression counts.
  • Enabled State (Default): When set to `True`, reads from intronic regions are counted towards the expression of the corresponding gene. This mode captures gene activity more comprehensively, especially suitable for nuclear sequencing or scenarios requiring pre-mRNA analysis.
  • Disabled State: When set to `False`, only exonic reads are counted towards gene expression. This mode focuses on the quantification of mature mRNA.
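
The region rules described in this table (at least 50% exonic overlap for exonic, otherwise any intronic overlap for intronic, otherwise intergenic) and the Include introns switch can be sketched as follows. The function names and the base-count inputs are hypothetical. How the pipeline treats a read with a small exonic overlap and no intronic overlap is not stated here, so this sketch labels it intergenic as an assumption.

```python
def classify_region(read_length, exonic_bases, intronic_bases):
    """Assign a confidently mapped read to exonic, intronic, or intergenic."""
    if read_length <= 0:
        raise ValueError("read_length must be positive")
    if exonic_bases / read_length >= 0.5:
        return "exonic"  # at least 50% of the read lies in an exon
    if intronic_bases > 0:
        return "intronic"  # not exonic, but intersects an intron
    return "intergenic"  # assumption: everything else counts as intergenic


def counts_toward_expression(region, include_introns=True):
    """Whether a read in this region contributes to gene expression counts."""
    if region == "exonic":
        return True
    if region == "intronic":
        return include_introns
    return False
```

With the default `include_introns=True`, both exonic and intronic reads are counted; setting it to `False` restricts counting to exonic reads, matching the two states described above.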

Note: All proportions above are calculated based on the total number of raw sequencing reads (Number of Reads), ensuring comparability and consistency across metrics.

-----------

📈 Interactive Visualization Chart Interpretation

🎯 Core Function: Interactive charts that visualize the data from cell quality control through downstream biological analysis.

📊 Visualization Chart Group One: Cell Quality Control Analysis

📊 Barcode Rank Plot

Chart Function:
This plot distinguishes high-quality real cells from background noise by ranking all cells by their UMI count.
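
The ranking behind this plot is a descending sort of per-barcode UMI counts. A minimal sketch is shown below; the name `barcode_rank` is hypothetical and the input is simply a list of UMI counts, one per barcode.

```python
def barcode_rank(umi_counts):
    """Return (rank, umi_count) pairs, where rank 1 has the most UMIs."""
    ordered = sorted(umi_counts, reverse=True)
    return list(enumerate(ordered, start=1))
```

Plots of this kind are conventionally drawn with both axes on a logarithmic scale, so that the drop from cell-containing barcodes to background barcodes is visible.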


How to Interpret:

-----------
📊 Droplet Beads Distribution

Chart Function:
Displays the distribution of the number of captured cell barcodes (Beads) in real cell droplets.

How to Interpret:

-----------
📊 Cell Data Distribution

Chart Function:
Through three separate violin plots, it shows the distribution of high-quality cells across three key quality metrics: number of genes (nGenes), number of UMIs (nUMI), and mitochondrial gene percentage (percent.mt).

How to Interpret:




📊 Visualization Chart Group Two: Downstream Biological Analysis

🎯 Core Function: A comprehensive display of cell clustering analysis, differential gene identification, cell type annotation, and sequencing depth assessment.

🌀 Cluster Analysis

Chart Function:
Using UMAP dimensionality reduction and the Louvain clustering algorithm, cells with similar gene expression patterns are grouped together in a 2D space, thereby identifying potential cell subpopulations.

How to Interpret:

-----------
📈 Marker Genes Analysis

Chart Function:
Displays the characteristic differentially expressed genes for each cell cluster, used to identify and annotate different cell types.

How to Interpret:

-----------
🧬 Cell Type Annotation

Chart Function:
On the UMAP plot, each cluster is labeled with a cell type inferred from a reference database (e.g., scHCL, scMCA).

How to Interpret:

-----------
📊 Sequencing Saturation Curve

Chart Function:
Assesses the adequacy of sequencing depth and data complexity, i.e., whether further increasing the sequencing volume can lead to the discovery of more new genes or UMIs.
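
One common way to obtain such a curve is to downsample the reads and recompute saturation at each depth. The sketch below illustrates that idea only; it is not necessarily how the pipeline draws its curve. Each read is represented by a hypothetical (barcode, UMI, gene) key, and saturation at each depth is taken as 1 minus the ratio of distinct keys to sampled reads.

```python
import random


def saturation_curve(read_keys, fractions, seed=0):
    """Estimate saturation at several downsampling fractions.

    read_keys holds one hashable key per read, for example a
    (barcode, UMI, gene) tuple. Returns (fraction, saturation) pairs.
    """
    shuffled = list(read_keys)
    random.Random(seed).shuffle(shuffled)  # fixed seed for reproducibility
    curve = []
    for fraction in fractions:
        n_reads = int(len(shuffled) * fraction)
        if n_reads == 0:
            curve.append((fraction, 0.0))
            continue
        # Prefixes of one shuffle give nested random subsamples.
        distinct = len(set(shuffled[:n_reads]))
        curve.append((fraction, 1.0 - distinct / n_reads))
    return curve
```

A curve that flattens as the fraction approaches 1.0 suggests that additional sequencing would mostly re-observe molecules that were already captured.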

How to Interpret:


🎯 More Resources

| Document Type | Resource Link and Description |
|---|---|
| 🚀 Quick Start | Quick Start Guide - A complete tutorial for your first analysis. |
| ⚙️ Parameter Reference | Parameter Reference Manual - Detailed descriptions of all configurable parameters. |
| 🔬 Analysis Pipeline | Analysis Pipeline Description - Technical details of the entire analysis pipeline. |
| 🔧 Installation & Configuration | Installation & Configuration Guide - System requirements, installation steps, and environment configuration. |

💡 Tip

This document is continuously updated. If you find any errors or need additional information, please provide feedback.

πŸ“ Document Version: 3.0 beta | Last Updated: 2025


🔬 DNBelab C Series HT scRNA Analysis Software
A High-Performance Single-Cell RNA Sequencing Data Analysis Pipeline