🧬 DNBelab C Series HT scRNA Analysis Output Documentation

A Complete Guide to Single-Cell RNA Sequencing Analysis Output Files

📁 Directory Structure • 📋 File Details • 🧬 Data Matrix • 📊 Analysis Results • 📊 Report Interpretation

📖 Overview

After the single-cell RNA analysis is complete, a standardized file and subdirectory structure is generated in the specified output directory, specifically for gene expression profile analysis and cell type identification. This document details the content, format, and purpose of each output file to help users fully understand and efficiently utilize the single-cell RNA analysis results.

💡 Tip: All output files use standard formats compatible with mainstream single-cell analysis tools (such as Scanpy, Seurat, etc.) and follow internationally recognized data format specifications.

📁 Output Directory Structure

.
├── analysis/                      # Downstream analysis results directory
│   ├── cluster.csv                # Cell clustering results file
│   ├── marker.csv                 # Differentially expressed gene marker file
│   └── QC_Cluster.h5ad            # AnnData object after quality control and clustering
├── anno_decon_sorted.bam          # Aligned, annotated, and sorted BAM file
├── anno_decon_sorted.bam.bai      # BAM index file
├── filter_feature.h5ad            # Filtered feature matrix (AnnData format)
├── filter_matrix/                 # Filtered gene expression matrix directory
│   ├── barcodes.tsv.gz            # Cell barcode file
│   ├── features.tsv.gz            # Gene/feature information file
│   └── matrix.mtx.gz              # Sparse matrix file (Matrix Market format)
├── metrics_summary.xls            # Analysis metrics summary table
├── raw_matrix/                    # Raw gene expression matrix directory
│   ├── barcodes.tsv.gz            # Raw cell barcode file
│   ├── features.tsv.gz            # Raw gene/feature information file
│   └── matrix.mtx.gz              # Raw sparse matrix file
├── singlecell.csv                 # Single-cell metadata information table
└── *_scRNA_report.html            # Analysis report in HTML format

📋 Detailed File Description

🧬 Alignment and Annotation Files

🎯 Core Content: Result files from aligning raw sequencing data to the reference genome, containing complete alignment information and cell barcode tags.

-----------

📄 anno_decon_sorted.bam

This is the scRNA-seq alignment result file. It holds the read-level alignment data from which the expression matrices are derived.

Core Purpose:
- In-depth Analysis and Visualization: Can be loaded into genome browsers such as IGV to inspect read alignments and splicing patterns at specific gene loci.
- Custom Analysis: Provides raw input for users who need to directly manipulate alignment-level data, such as for alternative splicing analysis, RNA velocity analysis, etc.
Content and Format:
- Uses the international standard BAM (Binary Alignment Map) format.
- The file is sorted by genomic coordinates and indexed (the .bai file), allowing for fast random access.
- Each read is tagged with cell origin, UMI, and gene annotation information through TAG fields.

Key TAG Field Descriptions:

The BAM file uses rich TAG fields to store single-cell specific information, mainly divided into cell/molecule identifiers and gene annotations.

🧬 Cell and Molecular Identifier Tags:

Tag	Type	Description	Biological Significance
`CB`	String	Cell ID after merging cell barcodes	Used to assign reads to a specific cell; it is the final cell ID after error correction and merging.
`CC`	String	Error-corrected cell barcode sequence	The corrected cell barcode, an intermediate step in generating the `CB` tag.
`CR`	String	Raw sequencing cell barcode	Retains original sequencing information for quality assessment and error tracing.
`CY`	String	Cell barcode quality score	Phred quality score, assessing the reliability of barcode sequencing.
`UB`	String	Error-corrected UMI sequence	Used for molecular deduplication to identify PCR duplicates and original mRNA molecules.
`UR`	String	Raw sequencing UMI sequence	Retains original UMI information for quality assessment and algorithm optimization.
`UY`	String	UMI quality score	Phred quality score, assessing the accuracy of UMI sequencing.

🧬 Gene Annotation and Functional Tags:

Tag	Type	Description	Functional Purpose
`GX`	String	Ensembl ID	The primary ID for gene expression quantification.
`GN`	String	Gene name	Facilitates biological interpretation and supports gene function annotation.
`TX`	String	Transcript ID	Used for transcript-level expression analysis and alternative splicing studies.
`AN`	String	Antisense transcript tag	Identifies antisense RNA, assessing library directionality and non-coding RNA expression.
`RE`	String	Genomic region type	Distinguishes between Exon (E), Intron (N), and Intergenic (I) regions for transcriptome feature analysis.
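
As an illustration of how these tags can be used, the sketch below parses the optional TAG fields from SAM-formatted text (for example, the output of `samtools view anno_decon_sorted.bam`) and counts distinct corrected UMIs per cell and gene. The helper names, gene ID, and example records are hypothetical, and this is not the pipeline's own quantification code.

```python
from collections import defaultdict


def parse_sam_tags(sam_line):
    """Return the optional TAG:TYPE:VALUE fields of one SAM record as a dict."""
    fields = sam_line.rstrip("\n").split("\t")
    tags = {}
    for field in fields[11:]:  # the first 11 SAM columns are mandatory
        tag, _value_type, value = field.split(":", 2)
        tags[tag] = value
    return tags


def count_umis(sam_lines):
    """Count distinct corrected UMIs (UB) for each (cell CB, gene GX) pair."""
    umis = defaultdict(set)
    for line in sam_lines:
        if line.startswith("@"):  # skip SAM header lines
            continue
        tags = parse_sam_tags(line)
        if all(key in tags for key in ("CB", "UB", "GX")):
            umis[(tags["CB"], tags["GX"])].add(tags["UB"])
    return {pair: len(seen) for pair, seen in umis.items()}


def make_record(name, cb, ub, gx):
    """Build a hypothetical SAM record carrying the tags described above."""
    mandatory = [name, "0", "chr1", "100", "255", "10M", "*", "0", "0",
                 "ACGTACGTAC", "FFFFFFFFFF"]
    return "\t".join(mandatory + [f"CB:Z:{cb}", f"UB:Z:{ub}", f"GX:Z:{gx}"])


records = [
    make_record("r1", "CELL1_N2", "AAACCCGGGT", "ENSG00000000001"),
    make_record("r2", "CELL1_N2", "AAACCCGGGT", "ENSG00000000001"),  # same UMI: PCR duplicate
    make_record("r3", "CELL1_N2", "TTTGGGCCCA", "ENSG00000000001"),
]
umi_counts = count_umis(records)
print(umi_counts)
```

Reads that lack any of the three tags are skipped, which is a simplifying assumption of this sketch.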

-----------

📄 anno_decon_sorted.bam.bai

The index file for anno_decon_sorted.bam.

Core Purpose:
- Fast Data Access: Allows tools like IGV and Samtools to quickly jump to and read alignment data for any genomic region without loading the entire BAM file.
- Performance Guarantee: Ensures performance for all random access operations on the BAM file.

Format and Description:

The index file is generated by the samtools index command. To accommodate genomes of different sizes, the pipeline automatically selects the appropriate index format (BAI or CSI).

Format Type	Usage Instructions
BAI Format	The default index format, offering the best compatibility and suitable for most analysis tools and genomes.
CSI Format	Automatically generated when the BAM file contains chromosomes longer than 512 Mbp (2^29-1 bp) to support very large genomes.
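
The selection rule in the table can be sketched as follows. This is a minimal illustration of the size limit, not the pipeline's actual code; the function name and the example contig names are assumptions. With samtools, a CSI index is typically requested with `samtools index -c`.

```python
BAI_MAX_CONTIG_LENGTH = 2**29 - 1  # longest contig (in bp) a BAI index can address


def choose_index_format(contig_lengths):
    """Return "CSI" if any contig is too long for BAI, otherwise "BAI"."""
    if any(length > BAI_MAX_CONTIG_LENGTH for length in contig_lengths.values()):
        return "CSI"
    return "BAI"


human_like = {"chr1": 248_956_422}   # human chr1 length in GRCh38
large_genome = {"chrA": 600_000_000}  # hypothetical very long chromosome

print(choose_index_format(human_like))
print(choose_index_format(large_genome))
```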

📈 Feature Matrix Files

🎯 Core Content: Single-cell gene expression count matrices, divided into raw and quality-controlled filtered data, using standard sparse matrix or AnnData format.

📁 Filtered Gene Expression Matrix (`filter_matrix/`)

Contains the gene expression count matrix after filtering for high-quality cells, which is the core data for downstream quantitative analysis.

Core Purpose:
- Downstream Quantitative Analysis: Serves as the primary input for analyses such as cell clustering and differential expression analysis.
- High-Quality Data: Includes only barcodes identified as real cells, ensuring the accuracy of the analysis results.

Content and Format:

Uses the standard Market Exchange (MEX) format (for details, see the File Format Description section), consisting of the following three compressed files:

File Name	Content Description
`barcodes.tsv.gz`	A list of cell IDs, identifying high-quality cells that passed QC. Each line contains one cell ID, corresponding to the column index of the matrix.
`features.tsv.gz`	A gene/feature information file, containing gene ID, name, and type. Each line contains three columns of information, corresponding to the row index of the matrix.
`matrix.mtx.gz`	The gene expression count matrix in Matrix Market format. Contains the matrix dimensions, followed by the row index, column index, and value of each non-zero element.

Format Advantages:
- Space Efficient: The sparse matrix format (.mtx) only stores non-zero elements, greatly saving storage space.
- Highly Compatible: The MEX format is a standard in the single-cell community, compatible with almost all mainstream analysis tools like Seurat and Scanpy.

-----------

📁 Raw Gene Expression Matrix (`raw_matrix/`)

Contains the raw gene expression count matrix for all detected cell barcodes (unfiltered).

Core Purpose:
- Quality Control Assessment: Can be used to evaluate the effectiveness of cell filtering or to perform manual filtering based on custom criteria.
- Data Integrity: Retains all original data, which can be used for deep mining or re-analysis if needed.
Content and Format:
- Uses the standard Market Exchange (MEX) format, with a file composition identical to the filter_matrix/ directory.
- Includes all detected barcodes, including high-quality cells, low-quality cells, and background droplets.

-----------

📄 filter_feature.h5ad

The feature matrix after cell identification and filtering, stored in AnnData (.h5ad) format. It is an alternative and supplement to the contents of the filter_matrix/ directory.

Core Purpose:
- Python Ecosystem Integration: Serves as the standard input format for Python single-cell analysis libraries like scanpy, seamlessly connecting to downstream analysis.
- Data Integration: A single file can encapsulate the expression matrix, cell metadata, and gene metadata, making it easy to manage and share.
Content and Format:
- A binary format based on HDF5. For details, refer to the AnnData Format Description.

📊 Analysis Results Directory (`analysis/`)

🎯 Core Content: Results of downstream bioinformatics analysis, including cell clustering, differential genes, and post-QC data.

-----------

📄 cluster.csv

The cell clustering analysis result file in CSV format. It contains each cell's ID, its assigned cluster, dimensionality reduction coordinates, and key QC metrics.

Core Purpose:
- Clustering Result Visualization: Can be directly used in plotting software to visualize UMAP dimensionality reduction results.
- Basis for Cell Annotation: Provides basic grouping information for manual or automatic cell type annotation.
Content and Format:
- Each row represents a high-quality cell, with major columns including:
  - Barcode: Cell ID
  - Cluster: The cluster number the cell belongs to
  - UMAP_1, UMAP_2: The 2D coordinates from UMAP dimensionality reduction
  - nGene, nUMI: The number of genes and UMIs detected in each cell
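
A minimal sketch of reading this table with Python's standard library and counting cells per cluster is shown below. The rows are hypothetical and use the column names listed above; the real file may contain additional columns.

```python
import csv
import io
from collections import Counter


def cells_per_cluster(csv_file):
    """Count the cells assigned to each cluster in a cluster.csv-style table."""
    reader = csv.DictReader(csv_file)
    return Counter(row["Cluster"] for row in reader)


# Hypothetical rows using the columns described above
example = io.StringIO(
    "Barcode,Cluster,UMAP_1,UMAP_2,nGene,nUMI\n"
    "CELL1_N2,0,1.25,-3.10,1500,4200\n"
    "CELL2_N1,0,1.40,-2.95,1320,3900\n"
    "CELL3_N1,1,-4.75,0.60,980,2100\n"
)
cluster_sizes = cells_per_cluster(example)
print(cluster_sizes)
```

For a real run, the same function can be called on `open("analysis/cluster.csv", newline="")`.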

-----------

📄 marker.csv

A list of differentially expressed genes (marker genes) for each cluster, in CSV format. It records information such as the significance of each gene's expression in a specific cluster and changes in expression levels.

Core Purpose:
- Cell Type Identification: By looking up known marker genes for cell types, it allows for biological annotation of unsupervised clustering results.
- Functional Enrichment Analysis: Can be used as an input gene list for subsequent functional enrichment analyses like GO and KEGG.
Content and Format:
- Each row represents the differential expression information of a gene in a cluster, with major columns including:
  - cluster: The cluster number for which the gene is a marker
  - gene: Gene name
  - avg_log2FC: Average log2 fold change
  - p_val_adj: Adjusted p-value, assessing statistical significance
  - pct.1, pct.2: The proportion of cells expressing the gene in the target cluster versus other clusters
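
One common way to use this table is to keep significant genes and rank them by fold change within each cluster. The sketch below assumes the column names listed above; the cutoff values, function name, and example rows are illustrative choices, not pipeline defaults.

```python
import csv
import io


def top_markers(csv_file, max_p_adj=0.05, top_n=2):
    """Return up to top_n significant genes per cluster, ranked by avg_log2FC."""
    by_cluster = {}
    for row in csv.DictReader(csv_file):
        if float(row["p_val_adj"]) < max_p_adj:
            by_cluster.setdefault(row["cluster"], []).append(
                (float(row["avg_log2FC"]), row["gene"])
            )
    return {
        cluster: [gene for _fc, gene in sorted(genes, reverse=True)[:top_n]]
        for cluster, genes in by_cluster.items()
    }


# Hypothetical values for illustration only
example = io.StringIO(
    "cluster,gene,avg_log2FC,p_val_adj,pct.1,pct.2\n"
    "0,CD3E,2.1,1e-30,0.90,0.10\n"
    "0,CD3D,2.5,1e-40,0.92,0.08\n"
    "0,ACTB,0.2,0.80,0.99,0.99\n"
    "1,MS4A1,3.0,1e-50,0.85,0.02\n"
)
markers = top_markers(example)
print(markers)
```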

-----------

📄 QC_Cluster.h5ad

A single-cell data object that has undergone complete quality control, dimensionality reduction, and clustering analysis, in AnnData (.h5ad) format. It integrates the upstream expression matrix with downstream analysis results.

Core Purpose:
- Analysis Reproduction and Exploration: Contains the complete analysis workflow and results, and can be directly loaded in scanpy for in-depth exploratory analysis or visualization.
- Data Delivery: Serves as a delivery file for final analysis results, with a clear structure and complete information.
Content and Format:
- Builds on filter_feature.h5ad by adding the following information:
  - obs: Contains cell metadata such as clustering results (cluster).
  - obsm: Contains dimensionality reduction coordinates (X_umap).
  - uns: Contains unstructured results such as marker genes (marker_genes).

📝 Analysis Metrics Summary

🎯 Core Content: A summary of experimental quality assessment and statistical metrics, providing comprehensive data quality control information.

📄 metrics_summary.xls

A summary table of key analysis metrics in Excel format, providing a comprehensive assessment of the overall quality of the experiment.

Core Purpose:
- Quality Assessment: Quickly evaluate core metrics such as sequencing data quality, alignment efficiency, and cell identification results.
- Results Overview: Provides a comprehensive understanding of the analysis results without needing to view all files.

Content and Format:

Includes three main categories of key metrics:

Metric Category	Included Content
Basic Statistics	Total reads, valid barcode ratio, UMI quality, Q30 base quality, and other basic sequencing metrics.
Cell Identification	Estimated number of cells, median genes/UMIs per cell, sequencing saturation, and other cell calling results.
Alignment Metrics	Genome alignment rate, transcriptome alignment rate, exon/intron ratio, and other alignment statistics.

For convenience, the recommended quality thresholds are listed below:
- ✅ Valid Barcode Fraction: >70%
- ✅ Q30 Base Quality: >75% (for barcode and UMI regions)
- ✅ Reads Mapped Confidently to Transcriptome: >30%
- ✅ Fraction Reads in Cells: >50% (or >30% for nuclear samples)
- ✅ Mean Reads per Cell: >15,000
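
The thresholds above can be applied programmatically once the metrics have been read into a dictionary. In the sketch below, the dictionary keys are hypothetical labels (the exact metric names in metrics_summary.xls may differ), and the sample values are invented for illustration.

```python
# Minimum values taken from the recommended thresholds listed above.
# The keys are hypothetical labels, not necessarily the names used in the file.
THRESHOLDS = {
    "Valid barcode fraction (%)": 70,
    "Q30 bases in barcode/UMI (%)": 75,
    "Reads mapped confidently to transcriptome (%)": 30,
    "Fraction reads in cells (%)": 50,
    "Mean reads per cell": 15_000,
}


def failed_checks(metrics, thresholds=THRESHOLDS):
    """Return the metrics that are missing or do not exceed their threshold."""
    return [
        name for name, minimum in thresholds.items()
        if metrics.get(name) is None or metrics[name] <= minimum
    ]


sample_metrics = {  # invented values
    "Valid barcode fraction (%)": 92.5,
    "Q30 bases in barcode/UMI (%)": 88.0,
    "Reads mapped confidently to transcriptome (%)": 61.3,
    "Fraction reads in cells (%)": 42.0,
    "Mean reads per cell": 28_000,
}
print(failed_checks(sample_metrics))

# Nuclear samples use the relaxed 30% cutoff for fraction reads in cells
nuclear_thresholds = dict(THRESHOLDS, **{"Fraction reads in cells (%)": 30})
print(failed_checks(sample_metrics, nuclear_thresholds))
```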

-----------

📄 singlecell.csv

A single-cell level quality control information table in CSV format, recording detailed statistical data for each cell barcode.

Core Purpose:
- Fine-grained QC: Allows users to perform more detailed cell filtering and analysis based on custom criteria.
- Input for Downstream Analysis: Can be used as cell metadata input for downstream analysis tools, supporting cell filtering and bead merging operations in VDJ analysis.
Content and Format:
- Each row represents a cell barcode.
- Major columns include UMI count, gene count, mitochondrial gene fraction, whether the barcode was identified as a high-quality cell, and bead merging information.

-----------

📄 *_scRNA_report.html

An interactive comprehensive analysis report in HTML web format.

Core Purpose:
- Result Visualization: Intuitively displays key analysis results such as QC results, cell clustering, and marker genes in the form of interactive charts.
- Result Interpretation: Provides biological significance and technical explanations for various metrics to help users deeply interpret the data.
- Convenient Sharing: A single HTML file, easy to circulate and share.
Content and Format:
- Can be opened in any modern browser without an internet connection.
- For a detailed interpretation of the report, please refer to the Web Report Interpretation section below.

📄 File Format Description

Technical Specifications: Detailed descriptions of the standard formats used for output files.

📊 Matrix Market Format (`.mtx.gz`)

The Market Exchange Format (MEX) is a standard format used in single-cell analysis for storing sparse count matrices, offering advantages of space efficiency and high compatibility.

Core Advantages:
- Space Efficient: The sparse matrix only stores non-zero elements, which can greatly save storage space for single-cell data where over 95% of values are typically zero.
- Highly Compatible: As an international standard format, it can be directly read by almost all mainstream analysis tools like Seurat and Scanpy.

File Composition:

A complete MEX format dataset consists of the following three files:

File Name	Description
`matrix.mtx.gz`	A compressed sparse matrix file. The header contains matrix dimensions, and each subsequent line records the position (row/column index) and value of a non-zero element.
`barcodes.tsv.gz`	A compressed cell barcode file. Each line is a cell ID, and the line number corresponds to the matrix's column. IDs look like `CELL1_N2`, where `CELL1` is the cell ID and `N2` indicates that the cell combines two bead barcodes.
`features.tsv.gz`	A compressed feature (gene) file. Each line contains information like gene ID and gene name, and the line number corresponds to the matrix's row.
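
To make the relationship between the three files concrete, the sketch below reads a MEX directory using only the Python standard library and maps each non-zero entry back to its gene and cell. The tiny dataset it writes first is hypothetical (including the gene IDs and the `Gene Expression` feature type), and the function names are illustrative.

```python
import gzip
import os
import tempfile


def read_lines(path):
    """Read a gzip-compressed text file into a list of lines."""
    with gzip.open(path, "rt") as handle:
        return [line.rstrip("\n") for line in handle]


def read_mex(directory):
    """Return (barcodes, features, counts) with counts keyed by (gene ID, cell ID)."""
    barcodes = read_lines(os.path.join(directory, "barcodes.tsv.gz"))
    features = [line.split("\t")
                for line in read_lines(os.path.join(directory, "features.tsv.gz"))]
    # Drop "%" header/comment lines; the first remaining line holds the dimensions
    body = [line for line in read_lines(os.path.join(directory, "matrix.mtx.gz"))
            if line and not line.startswith("%")]
    n_rows, n_cols, n_entries = map(int, body[0].split())
    counts = {}
    for entry in body[1:]:
        row, col, value = entry.split()
        # Indices are 1-based: rows are features, columns are barcodes
        counts[(features[int(row) - 1][0], barcodes[int(col) - 1])] = int(value)
    if (len(features), len(barcodes), len(counts)) != (n_rows, n_cols, n_entries):
        raise ValueError("matrix dimensions do not match barcodes/features files")
    return barcodes, features, counts


def write_gz(path, text):
    """Write text to a gzip-compressed file."""
    with gzip.open(path, "wt") as handle:
        handle.write(text)


with tempfile.TemporaryDirectory() as tmp:
    write_gz(os.path.join(tmp, "barcodes.tsv.gz"), "CELL1_N2\nCELL2_N1\n")
    write_gz(os.path.join(tmp, "features.tsv.gz"),
             "GENE_ID_1\tGeneA\tGene Expression\n"
             "GENE_ID_2\tGeneB\tGene Expression\n")
    write_gz(os.path.join(tmp, "matrix.mtx.gz"),
             "%%MatrixMarket matrix coordinate integer general\n"
             "2 2 3\n1 1 5\n2 1 1\n1 2 7\n")
    barcodes, features, counts = read_mex(tmp)

print(counts)
```

In practice, such directories are usually loaded with existing readers such as `scanpy.read_10x_mtx` in Python or `Read10X` in Seurat rather than parsed by hand.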

🗃️ AnnData Format (`.h5ad`)

Format Overview: AnnData ("Annotated Data") is a data structure designed for matrix-like data, particularly suitable for single-cell RNA sequencing data analysis. Based on the HDF5 format, it provides efficient data storage and access capabilities.

🏗️ Data Structure

📁 Component	🎯 Function	📏 Dimensions
X	Main expression matrix	n_cells × n_genes
obs	Cell metadata	n_cells × n_obs_features
var	Gene metadata	n_genes × n_var_features
obsm	Cell multidimensional data	n_cells × n_components
varm	Gene multidimensional data	n_genes × n_components
layers	Multi-layer data	n_cells × n_genes
uns	Unstructured data	Any object

📊 Web Report Interpretation

🎯 Overview: The HTML web report integrates the complete results of the single-cell RNA sequencing analysis, from data quality control to downstream biological analysis. Its interactive visualizations, together with evaluations of key performance indicators, help users quickly assess experimental quality, understand the analysis results, and guide future research directions.

💡 Usage Suggestion: It is recommended to review the metrics in the order they are presented in the report.

⚠️ Quality Standards: Each metric is provided with recommended thresholds and quality levels. Please conduct a comprehensive evaluation based on specific experimental goals.

📊 Main Content and Structure of the Report

🧬 Detailed Explanation of Core Analysis Metrics

🧬 Cell Metrics

🎯 Core Function: Cell identification, quality assessment, and gene expression statistics, providing key indicators of the overall effectiveness of the experiment.

📊 Quality Control Standards:

Note: The following standards are for reference only. Actual quality assessment should consider multiple factors such as tissue type, cell state, and experimental goals. Significant differences may exist between different samples, and it is recommended to make judgments based on the specific experimental context.

Metric Name	Recommended	Acceptable	Needs Improvement
Mean reads per cell	≥ 30,000	15,000–30,000	< 15,000
Median genes per cell	≥ 1,000	500–1,000	< 500
Fraction reads in cells	≥ 60%	30–60%	< 30%
Sequencing saturation	≥ 40%	20–40%	< 20%

🔍 Detailed Metric Explanations:

Metric Name	Detailed Explanation and Technical Requirements
Estimated number of cells (Estimated Cell Count)	Definition: The total number of valid cells (as opposed to background noise or empty droplets) identified from the sequencing data. Calculation Process: After cell barcodes from the same droplet are merged, real cells are called using an empty-droplet model (EmptyDrops). Quality Interpretation: An abnormal value may be caused by inaccurate cell counting, cell lysis, poor sample or library quality, or low sequencing depth.
Species (Species Information)	Definition: The species or reference genome version used for the analysis. Description: This information comes from the reference genome database used for the analysis and helps ensure the accuracy of alignment and annotation.
Mean reads per cell	Definition: The average number of raw sequencing reads allocated to each cell. Calculation: `Total number of raw sequencing reads / Estimated number of cells` Quality Interpretation: A value ≥ 30,000 is recommended to ensure sufficient transcript coverage.
Median/Mean UMI per cell (Median/Mean UMIs per Cell)	Definition: The median/mean number of unique molecular identifiers (UMIs) detected in each cell. Biological Significance: Used to assess gene expression levels in single cells; UMI counts reflect the abundance of original mRNA molecules more accurately than read counts. Quality Interpretation: This metric is affected by cell type, sequencing depth, and library quality. A low value may indicate insufficient sequencing depth or poor sample quality.
Median/Mean genes per cell	Definition: The median/mean number of genes detected within a single cell. Biological Significance: This metric directly reflects the complexity of the single-cell transcriptome and the sequencing depth. A higher value generally indicates better single-cell data quality. Quality Interpretation: This value is highly dependent on cell type and sequencing depth. Cell types with low transcript content (such as blood cells) may have a lower value.
Total genes detected	Definition: The total number of genes detected in the entire sample, requiring each gene to have at least one UMI count in at least one cell. Biological Significance: Reflects the overall transcriptome complexity of the sample and whether the sequencing was comprehensive. Quality Interpretation: A low value may indicate insufficient sequencing depth or a uniform cell type in the sample.
Fraction reads in cells (Fraction of Reads in Cells)	Definition: The proportion of reads successfully assigned to high-quality cell IDs among all validly aligned reads (with valid barcodes/UMIs and confidently mapped to the transcriptome). Biological Significance: Reflects the efficiency of cell capture and the signal-to-noise ratio. Quality Interpretation: A low proportion may indicate poor sample quality (e.g., extensive cell fragmentation releasing free-floating RNA) or abnormalities in library construction.
Sequencing saturation	Definition: A metric to assess whether sequencing depth is sufficient, calculated as `1 - (number of deduplicated UMIs / total number of reads)`. Biological Significance: Reflects library complexity and the cost-effectiveness of sequencing. High saturation means that increasing sequencing depth yields diminishing returns in discovering new genes. Typical Range: A range of 40%–85% is considered ideal.
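
The two formulas quoted in the table can be written out directly. The input numbers below are invented for illustration, and the sketch takes "total number of reads" at face value; which reads the pipeline includes in that total is not specified here.

```python
def mean_reads_per_cell(total_raw_reads, estimated_cells):
    """Total number of raw sequencing reads / estimated number of cells."""
    return total_raw_reads / estimated_cells


def sequencing_saturation(deduplicated_umis, total_reads):
    """1 - (number of deduplicated UMIs / total number of reads)."""
    return 1 - deduplicated_umis / total_reads


# Invented example: 300 million reads, 10,000 cells, 150 million deduplicated UMIs
reads_per_cell = mean_reads_per_cell(300_000_000, 10_000)
saturation = sequencing_saturation(150_000_000, 300_000_000)
print(reads_per_cell)  # 30000.0
print(saturation)      # 0.5
```

A saturation of 0.5 means that, on average, every second read is a duplicate of a molecule that was already observed.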

-----------

🔬 Sequencing Metrics

🎯 Core Function: Basic quality assessment of sequencing data, including barcode identification rate, UMI quality, and sequencing accuracy.

📊 Quality Control Standards:

Note: The following standards are for reference only. Actual quality assessment should consider multiple factors such as tissue type, cell state, and experimental goals. Significant differences may exist between different samples, and it is recommended to make judgments based on the specific experimental context.

Metric Category	Recommended	Acceptable	Needs Improvement
Valid barcodes	≥ 80%	70–80%	< 70%
Valid UMIs	≥ 80%	70–80%	< 70%
Q30 Base Quality	≥ 85%	75–85%	< 75%

🔍 Detailed Metric Explanations:

Metric Name	Detailed Explanation and Technical Requirements
Number of reads (Total Number of Reads)	Definition: The total number of raw sequencing read pairs assigned to this sample. Significance: Represents the overall data volume of this sequencing run. In principle, more reads provide more comprehensive coverage of each cell's transcriptome.
Valid barcodes (Valid Barcode Fraction)	Definition: The proportion of all reads whose cell barcode can be matched to a preset whitelist (after error correction). Biological Significance: Reflects the effectiveness of cell labeling. Quality Interpretation: A very low proportion usually indicates sample quality issues that lead to barcode degradation and adapter contamination, or a high error rate during sequencing.
Valid UMIs (Valid UMI Fraction)	Definition: The proportion of all reads whose Unique Molecular Identifier (UMI) sequence does not contain 'N' bases and is not a homopolymer (e.g., AAAAAA). Biological Significance: Reflects the sequencing quality of the UMI sequence, which is key for accurate molecular counting.
Q30 bases in barcode/UMI/read (Q30 Base Fraction)	Definition: The proportion of bases with a sequencing quality score of Q30 or higher in the cell barcode, UMI, and RNA read sequences. Significance: Q30 corresponds to a base-calling error probability of 0.1% (1 in 1,000), so bases at Q30 or higher have an error probability of at most 0.1%. This metric directly affects the accuracy of cell identification, molecular counting, and gene alignment.

Note: All proportions above are calculated based on the total number of raw sequencing reads (Number of Reads), ensuring comparability and consistency across metrics.
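
For reference, a Phred score Q corresponds to an error probability of 10^(-Q/10). The sketch below shows that conversion and computes a Q30 fraction from a quality string, assuming the common Phred+33 ASCII encoding used in FASTQ and SAM; the example string is made up.

```python
def phred_to_error_probability(q):
    """Convert a Phred quality score to a base-calling error probability."""
    return 10 ** (-q / 10)


def q30_fraction(quality_string, offset=33):
    """Fraction of bases with Phred score >= 30 (Phred+33 encoding assumed)."""
    scores = [ord(char) - offset for char in quality_string]
    return sum(score >= 30 for score in scores) / len(scores)


print(phred_to_error_probability(30))  # approximately 0.001

# "F" = Q37, "?" = Q30, "5" = Q20, "#" = Q2
example_fraction = q30_fraction("FFFF##?5")
print(example_fraction)  # 5 of 8 bases reach Q30
```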

-----------

🗺️ Mapping Metrics

🎯 Core Function: To assess the quality of read alignment to the reference genome, including alignment rate, specificity, and genomic region distribution.

📊 Quality Control Standards:

Note: The following standards are for reference only. Actual quality assessment should consider multiple factors such as tissue type, cell state, and experimental goals. Significant differences may exist between different samples, and it is recommended to make judgments based on the specific experimental context.

Metric Name	Recommended	Acceptable	Needs Improvement
Reads mapped to genome	≥ 80%	50–80%	< 50%
Reads mapped confidently to transcriptome	≥ 50%	30–50%	< 30%
Reads mapped antisense to gene	< 10%	10–30%	> 30%

🔍 Detailed Metric Explanations:

Metric Name	Detailed Explanation and Technical Requirements
Reads mapped to genome (Genome Alignment Rate)	Definition: The proportion of all reads that successfully align to any location on the reference genome (including unique and multiple alignments). Quality Interpretation: A rate below 50% needs attention and may indicate sample contamination (e.g., bacteria) or a species mismatch.
Reads mapped confidently to genome (Confident Genome Alignment Rate)	Definition: The proportion of all reads that align with high quality (STAR MAPQ value of 255) to a unique location on the genome. Technical Detail: A multi-mapping read is rescued as a confident read in one specific case: when it aligns to both an exonic region and one or more non-exonic regions, the pipeline accepts and retains its alignment in the exonic region. Biological Significance: This forms the basis of valid data for gene expression quantification and regional analysis. A low proportion may be caused by repetitive sequences, poor sequence quality, or a mismatched reference genome.
Reads mapped confidently to transcriptome (Confident Transcriptome Alignment Rate)	Definition: The proportion of all reads that can be uniquely aligned with high confidence to a single gene (including exons and introns by default). Technical Detail: To ensure quantification accuracy, if a read's alignment region overlaps with multiple different genes, the read is considered of ambiguous origin and filtered out. Biological Significance: This is a core metric for assessing library quality and data reliability. A higher proportion means more valid data for downstream quantitative analysis and more reliable results.
Reads mapped confidently to exonic regions (Exonic Region Alignment Rate)	Definition: The proportion of reads confidently mapped to the genome that fall into annotated exonic regions. Technical Detail: A read is considered confidently mapped to an exonic region only if at least 50% of it falls within an exonic region. Biological Significance: Exonic reads derive mainly from mature mRNA, making this a core metric for assessing library quality. In standard whole-cell scRNA-seq, this proportion should be high.
Reads mapped confidently to intronic regions (Intronic Region Alignment Rate)	Definition: The proportion of reads confidently mapped to the genome that fall into annotated intronic regions. Technical Detail: A read is considered confidently mapped to an intronic region only if it does not meet the criteria for exonic region classification and intersects with an intronic region. Biological Significance: A high proportion usually indicates the capture of a large amount of unspliced pre-mRNA. This is expected in nuclear sequencing (snRNA-seq).
Reads mapped confidently to intergenic regions (Intergenic Region Alignment Rate)	Definition: The proportion of reads confidently mapped to the genome that do not fall into any annotated gene (including exons and introns). Quality Interpretation: An excessively high proportion may suggest incomplete gene annotation or non-specific amplification in the library.
Reads mapped antisense to gene (Antisense Alignment Rate)	Definition: The proportion of reads that successfully align to a gene region but in the opposite direction to the annotated gene. Quality Interpretation: An excessively high proportion may indicate directionality issues during library construction or the presence of unknown antisense transcripts.
Include introns	Definition: Controls whether reads aligned to intronic regions are included in gene expression counts. Enabled State (Default): When set to `True`, reads from intronic regions are counted towards the expression of the corresponding gene. This mode captures gene activity more comprehensively and is especially suitable for nuclear sequencing or scenarios requiring pre-mRNA analysis. Disabled State: When set to `False`, only exonic reads are counted towards gene expression. This mode focuses on the quantification of mature mRNA.

Note: All proportions above are calculated based on the total number of raw sequencing reads (Number of Reads), ensuring comparability and consistency across metrics.
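
The region classification rules and the effect of the Include introns setting can be summarized in a short sketch. It follows the rules stated in the table (at least 50% exonic overlap for exonic reads, otherwise intronic if the read intersects an intron) and reuses the E/N/I codes of the `RE` tag. The function names are hypothetical, and whether the pipeline implements the logic in exactly this way is an assumption.

```python
def classify_region(exonic_fraction, overlaps_intron):
    """Classify a confidently mapped read as exonic (E), intronic (N), or intergenic (I)."""
    if exonic_fraction >= 0.5:  # at least 50% of the read lies within an exon
        return "E"
    if overlaps_intron:         # not exonic, but intersects an intron
        return "N"
    return "I"                  # falls outside every annotated gene


def counts_toward_expression(region_type, include_introns=True):
    """Decide whether a read contributes to gene expression counts."""
    if region_type == "E":
        return True             # exonic reads always count
    if region_type == "N":
        return include_introns  # intronic reads count only when enabled
    return False                # intergenic reads never count


region = classify_region(exonic_fraction=0.3, overlaps_intron=True)
print(region)
print(counts_toward_expression(region, include_introns=True))
print(counts_toward_expression(region, include_introns=False))
```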

-----------

📈 Interactive Visualization Chart Interpretation

🎯 Core Function: Provides comprehensive data visualization analysis, from cell quality control to a complete display of downstream biological analysis.

📊 Visualization Chart Group One: Cell Quality Control Analysis

📊 Barcode Rank Plot

Chart Function:
This plot distinguishes high-quality real cells from background noise by ranking all barcodes by their UMI count.

How to Interpret:

Visual Encoding: 🔵 Blue line (valid cells) | ⬜ Gray line (background noise) | 🔷 Blue gradient area (mixed region)
Chart Axes Explained:
- X-axis: Barcode Rank - Sorted by total UMI count in descending order (log scale)
- Y-axis: UMI Counts - Total UMI count for each barcode (log scale)
- Interaction: Hover to display cell rank, UMI count, and the proportion of real cells in that segment
Quality Assessment Guide:
- Ideal Pattern: A clear "knee point" separates real cells from the background: UMI counts drop steeply at the boundary of the real-cell region and then flatten out in the background region.
- Abnormal Pattern: Lack of a clear knee point (cell concentration too low), or a gradual decline (background RNA too high).

-----------

📊 Droplet Beads Distribution

Chart Function:
Displays the distribution of the number of captured cell barcodes (Beads) in real cell droplets.

How to Interpret:

Theoretical Distribution: The distribution of beads in droplets theoretically follows a Poisson distribution, reflecting the statistical properties of the random capture process in the micro-reaction system.
Actual Influences: The final distribution is affected by experimental factors such as sequencing saturation, droplet size uniformity, and cell concentration.
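
As a point of comparison for the observed distribution, the theoretical Poisson probabilities can be computed as below. The mean of 1.0 beads per droplet is an arbitrary value chosen for illustration, not a platform specification, and the sketch does not model the restriction to droplets that contain a real cell.

```python
import math


def poisson_pmf(k, mean):
    """Probability of observing exactly k beads when the expected number is `mean`."""
    return mean ** k * math.exp(-mean) / math.factorial(k)


assumed_mean = 1.0  # hypothetical average number of beads per droplet
probabilities = [poisson_pmf(k, assumed_mean) for k in range(6)]
for k, probability in enumerate(probabilities):
    print(f"{k} beads: {probability:.3f}")
```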

-----------

📊 Cell Data Distribution

Chart Function:
Through three separate violin plots, it shows the distribution of high-quality cells across three key quality metrics: number of genes (nGenes), number of UMIs (nUMI), and mitochondrial gene percentage (percent.mt).

How to Interpret:

Number of Genes and UMIs: The higher the center of the distribution (the widest part), the higher the transcriptome complexity and capture efficiency of the cells.
Mitochondrial Gene Percentage: The distribution should be concentrated at a low percentage (usually < 10-20%). A high percentage may indicate cell apoptosis or stress.

📊 Visualization Chart Group Two: Downstream Biological Analysis

🎯 Core Function: A comprehensive display of cell clustering analysis, differential gene identification, cell type annotation, and sequencing depth assessment.

🌀 Cluster Analysis

Chart Function:
Using UMAP dimensionality reduction and the Louvain clustering algorithm, cells with similar gene expression patterns are grouped together in a 2D space, thereby identifying potential cell subpopulations.

How to Interpret:

Left Plot (Cell Type Clustering): Each point represents a cell, and different colors represent different cell clusters. Cells that are close in space have more similar gene expression profiles.
Right Plot (UMI Count Distribution): On the same UMAP space, a color gradient shows the total UMI count for each cell. This can be used to help assess the reliability of the clustering results, for example, whether certain clusters are composed of low-quality cells.

-----------

📈 Marker Genes Analysis

Chart Function:
Displays the characteristic differentially expressed genes for each cell cluster, used to identify and annotate different cell types.

How to Interpret:

Key Metrics Explained:
- P-val: The statistical significance p-value of differential expression. The smaller the value, the more significant the difference (Threshold: < 0.05 is significant, < 0.01 is highly significant).
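- p_val_adj: The p-value after Bonferroni correction for multiple testing, which controls the family-wise error rate, that is, the chance of reporting any false positive across all tested genes. Use the adjusted p-value for final screening.
- avg_log2FC: Average log2 fold change in expression between the target cluster and the other clusters. A value of 1 corresponds to a two-fold difference.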
- pct.1 / pct.2: The proportion of cells expressing the gene in the target cluster versus other clusters.
Interactive Features: Cluster filtering (select a specific cluster from the dropdown menu) | Gene search (use the search box to quickly locate gene expression).
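The screening described above can be sketched as follows. The example applies a Bonferroni correction (raw p-value multiplied by the number of tests, capped at 1) and then keeps genes that pass both a significance and a fold-change cutoff. The rows, the `p_val` column name, the number of tested genes, and the cutoffs are assumptions for illustration; check the header of your own marker.csv before adapting it.

```python
import csv
import io

# Hypothetical marker table; column names are assumed, values are invented.
marker_text = """gene,cluster,p_val,avg_log2FC,pct.1,pct.2
CD3E,0,1e-40,2.1,0.85,0.10
ACTB,0,0.03,0.2,0.99,0.98
MS4A1,1,2e-25,3.0,0.90,0.05
GENE_X,1,4e-5,1.2,0.30,0.25
"""


def bonferroni(p_value: float, n_tests: int) -> float:
    """Bonferroni-adjusted p-value: p * number of tests, capped at 1."""
    return min(1.0, p_value * n_tests)


n_tested_genes = 20000  # assumed number of genes tested per cluster
rows = list(csv.DictReader(io.StringIO(marker_text)))

selected = []
for row in rows:
    p_adj = bonferroni(float(row["p_val"]), n_tested_genes)
    # Keep genes that are significant after correction and at least 2-fold up.
    if p_adj < 0.05 and float(row["avg_log2FC"]) >= 1.0:
        selected.append({"gene": row["gene"], "cluster": row["cluster"], "p_val_adj": p_adj})

for marker in selected:
    print(marker)
```

Note that `GENE_X` has a small raw p-value but does not survive correction, which is why screening on the adjusted value is the safer default.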

-----------

🧬 Cell Type Annotation

Chart Function:
On the UMAP plot, each cluster is labeled with a cell type inferred from a reference database (e.g., scHCL, scMCA).

How to Interpret:

Annotation Result: Provides a possible cell type label for each cluster.
Species Support: Human (Homo sapiens) / Mouse (Mus musculus). Cell type annotation is not provided for other species.
Usage Suggestion: The automatic annotation results are for reference only. Their accuracy depends on the quality of the reference database and the similarity of the sample. It is recommended to manually verify and correct them in conjunction with marker genes.

-----------

📊 Sequencing Saturation Curve

Chart Function:
Assesses the adequacy of sequencing depth and data complexity, i.e., whether further increasing the sequencing volume can lead to the discovery of more new genes or UMIs.

How to Interpret:

Axes: The X-axis is the average number of sequencing reads per cell. The Y-axis is either the sequencing saturation or the median number of genes per cell, depending on the curve.
Curve Trend: If the curve tends to flatten, it indicates that sequencing is approaching saturation, and increasing sequencing depth will not contribute much to the discovery of new genes. If the curve is still rising rapidly, it indicates that increasing sequencing may still yield significant benefits.
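A curve of this kind is typically built by downsampling the reads and recomputing the metric at each depth. The sketch below assumes one commonly used definition of saturation, 1 minus the ratio of unique molecules to total reads, which may not match the pipeline's exact formula. The simulated library (integer molecule IDs standing in for barcode, UMI, and gene combinations) is hypothetical.

```python
import random


def saturation(reads):
    """1 - (unique molecules / total reads); 0.0 when there are no reads."""
    if not reads:
        return 0.0
    return 1.0 - len(set(reads)) / len(reads)


rng = random.Random(0)  # fixed seed so the example is reproducible

# Hypothetical library: 10,000 reads drawn from 2,000 distinct molecules.
reads = [rng.randrange(2000) for _ in range(10000)]

# Downsample to increasing fractions of the reads and record the saturation.
curve = []
for fraction in (0.1, 0.25, 0.5, 0.75, 1.0):
    n_reads = int(len(reads) * fraction)
    subsample = rng.sample(reads, n_reads)
    curve.append((n_reads, saturation(subsample)))

for n_reads, sat in curve:
    print(f"{n_reads:>6} reads: saturation {sat:.2f}")
```

When the last few points differ only slightly, additional reads mostly re-sequence molecules that were already observed, which corresponds to the flattening curve described above.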

🎯 More Resources

Document Type	Resource Link and Description
🚀 Quick Start	Quick Start Guide - A complete tutorial for your first analysis.
⚙️ Parameter Reference	Parameter Reference Manual - Detailed descriptions of all configurable parameters.
🔬 Analysis Pipeline	Analysis Pipeline Description - Technical details of the entire analysis pipeline.
🔧 Installation & Configuration	Installation & Configuration Guide - System requirements, installation steps, and environment configuration.

💡 Tip

This document is continuously updated. If you find any errors or need additional information, please provide feedback.

📝 Document Version: 3.0 beta | Last Updated: 2025

🔬 DNBelab C Series HT scRNA Analysis Software
A High-Performance Single-Cell RNA Sequencing Data Analysis Pipeline
