
🧬 DNBelab C Series HT scRNA Analysis Parameters

🔬 Main Analysis Pipeline (run) • 📊 Reference Database Construction (mkref) • 📋 Multi-sample Operations (multi)


🔬 Main Analysis Pipeline (run)

📊 Usage

$ dnbc4tools rna run -h
usage: dnbc4tools rna run [OPTIONS]

optional arguments:
  -h, --help            show this help message and exit

Input Files:
  Choose ONE input method: either --fastqs (directory) OR all four individual FASTQ files (-c1, -c2, -i1, -i2).

  --fastqs <DIR>        Directory containing cDNA and oligo FASTQ subfolders (e.g., cDNA/sample_cdna_R1.fastq.gz, oligo/sample_oligo_R1.fastq.gz). The pipeline automatically detects paired-end files. Example: ./fastq_dir
  -c1, --cDNAfastq1 <FILE> [<FILE> ...]
                        Read1 FASTQ file(s) for cDNA (supports wildcards and comma-separated lists). Used for gene expression data. Example: sample1_R1.fastq.gz,sample2_R1.fastq.gz
  -c2, --cDNAfastq2 <FILE> [<FILE> ...]
                        Read2 FASTQ file(s) for cDNA (supports wildcards and comma-separated lists). Must match --cDNAfastq1 order. Example: sample1_R2.fastq.gz,sample2_R2.fastq.gz
  -i1, --oligofastq1 <FILE> [<FILE> ...]
                        Read1 FASTQ file(s) for oligo (supports wildcards and comma-separated lists). Used for barcode merging. Example: sample1_oligo_R1.fastq.gz
  -i2, --oligofastq2 <FILE> [<FILE> ...]
                        Read2 FASTQ file(s) for oligo (supports wildcards and comma-separated lists). Must match --oligofastq1 order. Example: sample1_oligo_R2.fastq.gz

Basic Settings:
  -n, --name <STR>      Unique identifier for the sample (e.g., sample1). Used for naming output files and reports.
  -g, --genomeDir <DIR>
                        Path to reference genome directory containing STAR index files. Example: ./genome_index
  -o, --outdir <DIR>    Output directory for results and reports [default: current directory]. Example: ./output
  -t, --threads <INT>   Number of CPU threads for parallel processing [default: all available cores] (e.g., 16).

Filtering Settings:
  --calling_method <STR>
                        Cell detection method [default: emptydrops]. Options: barcoderanks, emptydrops.
  --expectcells <INT>   Expected number of cells to guide detection [default: auto] (e.g., 3000).
  --forcecells <INT>    Force pipeline to use exactly this number of cells, overriding detection (e.g., 5000).
  --minumi <INT>        Minimum UMI count per cell to retain [default: 1000].

Library Settings:
  Configure sequencing library settings for barcode, UMI, and read structure.
  Auto-detection is recommended for chemistry and dark cycles.
  Use --customize twice for cDNA and oligo patterns, e.g., 
  --customize "cb,R1:1-10;cb,R1:11-20;umi,R1:21-30;R1,R2:1-100" --customize "cb,R1:1-10;cb,R1:11-20;R1,R2:1-30".

  --chemistry <STR>     Library chemistry version [default: auto]. Options: scRNAv1HT, scRNAv2HT, scRNAv3HT, scRNA5Pv1, auto (automatic detection).
  --darkreaction <STR>  Dark cycle setting for cDNA and oligo libraries [default: auto]. Provide two comma-separated values: <cDNA>,<oligo> Each field options: auto (automatic detection), R1R2 (both reads), R1 (Read1 only), unset (no
                        dark cycles). Examples: R1,R1R2; R1,R1; unset,unset.
  --customize <STR>     Custom read structure for barcode, UMI, or sequence extraction, format: <type>,<read>:<start>-<end> separated by ';'. Types: cb (cell barcode), umi (UMI) R1/R2 (sequence). Examples:
                        "cb,R1:1-10;cb,R1:11-20;umi,R1:21-30;R1,R2:1-100"

Analysis Settings:
  --no_introns          Exclude intronic reads from the expression matrix to increase specificity.
  --end5                Enable 5'-end scRNA-seq analysis for 5' gene expression profiling.
  --no_bam              Skip BAM file generation to save time and disk space.
  --sample_read_pairs <INT>
                        Subsample this number of cDNA read pairs for analysis (e.g., 1000000).

📝 Parameter Description

🔴 Required Parameters

⚠️ Essential parameters that must be specified for a successful analysis

-n, --name (Required)

Provide a unique name for this analysis run.

Default: None

Example:

--name sample_001

-g, --genomeDir (Required)

Specify the path to the reference genome directory, which must contain the STAR index files.

Default: None

Example:

--genomeDir /path/to/genome/database

🟢 Input File Parameters

Choose one input method: directory-based OR individual files

--fastqs (Method 1)

Specify the directory that holds the cDNA and oligo FASTQ subfolders. Paired-end files are detected automatically.

Default: None

Example:

--fastqs ./fastq_directory

-c1, --cDNAfastq1 (Method 2A)

Specify one or more cDNA Read1 FASTQ files. Separate multiple files with commas; wildcards are also supported.

Default: None

Example:

--cDNAfastq1 sample_cDNA_L01_R1.fastq.gz,sample_cDNA_L02_R1.fastq.gz

-c2, --cDNAfastq2 (Method 2B)

Specify one or more cDNA Read2 FASTQ files, listed in the same order as --cDNAfastq1.

Default: None

Example:

--cDNAfastq2 sample_cDNA_L01_R2.fastq.gz,sample_cDNA_L02_R2.fastq.gz

-i1, --oligofastq1 (Method 2C)

Specify one or more oligo Read1 FASTQ files. Separate multiple files with commas; wildcards are also supported.

Default: None

Example:

--oligofastq1 sample_oligo_R1.fastq.gz

-i2, --oligofastq2 (Method 2D)

Specify one or more oligo Read2 FASTQ files, listed in the same order as --oligofastq1.

Default: None

Example:

--oligofastq2 sample_oligo_R2.fastq.gz

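⚠️ Input Method Selection: Use either --fastqs (Method 1) or all four individual file parameters -c1, -c2, -i1, and -i2 (Method 2), not both.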

⚠️ Important Note: All files passed to a single parameter must come from the same library and share the same sequencing mode and dark reaction settings. Data from different libraries cannot be merged for analysis.
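
The input rules above can be checked before launching a run. The following is a minimal Python sketch, not part of dnbc4tools: the helper name build_input_args and its keyword arguments are hypothetical, and only the option names themselves come from the help text. It enforces the "one method only" rule and the requirement that Read1 and Read2 lists pair up by position.

```python
from typing import List, Optional


def build_input_args(
    fastqs: Optional[str] = None,
    cdna_r1: Optional[List[str]] = None,
    cdna_r2: Optional[List[str]] = None,
    oligo_r1: Optional[List[str]] = None,
    oligo_r2: Optional[List[str]] = None,
) -> List[str]:
    """Return the input-file part of a `dnbc4tools rna run` argument list.

    Hypothetical helper: it only mirrors the rules stated in the help text.
    """
    individual = [cdna_r1, cdna_r2, oligo_r1, oligo_r2]
    any_individual = any(files for files in individual)

    if fastqs and any_individual:
        raise ValueError("use either --fastqs or the four individual options, not both")
    if fastqs:
        return ["--fastqs", fastqs]
    if not all(files for files in individual):
        raise ValueError("provide --fastqs, or all of -c1, -c2, -i1 and -i2")
    # Read1 and Read2 files are paired by position, so the lists must be equally long.
    if len(cdna_r1) != len(cdna_r2):
        raise ValueError("cDNA Read1/Read2 lists differ in length")
    if len(oligo_r1) != len(oligo_r2):
        raise ValueError("oligo Read1/Read2 lists differ in length")
    return [
        "--cDNAfastq1", ",".join(cdna_r1),
        "--cDNAfastq2", ",".join(cdna_r2),
        "--oligofastq1", ",".join(oligo_r1),
        "--oligofastq2", ",".join(oligo_r2),
    ]
```

The returned list can be appended to the rest of the command (for example after --name and --genomeDir) and passed to subprocess.run. The sketch checks only list lengths; it cannot verify that the files really belong to the same library.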


🟢 Basic Settings

-o, --outdir (Optional)

Specify the output directory for all analysis results and reports.

Default: ./ (current directory)

Example:

--outdir ./output_results

-t, --threads (Optional)

Set the number of CPU threads to be used during the analysis.

Default: Use all available CPU cores

Example:

--threads 16

🟢 Filtering Settings

--calling_method (Optional)

Set the cell identification method to distinguish real cells from empty droplets.

Default: emptydrops

Example:

# Switch to barcoderanks for cell identification
dnbc4tools rna run --name sample1 --fastqs ./fq --genomeDir ./ref --calling_method barcoderanks

--expectcells (Optional)

Set the expected number of recovered cells.

Default: auto

Example:

# Expect to recover 3000 cells
dnbc4tools rna run --name sample1 --fastqs ./fq --genomeDir ./ref --expectcells 3000

--forcecells (Optional)

Force the pipeline to use an exact number of cells, overriding the software's automatic cell detection.

Default: None

Example:

# Force the output of 5000 cells for analysis
dnbc4tools rna run --name sample1 --fastqs ./fq --genomeDir ./ref --forcecells 5000

--minumi (Optional)

Set the minimum UMI count to retain a cell.

Default: 1000

Example:

# Lower the UMI threshold for cell filtering to 500
dnbc4tools rna run --name sample1 --fastqs ./fq --genomeDir ./ref --minumi 500

Note

💡 Cell Identification Analysis Recommendations

Cell identification is a critical step in single-cell analysis. Correct parameter settings and result interpretation directly impact the quality and reliability of subsequent analyses.

Diagnostics & Strategies

1. Abnormal Cell Count

Cell count too low
Symptom: Detected cells < 50% of expected.
Cause: UMI threshold too high, severe empty droplet contamination, poor library quality.
Solution: Lower --minumi, adjust --expectcells, check raw data quality.

Cell count too high
Symptom: Detected cells > 200% of expected.
Cause: Inaccurate cell counting, UMI threshold too low, high background noise.
Solution: Increase --minumi, use --forcecells to limit the count.

Abnormal UMI distribution
Symptom: UMI rank plot shows no clear "knee point".
Cause: Insufficient sequencing depth, poor library diversity, technical failure.
Solution: Increase sequencing depth, rebuild the library.

2. Abnormal Cell Identification Curve

Gradual decline with no knee point
Meaning: Difficult to distinguish between real cells and background empty droplets.
Solution: Use --forcecells to set a conservative cell count and combine with downstream QC.

Multiple knee points
Meaning: Presence of different cell populations or doublet contamination.
Solution: Choose the cell count corresponding to the main knee point and perform doublet detection and removal later.

Steep decline
Meaning: High-quality cells are clearly distinguished from the background, which is the ideal case.
Solution: Use the default emptydrops algorithm; you can consider lowering --minumi slightly.

Severe noise fluctuation
Meaning: High technical noise, poor data quality.
Solution: Increase the --minumi threshold, consider re-sequencing or optimizing experimental conditions.
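
The cell-count thresholds in section 1 (below 50% or above 200% of the expected count) are simple enough to script. The sketch below is illustrative only: the function name and the SUGGESTIONS table are hypothetical, the detected count is assumed to be read by hand from the HTML report, and the suggestion strings repeat the guidance above.

```python
# Suggested follow-ups, copied from the diagnostics list above.
SUGGESTIONS = {
    "too_low": "Lower --minumi, adjust --expectcells, check raw data quality.",
    "too_high": "Increase --minumi, or use --forcecells to limit the count.",
    "ok": "No cell-count adjustment suggested by this check.",
}


def classify_cell_count(detected: int, expected: int) -> str:
    """Classify a detected cell count against the expected count.

    Uses the thresholds from the diagnostics list: below 50% of the
    expected count is "too_low", above 200% is "too_high".
    """
    if expected <= 0:
        raise ValueError("expected must be a positive number of cells")
    ratio = detected / expected
    if ratio < 0.5:
        return "too_low"
    if ratio > 2.0:
        return "too_high"
    return "ok"


status = classify_cell_count(detected=1200, expected=3000)
print(status, "->", SUGGESTIONS[status])
```

This check covers only the count-based symptoms. The curve-shape cases in section 2 still require inspecting the UMI rank plot.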


Best Practice Tip

For the initial analysis, it is recommended to use the default parameters to get a preliminary result, then make targeted parameter adjustments based on the statistics and visualizations in the HTML report.


🟢 Library Settings

--chemistry (Optional)

Configure the chemistry version of the scRNA kit, which determines the sequence structure of barcodes and UMIs.

Default: auto

Example:

# Scenario: Library is known to be scRNAv2HT and auto-analysis failed
dnbc4tools rna run --name sample2 --fastqs ./fq --genomeDir ./ref --chemistry scRNAv2HT

⚠️ Important Note: Incorrect settings may lead to cell barcode identification failure. Specify manually only if you know the library structure or if auto-detection fails.

--darkreaction (Optional)

Configure the dark cycle settings for the cDNA and oligo libraries.

Default: auto

Examples:

# Example 1: cDNA library has dark cycle on R1, oligo library has dark cycles on both ends
--darkreaction R1,R1R2
# Example 2: Both libraries have dark cycles on R1 only
--darkreaction R1,R1
# Example 3: Neither library has dark cycles
--darkreaction unset,unset

⚠️ Important Note: Incorrect settings may lead to cell barcode identification failure. Specify manually only if you know the library structure or if auto-detection fails.
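
Because a mistyped --darkreaction value can break barcode identification, it may help to validate the string before starting a run. This is a hedged sketch, not dnbc4tools code: parse_darkreaction is a hypothetical helper, the allowed field values are taken from the help text, and treating a bare "auto" as "auto for both libraries" is an assumption based on the documented default.

```python
from typing import Tuple

# Field values listed in the --darkreaction help text.
ALLOWED_FIELDS = {"auto", "R1R2", "R1", "unset"}


def parse_darkreaction(value: str) -> Tuple[str, str]:
    """Split a --darkreaction value into (cDNA, oligo) settings."""
    # Assumption: the bare default "auto" applies to both libraries.
    if value == "auto":
        return ("auto", "auto")
    fields = value.split(",")
    if len(fields) != 2:
        raise ValueError("expected two comma-separated values: <cDNA>,<oligo>")
    for field in fields:
        if field not in ALLOWED_FIELDS:
            raise ValueError(f"unsupported dark cycle setting: {field!r}")
    return (fields[0], fields[1])


print(parse_darkreaction("R1,R1R2"))
```

The check is case-sensitive and rejects surrounding whitespace, which is stricter than the tool itself may be.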

--customize (Advanced)

Precisely define the extraction structure for barcodes, UMIs, and effective sequences (reads) for non-standard libraries. This is an advanced feature that overrides --chemistry and --darkreaction settings.

Examples:

# For a cDNA library with structure: Barcode 1(1-10bp) + Barcode 2(11-20bp) + UMI(21-30bp) in R1; sequence(1-100bp) in R2
--customize "cb,R1:1-10;cb,R1:11-20;umi,R1:21-30;R1,R2:1-100"
# For a cDNA library with structure: Barcode 1(7-16bp) + Barcode 2(23-32bp) + UMI(38-47bp) in R1; sequence(1-100bp) in R2
--customize "cb,R1:7-16;cb,R1:23-32;umi,R1:38-47;R1,R2:1-100"
# For a 5'-end transcript cDNA library using data from both ends
--customize "cb,R1:1-10;cb,R1:11-20;umi,R1:21-30;R1,R1:31-120;R2,R2:1-150"
# Example: Custom sequence structures for cDNA and oligo libraries respectively
--customize "cb,R1:1-10;cb,R1:11-20;umi,R1:21-30;R1,R2:1-100" --customize "cb,R1:1-10;cb,R1:11-20;R1,R2:1-30"

⚠️ Risk Warning: Incorrect custom configurations can lead to data loss or analysis failure. Use only when standard configurations do not meet your needs.
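
Given the risk noted above, a --customize string can be sanity-checked before use. The following sketch parses the documented `<type>,<read>:<start>-<end>` format. The names Segment, parse_customize, and total_length are hypothetical, and the sketch assumes positions are 1-based and inclusive, which matches the "1-10bp" wording in the examples but is not confirmed by the help text.

```python
import re
from typing import List, NamedTuple


class Segment(NamedTuple):
    kind: str   # "cb", "umi", "R1" or "R2"
    read: str   # read the segment is taken from: "R1" or "R2"
    start: int  # assumed 1-based, inclusive
    end: int    # assumed inclusive


_SEGMENT = re.compile(r"^(cb|umi|R1|R2),(R1|R2):(\d+)-(\d+)$")


def parse_customize(spec: str) -> List[Segment]:
    """Parse one --customize value into a list of segments."""
    segments = []
    for item in spec.split(";"):
        match = _SEGMENT.match(item.strip())
        if match is None:
            raise ValueError(f"malformed segment: {item!r}")
        start, end = int(match.group(3)), int(match.group(4))
        if start < 1 or end < start:
            raise ValueError(f"invalid range in segment: {item!r}")
        segments.append(Segment(match.group(1), match.group(2), start, end))
    return segments


def total_length(segments: List[Segment], kind: str) -> int:
    """Sum the lengths of all segments of one kind (inclusive ranges)."""
    return sum(s.end - s.start + 1 for s in segments if s.kind == kind)


segs = parse_customize("cb,R1:7-16;cb,R1:23-32;umi,R1:38-47;R1,R2:1-100")
print(total_length(segs, "cb"), total_length(segs, "umi"))
```

For the second example above, this reports a combined barcode length of 20 and a UMI length of 10, which is a quick way to confirm that the coordinates match the intended library structure.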


🚩 Analysis Settings

--no_introns (Flag)

Enable this parameter to filter out reads from intronic regions during analysis.

Default: If not set, reads from intronic regions are included.

--end5 (Flag)

Enable 5'-end single-cell transcriptome data analysis mode.

Default: Not set.

--no_bam (Flag)

Enable this parameter to skip the generation of BAM files.

Default: If not set, BAM files are generated.

--sample_read_pairs (Optional)

Extract a specified number of read pairs from the input cDNA FASTQ files for analysis.

Default: None (uses all data)

Example:

--sample_read_pairs 100000000

💡 Analysis Recommendation

For the initial analysis, it is recommended to use the default parameters and then adjust them as needed based on the results report.


📊 Reference Database Construction (mkref)

📊 Usage

$ dnbc4tools rna mkref -h
usage: dnbc4tools rna mkref [-h] 

optional arguments:
  -h, --help          show this help message and exit

Input Files:
  Input genome FASTA files and gene annotation GTF files. For mixed species analysis, separate multiple files with commas.

  --fasta <FILE>      Reference genome FASTA file path(s). Separate multiple files with commas
  --ingtf <FILE>      Gene annotation GTF file path(s). Separate multiple files with commas

Basic Settings:
  --genomeDir <DIR>   Output directory for generated reference files [default: current directory]
  --species <STR>     Species identifier(s). Use commas for mixed species analysis [default: undefined]
  --threads <INT>     Number of CPU threads for parallel processing [default: 10]

Advanced Settings:
  Advanced configuration options for reference genome building.
  Use these settings to customize STAR indexing behavior and resource usage.
  Parameters in extra-args will override default parameters if conflicts exist.
  Can be a space-separated string of parameters (e.g., "--sjdbOverhang 100 --runThreadN 16").

  --chrM <STR>        Mitochondrial chromosome identifier in reference genome [default: auto]
  --limitram <INT>    Maximum RAM (GB) allowed for index generation
  --extra-args <STR>  Additional STAR parameters to pass directly to STAR index generation
  --noindex           Skip STAR index generation step

📝 Parameter Description

🔴 Required Parameters

--fasta (Required)

Provide the reference genome sequence file.

Default: None

Example:

--fasta Homo_sapiens.GRCh38.dna.primary_assembly.fa

--ingtf (Required)

Provide the gene structure annotation file.

Default: None

Example:

--ingtf Homo_sapiens.GRCh38.108.gtf

Note

Dual-Species Analysis Configuration

For dual-species analysis, both --fasta and --ingtf parameters support providing file paths for two species, separated by commas.
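
As an illustration of the comma-separated form, the sketch below assembles a dual-species mkref command line in Python. The helper mkref_command is hypothetical, the file names are placeholders, and requiring one species name per FASTA/GTF pair is an assumption of this sketch rather than a documented rule. Only the option names come from the help text.

```python
import shlex
from typing import List, Optional


def mkref_command(
    fastas: List[str],
    gtfs: List[str],
    species: List[str],
    genome_dir: Optional[str] = None,
) -> str:
    """Build a `dnbc4tools rna mkref` command string (hypothetical helper)."""
    # Assumption: one FASTA, one GTF and one species name per genome, in matching order.
    if not (len(fastas) == len(gtfs) == len(species)):
        raise ValueError("fastas, gtfs and species must have the same length")
    argv = [
        "dnbc4tools", "rna", "mkref",
        "--fasta", ",".join(fastas),
        "--ingtf", ",".join(gtfs),
        "--species", ",".join(species),
    ]
    if genome_dir:
        argv += ["--genomeDir", genome_dir]
    # shlex.join quotes any argument that needs it for a POSIX shell.
    return shlex.join(argv)


print(mkref_command(
    ["hg38_genome.fa", "mm10_genome.fa"],
    ["hg38_genes.gtf", "mm10_genes.gtf"],
    ["hg38", "mm10"],
    genome_dir="/database/scRNA/hg38_and_mm10",
))
```

Building the command as a list first keeps paths with spaces intact; the same list (without shlex.join) can be passed directly to subprocess.run.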


🟢 Settings

--genomeDir (Optional)

Specify the output directory for the generated reference database.

Default: ./ (current directory)

Example:

dnbc4tools rna mkref --fasta genome.fa --ingtf genes.gtf --genomeDir /database/scRNA/GRCh38

--species (Optional)

Specify one or more species names for the reference database.

Default: undefined

Examples:

# Single species
--species Homo_sapiens
# Dual species (human + mouse)
--species hg38,mm10

--threads (Optional)

Set the number of CPU threads to be used during STAR index construction.

Default: 10

Example:

--threads 16

🟢 Advanced Settings

--chrM (Optional)

Specify the name of the mitochondrial chromosome.

Default: auto

Example:

# If the mitochondrial chromosome name is "mitochondrion"
dnbc4tools rna mkref --fasta genome.fa --ingtf genes.gtf --chrM mitochondrion

--limitram (Optional)

Set the maximum available memory (in GB) for the STAR genome index generation process.

Default: None

Example:

--limitram 64

--extra-args (Advanced)

Pass additional command-line arguments directly to STAR index generation.

Default: None

Example:

--extra-args "--sjdbOverhang 99 --runThreadN 20"

--noindex (Flag)

When this flag is set, mkref generates only the configuration file and skips the STAR index generation step.

Default: Not set

Example:

# Generate only the configuration file, do not build the index
dnbc4tools rna mkref --fasta genome.fa --ingtf genes.gtf --noindex

Tip

📋 Database Construction Technical Notes:

📋 Single-Species ref.json File Example:

{
    "chrmt": "chrM",
    "genome": "/database/scRNA/Homo_sapiens/fasta/genome.fa",
    "genomeDir": "/database/scRNA/Homo_sapiens/star",
    "gtf": "/database/scRNA/Homo_sapiens/genes/genes.gtf",
    "input_fasta_files": [
        "genome.fa"
    ],
    "input_gtf_files": [
        "genes.gtf"
    ],
    "mtgenes": "/database/scRNA/Homo_sapiens/star/mtgene.list",
    "species": "Homo_sapiens",
    "version": "dnbc4tools 3.0beta"
}

📋 Dual-Species ref.json File Example:

{
    "chrmt": "hg38_chrM,mm10_chrM",
    "genome": "/database/scRNA/hg38_and_mm10/fasta/genome.fa",
    "genomeDir": "/database/scRNA/hg38_and_mm10/star",
    "gtf": "/database/scRNA/hg38_and_mm10/genes/genes.gtf",
    "input_fasta_files": [
        "hg38_genome.fa",
        "mm10_genome.fa"
    ],
    "input_gtf_files": [
        "hg38_genes.gtf",
        "mm10_genes.gtf"
    ],
    "mtgenes": "/database/scRNA/hg38_and_mm10/star/mtgene.list",
    "species": "hg38_and_mm10",
    "version": "dnbc4tools 3.0beta"
}
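
The two ref.json examples can be inspected programmatically, for instance to confirm which mitochondrial contigs a database was built with. This is a minimal sketch that assumes the key names shown in the examples above ("chrmt", "species", "genomeDir", "input_fasta_files"); the actual schema may differ between dnbc4tools versions, and summarize_ref is a hypothetical helper.

```python
import json


def summarize_ref(ref_text: str) -> dict:
    """Summarize a ref.json document given as a JSON string."""
    ref = json.loads(ref_text)
    return {
        "species": ref["species"],
        # "chrmt" holds one name per genome, comma-separated for mixed species.
        "mito_contigs": ref["chrmt"].split(","),
        "star_index": ref["genomeDir"],
        "n_input_genomes": len(ref["input_fasta_files"]),
    }


example = """
{
    "chrmt": "hg38_chrM,mm10_chrM",
    "genomeDir": "/database/scRNA/hg38_and_mm10/star",
    "input_fasta_files": ["hg38_genome.fa", "mm10_genome.fa"],
    "species": "hg38_and_mm10"
}
"""
print(summarize_ref(example))
```

To read a file on disk, pass the result of pathlib.Path(path).read_text() as ref_text.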

📋 Performance Optimization Recommendations:


📋 Multi-sample Operations (multi)

📊 Usage

$ dnbc4tools rna multi -h
usage: dnbc4tools rna multi [-h] 

optional arguments:
  -h, --help            show this help message and exit
  --list <LIST>         Path to the sample list file. Each line should contain sample name, cDNA FASTQ paths, and oligo FASTQ paths.
  --genomeDir <DATABASE>
                        Path to the directory containing genome files.
  --outdir <OUTDIR>     Output directory. [default: current directory].
  --threads <CORENUM>   Number of threads used for analysis. [default: 20].
  --end5                Perform 5'-end single-cell transcriptome analysis.

📝 Parameter Description

🔴 Required Parameters

--list (Required)

Specify the path to the list file containing information for multiple samples.

Default: None

Example:
# Example 1: SampleA, with 1 pair of R1/R2 files for cDNA and oligo each
SampleA	/path/to/A_cDNA_R1.fq.gz;/path/to/A_cDNA_R2.fq.gz	/path/to/A_oligo_R1.fq.gz;/path/to/A_oligo_R2.fq.gz
# Example 2: SampleB, with 2 pairs of R1/R2 files for cDNA, and 1 pair for oligo
SampleB	/path/to/B_cDNA_L01_R1.fq.gz,/path/to/B_cDNA_L02_R1.fq.gz;/path/to/B_cDNA_L01_R2.fq.gz,/path/to/B_cDNA_L02_R2.fq.gz	/path/to/B_oligo_R1.fq.gz;/path/to/B_oligo_R2.fq.gz
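
The list format is easy to get wrong by hand, so the lines can be generated instead. The sketch below reproduces the layout seen in the two examples: three columns, with Read1 and Read2 groups separated by ';' and multiple files within a group separated by ','. It assumes the columns are tab-separated, as they appear in the examples, and sample_list_line is a hypothetical helper.

```python
from typing import List


def sample_list_line(
    name: str,
    cdna_r1: List[str],
    cdna_r2: List[str],
    oligo_r1: List[str],
    oligo_r2: List[str],
) -> str:
    """Format one line of the --list file for `dnbc4tools rna multi`."""
    if len(cdna_r1) != len(cdna_r2) or len(oligo_r1) != len(oligo_r2):
        raise ValueError("Read1 and Read2 lists must have the same length")
    cdna = ",".join(cdna_r1) + ";" + ",".join(cdna_r2)
    oligo = ",".join(oligo_r1) + ";" + ",".join(oligo_r2)
    # Assumption: columns are separated by tabs, as in the examples above.
    return "\t".join([name, cdna, oligo])


line = sample_list_line(
    "SampleB",
    ["/path/to/B_cDNA_L01_R1.fq.gz", "/path/to/B_cDNA_L02_R1.fq.gz"],
    ["/path/to/B_cDNA_L01_R2.fq.gz", "/path/to/B_cDNA_L02_R2.fq.gz"],
    ["/path/to/B_oligo_R1.fq.gz"],
    ["/path/to/B_oligo_R2.fq.gz"],
)
print(line)
```

Writing one such line per sample to a text file yields a list that can be passed to --list.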

📝 Parameter Inheritance Note

For other analysis parameter settings, please refer to the corresponding parameters of the dnbc4tools rna run command. All samples should use the same reference database.


💡 Tip

This document is continuously updated. If you find any errors or have information to add, your feedback is welcome.

📝 Document Version: 3.0 beta | Last Updated: 2025


🧬 DNBelab C Series HT scRNA Analysis Software
High-performance single-cell transcriptome data analysis pipeline