Main Analysis Pipeline (run) • Reference Database Construction (mkref) • Multi-sample Operations (multi)
$ dnbc4tools rna run -h
usage: dnbc4tools rna run [OPTIONS]
optional arguments:
-h, --help show this help message and exit
Input Files:
Choose ONE input method: either --fastqs (directory) OR all four individual FASTQ files (-c1, -c2, -i1, -i2).
--fastqs <DIR> Directory containing cDNA and oligo FASTQ subfolders (e.g., cDNA/sample_cdna_R1.fastq.gz, oligo/sample_oligo_R1.fastq.gz). The pipeline automatically detects paired-end files. Example: ./fastq_dir
-c1, --cDNAfastq1 <FILE> [<FILE> ...]
Read1 FASTQ file(s) for cDNA (supports wildcards and comma-separated lists). Used for gene expression data. Example: sample1_R1.fastq.gz,sample2_R1.fastq.gz
-c2, --cDNAfastq2 <FILE> [<FILE> ...]
Read2 FASTQ file(s) for cDNA (supports wildcards and comma-separated lists). Must match --cDNAfastq1 order. Example: sample1_R2.fastq.gz,sample2_R2.fastq.gz
-i1, --oligofastq1 <FILE> [<FILE> ...]
Read1 FASTQ file(s) for oligo (supports wildcards and comma-separated lists). Used for barcode merging. Example: sample1_oligo_R1.fastq.gz
-i2, --oligofastq2 <FILE> [<FILE> ...]
Read2 FASTQ file(s) for oligo (supports wildcards and comma-separated lists). Must match --oligofastq1 order. Example: sample1_oligo_R2.fastq.gz
Basic Settings:
-n, --name <STR> Unique identifier for the sample (e.g., sample1). Used for naming output files and reports.
-g, --genomeDir <DIR>
Path to reference genome directory containing STAR index files. Example: ./genome_index
-o, --outdir <DIR> Output directory for results and reports [default: current directory]. Example: ./output
-t, --threads <INT> Number of CPU threads for parallel processing [default: all available cores] (e.g., 16).
Filtering Settings:
--calling_method <STR>
Cell detection method [default: emptydrops]. Options: barcoderanks, emptydrops.
--expectcells <INT> Expected number of cells to guide detection [default: auto] (e.g., 3000).
--forcecells <INT> Force pipeline to use exactly this number of cells, overriding detection (e.g., 5000).
--minumi <INT> Minimum UMI count per cell to retain [default: 1000].
Library Settings:
Configure sequencing library settings for barcode, UMI, and read structure.
Auto-detection is recommended for chemistry and dark cycles.
Use --customize twice for cDNA and oligo patterns, e.g.,
--customize "cb,R1:1-10;cb,R1:11-20;umi,R1:21-30;R1,R2:1-100" --customize "cb,R1:1-10;cb,R1:11-20;R1,R2:1-30".
--chemistry <STR> Library chemistry version [default: auto]. Options: scRNAv1HT, scRNAv2HT, scRNAv3HT, scRNA5Pv1, auto (automatic detection).
--darkreaction <STR> Dark cycle setting for cDNA and oligo libraries [default: auto]. Provide two comma-separated values: <cDNA>,<oligo> Each field options: auto (automatic detection), R1R2 (both reads), R1 (Read1 only), unset (no
dark cycles). Examples: R1,R1R2; R1,R1; unset,unset.
--customize <STR> Custom read structure for barcode, UMI, or sequence extraction, format: <type>,<read>:<start>-<end> separated by ';'. Types: cb (cell barcode), umi (UMI) R1/R2 (sequence). Examples:
"cb,R1:1-10;cb,R1:11-20;umi,R1:21-30;R1,R2:1-100"
Analysis Settings:
--no_introns Exclude intronic reads from the expression matrix to increase specificity.
--end5 Enable 5'-end scRNA-seq analysis for 5' gene expression profiling.
--no_bam Skip BAM file generation to save time and disk space.
--sample_read_pairs <INT>
Subsample this number of cDNA read pairs for analysis (e.g., 1000000).
⚠️ Essential parameters that must be specified for a successful analysis
-n, --name (Required) Provide a unique name for this analysis run.
Default: None
Example:
--name sample_001
-g, --genomeDir (Required) Specify the path to the reference genome directory.
Build this directory with the mkref command.
Default: None
Example:
--genomeDir /path/to/genome/database
Choose one input method: Directory-based OR specify individual files
--fastqs (Method 1) Specify the path to the directory containing all FASTQ files.
Do not combine with --cDNAfastq1 / --cDNAfastq2 / --oligofastq1 / --oligofastq2.
Default: None
Example:
--fastqs ./fastq_directory
-c1, --cDNAfastq1 (Method 2A) Specify one or more cDNA Read1 FASTQ files individually.
Use a wildcard (*) to match files or a comma-separated list for multiple files.
Must be paired with the --cDNAfastq2 parameter, and the file order must match exactly.
Default: None
Example:
--cDNAfastq1 sample_cDNA_L01_R1.fastq.gz,sample_cDNA_L02_R1.fastq.gz
-c2, --cDNAfastq2 (Method 2B) Specify one or more cDNA Read2 FASTQ files individually.
Use a wildcard (*) to match files or a comma-separated list for multiple files.
Must be paired with the --cDNAfastq1 parameter, and the file order must match exactly.
Default: None
Example:
--cDNAfastq2 sample_cDNA_L01_R2.fastq.gz,sample_cDNA_L02_R2.fastq.gz
-i1, --oligofastq1 (Method 2C) Specify one or more oligo Read1 FASTQ files individually.
Use a wildcard (*) to match files or a comma-separated list for multiple files.
Must be paired with the --oligofastq2 parameter, and the file order must match exactly.
Default: None
Example:
--oligofastq1 sample_oligo_R1.fastq.gz
-i2, --oligofastq2 (Method 2D) Specify one or more oligo Read2 FASTQ files individually.
Use a wildcard (*) to match files or a comma-separated list for multiple files.
Must be paired with the --oligofastq1 parameter, and the file order must match exactly.
Default: None
Example:
--oligofastq2 sample_oligo_R2.fastq.gz
⚠️ Input Method Selection:
- 🔸 Method 1: Use --fastqs to specify a directory containing cDNA and oligo subfolders.
- 🔸 Method 2: Use -c1, --cDNAfastq1, -c2, --cDNAfastq2, -i1, --oligofastq1, -i2, --oligofastq2 to specify R1 and R2 files respectively.
⚠️ Important Note: All files under a parameter must come from the same library, with consistent sequencing mode and dark reaction settings. Data from different libraries cannot be merged for analysis.
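The requirement that Read1 and Read2 lists keep exactly the same order can be illustrated with a small helper. This is a minimal sketch, not part of dnbc4tools: the function name `paired_fastq_args` and the assumption that file names differ only by an `_R1` / `_R2` tag are hypothetical and should be adapted to the actual naming scheme.

```python
from pathlib import Path


def paired_fastq_args(r1_files, r1_tag="_R1", r2_tag="_R2"):
    """Return (r1_arg, r2_arg): comma-separated lists in matching order.

    Hypothetical helper: assumes each Read2 file name is the Read1 name
    with r1_tag replaced by r2_tag.
    """
    r1_sorted = sorted(str(f) for f in r1_files)
    r2_list = []
    for r1 in r1_sorted:
        name = Path(r1).name
        if name.count(r1_tag) != 1:
            raise ValueError(f"cannot derive the Read2 name for {r1!r}")
        r2_list.append(str(Path(r1).with_name(name.replace(r1_tag, r2_tag))))
    return ",".join(r1_sorted), ",".join(r2_list)


c1, c2 = paired_fastq_args(
    ["sample_cDNA_L02_R1.fastq.gz", "sample_cDNA_L01_R1.fastq.gz"]
)
print("--cDNAfastq1", c1)
print("--cDNAfastq2", c2)
```

Deriving the Read2 list from the sorted Read1 list, rather than sorting both lists separately, keeps the two arguments aligned even when naming is irregular.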
-o, --outdir (Optional) Specify the output directory for all analysis results and reports.
Default: ./ (current directory)
Example:
--outdir ./output_results
-t, --threads (Optional) Set the number of CPU threads to be used during the analysis.
Default: Use all available CPU cores
Example:
--threads 16
--calling_method (Optional) Set the cell identification method to distinguish real cells from empty droplets.
Default: emptydrops
Example:
# Switch to barcoderanks for cell identification
dnbc4tools rna run --name sample1 --fastqs ./fq --genomeDir ./ref --calling_method barcoderanks
--expectcells (Optional) Set the expected number of recovered cells.
The auto mode is recommended; it estimates the cell count automatically from UMI distribution features. If the effective cell count is known, you can instead set this value manually to 50% of that number as a preliminary screening basis.
Default: auto
Example:
# Expect to recover 3000 cells
dnbc4tools rna run --name sample1 --fastqs ./fq --genomeDir ./ref --expectcells 3000
--forcecells (Optional) Force the pipeline to use an exact number of cells, overriding the software's automatic cell detection.
Default: None
Example:
# Force the output of 5000 cells for analysis
dnbc4tools rna run --name sample1 --fastqs ./fq --genomeDir ./ref --forcecells 5000
--minumi (Optional) Set the minimum UMI count to retain a cell.
Default: 1000
Example:
# Lower the UMI threshold for cell filtering to 500
dnbc4tools rna run --name sample1 --fastqs ./fq --genomeDir ./ref --minumi 500
Note
Cell identification is a critical step in single-cell analysis. Correct parameter settings and result interpretation directly impact the quality and reliability of subsequent analyses.
1. Abnormal Cell Count
Cell count too low
Symptom: Detected cells < 50% of expected.
Cause: UMI threshold too high, severe empty droplet contamination, poor library quality.
Solution: Lower --minumi, adjust --expectcells, check raw data quality.
Cell count too high
Symptom: Detected cells > 200% of expected.
Cause: Inaccurate cell counting, UMI threshold too low, high background noise.
Solution: Increase --minumi, use --forcecells to limit the count.
Abnormal UMI distribution
Symptom: UMI rank plot shows no clear "knee point".
Cause: Insufficient sequencing depth, poor library diversity, technical failure.
Solution: Increase sequencing depth, rebuild the library.
2. Abnormal Cell Identification Curve
Gradual decline with no knee point
Meaning: Difficult to distinguish between real cells and background empty droplets.
Solution: Use --forcecells to set a conservative cell count and combine with downstream QC.
Multiple knee points
Meaning: Presence of different cell populations or doublet contamination.
Solution: Choose the cell count corresponding to the main knee point and perform doublet detection and removal later.
Steep decline
Meaning: High-quality cells are clearly distinguished from the background, which is the ideal case.
Solution: Use the default emptydrops algorithm; you can consider lowering --minumi slightly.
Severe noise fluctuation
Meaning: High technical noise, poor data quality.
Solution: Increase the --minumi threshold, consider re-sequencing or optimizing experimental conditions.
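The cell-count symptoms in item 1 reduce to a ratio check against the expected number of cells. The sketch below applies the 50% and 200% thresholds stated above; the function name, the labels, and the `SUGGESTIONS` table are illustrative only and are not produced by dnbc4tools.

```python
def classify_cell_count(detected, expected):
    """Label a detected cell count relative to the expected count.

    Thresholds follow the troubleshooting list: below 50% of expected
    is "too_low", above 200% of expected is "too_high".
    """
    if expected <= 0:
        raise ValueError("expected must be a positive number")
    ratio = detected / expected
    if ratio < 0.5:
        return "too_low"
    if ratio > 2.0:
        return "too_high"
    return "within_range"


# Suggested actions, paraphrased from the list above.
SUGGESTIONS = {
    "too_low": "lower --minumi, adjust --expectcells, check raw data quality",
    "too_high": "increase --minumi, or use --forcecells to limit the count",
    "within_range": "no adjustment suggested by the cell count alone",
}

label = classify_cell_count(detected=1200, expected=3000)
print(label, "->", SUGGESTIONS[label])
```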
Best Practice Tip
For the initial analysis, run with the default parameters to get a preliminary result, then make targeted parameter adjustments based on the statistics and visualizations in the HTML report.
--chemistry (Optional) Configure the chemistry version of the scRNA kit, which determines the sequence structure of barcodes and UMIs.
Options: scRNAv1HT, scRNAv2HT, scRNAv3HT, scRNA5Pv1, auto (automatic detection).
Default: auto
Example:
# Scenario: Library is known to be scRNAv2HT and auto-analysis failed
dnbc4tools rna run --name sample2 --fastqs ./fq --genomeDir ./ref --chemistry scRNAv2HT
⚠️ Important Note: Incorrect settings may lead to cell barcode identification failure. Specify manually only if you know the library structure or if auto-detection fails.
--darkreaction (Optional) Configure the dark cycle settings for the cDNA and oligo libraries.
Format: <cDNA_setting>,<oligo_setting> (comma-separated).
Options for each field: auto (auto-detection), R1R2 (both ends), R1 (R1 only), unset (none).
Default: auto
Examples:
# Example 1: cDNA library has dark cycle on R1, oligo library has dark cycles on both ends
--darkreaction R1,R1R2
# Example 2: Both libraries have dark cycles on R1 only
--darkreaction R1,R1
# Example 3: Neither library has dark cycles
--darkreaction unset,unset
⚠️ Important Note: Incorrect settings may lead to cell barcode identification failure. Specify manually only if you know the library structure or if auto-detection fails.
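Because a wrong --darkreaction value can silently break barcode identification, it may help to check the string before launching a run. The following is a minimal sketch of such a check based only on the format described above; `parse_darkreaction` is a hypothetical helper, and treating a bare `auto` as `auto,auto` is an assumption made here, not documented behavior.

```python
DARK_OPTIONS = {"auto", "R1R2", "R1", "unset"}


def parse_darkreaction(value):
    """Split '<cDNA>,<oligo>' and validate both fields.

    Assumption: a bare 'auto' (the documented default) is treated as
    ('auto', 'auto').
    """
    if value == "auto":
        return ("auto", "auto")
    fields = value.split(",")
    if len(fields) != 2:
        raise ValueError("expected two comma-separated fields: <cDNA>,<oligo>")
    for field in fields:
        if field not in DARK_OPTIONS:
            raise ValueError(f"unknown dark cycle setting: {field!r}")
    return (fields[0], fields[1])


cdna, oligo = parse_darkreaction("R1,R1R2")
print(f"cDNA: {cdna}, oligo: {oligo}")
```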
--customize (Advanced) Precisely define the extraction structure for barcodes, UMIs, and effective sequences (reads) for non-standard libraries. This is an advanced feature that overrides --chemistry and --darkreaction settings.
Format: "<type>,<read>:<start>-<end>", with multiple segments separated by semicolons (;).
Types:
- cb: Cell Barcode
- umi: UMI (Unique Molecular Identifier)
- R1: Effective DNA sequence in Read1
- R2: Effective DNA sequence in Read2 (for paired-end sequencing only)
Specify the --customize parameter twice, once for the cDNA library and once for the oligo library.
Examples:
# For a cDNA library with structure: Barcode 1(1-10bp) + Barcode 2(11-20bp) + UMI(21-30bp) in R1; sequence(1-100bp) in R2
--customize "cb,R1:1-10;cb,R1:11-20;umi,R1:21-30;R1,R2:1-100"
# For a cDNA library with structure: Barcode 1(7-16bp) + Barcode 2(23-32bp) + UMI(38-47bp) in R1; sequence(1-100bp) in R2
--customize "cb,R1:7-16;cb,R1:23-32;umi,R1:38-47;R1,R2:1-100"
# For a 5'-end transcript cDNA library using data from both ends
--customize "cb,R1:1-10;cb,R1:11-20;umi,R1:21-30;R1,R1:31-120;R2,R2:1-150"
# Example: Custom sequence structures for cDNA and oligo libraries respectively
--customize "cb,R1:1-10;cb,R1:11-20;umi,R1:21-30;R1,R2:1-100" --customize "cb,R1:1-10;cb,R1:11-20;R1,R2:1-30"
⚠️ Risk Warning: Incorrect custom configurations can lead to data loss or analysis failure. Use only when standard configurations do not meet your needs.
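To make the segment syntax concrete, here is a minimal sketch of a parser for one --customize string. It is not the parser used by dnbc4tools: the `Segment` type and `parse_customize` function are hypothetical, and positions are read as 1-based and inclusive, which matches the examples above (1-10 is described as 10 bp).

```python
import re
from typing import NamedTuple


class Segment(NamedTuple):
    kind: str   # cb, umi, R1 or R2
    read: str   # source read: R1 or R2
    start: int  # 1-based, inclusive
    end: int    # 1-based, inclusive

    @property
    def length(self):
        return self.end - self.start + 1


_SEGMENT = re.compile(r"(cb|umi|R1|R2),(R1|R2):(\d+)-(\d+)")


def parse_customize(spec):
    """Parse '<type>,<read>:<start>-<end>' segments separated by ';'."""
    segments = []
    for part in spec.split(";"):
        match = _SEGMENT.fullmatch(part.strip())
        if match is None:
            raise ValueError(f"malformed segment: {part!r}")
        kind, read = match.group(1), match.group(2)
        start, end = int(match.group(3)), int(match.group(4))
        if start < 1 or end < start:
            raise ValueError(f"invalid range in segment: {part!r}")
        segments.append(Segment(kind, read, start, end))
    return segments


segments = parse_customize("cb,R1:7-16;cb,R1:23-32;umi,R1:38-47;R1,R2:1-100")
barcode_bp = sum(s.length for s in segments if s.kind == "cb")
print(f"{len(segments)} segments, {barcode_bp} bp of cell barcode")
```

Checking the total barcode and UMI lengths against the kit specification before a run is a cheap way to catch off-by-one ranges.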
--no_introns (Flag) Enable this parameter to filter out reads from intronic regions during analysis.
Default: If not set, reads from intronic regions are included.
--end5 (Flag) Enable 5'-end single-cell transcriptome data analysis mode.
Default: Not set.
--no_bam (Flag) Enable this parameter to skip the generation of BAM files.
Default: If not set, BAM files are generated.
--sample_read_pairs (Optional) Extract a specified number of read pairs from the input cDNA FASTQ files for analysis.
Default: None (uses all data)
Example:
--sample_read_pairs 100000000
💡 Analysis Recommendation
For the initial analysis, use the default parameters, then adjust them as needed based on the results report.
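When runs are scripted, the "defaults first, then adjust" workflow can be expressed by assembling the command line from optional overrides. This is a minimal sketch: `build_run_command` is a hypothetical wrapper, it covers only a subset of the options documented above, and it only builds the argument list without executing anything.

```python
import shlex


def build_run_command(name, genome_dir, fastqs, outdir=None, threads=None,
                      calling_method=None, expectcells=None, forcecells=None,
                      minumi=None, no_introns=False, no_bam=False):
    """Assemble an argument list for `dnbc4tools rna run`.

    Options left as None are omitted so the tool's defaults apply.
    """
    cmd = ["dnbc4tools", "rna", "run",
           "--name", name, "--genomeDir", genome_dir, "--fastqs", fastqs]
    optional = {
        "--outdir": outdir,
        "--threads": threads,
        "--calling_method": calling_method,
        "--expectcells": expectcells,
        "--forcecells": forcecells,
        "--minumi": minumi,
    }
    for flag, value in optional.items():
        if value is not None:
            cmd += [flag, str(value)]
    if no_introns:
        cmd.append("--no_introns")
    if no_bam:
        cmd.append("--no_bam")
    return cmd


# First pass with defaults, second pass with a lowered UMI threshold.
print(shlex.join(build_run_command("sample1", "./ref", "./fq")))
print(shlex.join(build_run_command("sample1", "./ref", "./fq", minumi=500)))
```

The returned list can be passed to `subprocess.run(cmd, check=True)` once the command looks right.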
$ dnbc4tools rna mkref -h
usage: dnbc4tools rna mkref [-h]
optional arguments:
-h, --help show this help message and exit
Input Files:
Input genome FASTA files and gene annotation GTF files. For mixed species analysis, separate multiple files with commas.
--fasta <FILE> Reference genome FASTA file path(s). Separate multiple files with commas
--ingtf <FILE> Gene annotation GTF file path(s). Separate multiple files with commas
Basic Settings:
--genomeDir <DIR> Output directory for generated reference files [default: current directory]
--species <STR> Species identifier(s). Use commas for mixed species analysis [default: undefined]
--threads <INT> Number of CPU threads for parallel processing [default: 10]
Advanced Settings:
Advanced configuration options for reference genome building.
Use these settings to customize STAR indexing behavior and resource usage.
Parameters in extra-args will override default parameters if conflicts exist.
Can be a space-separated string of parameters (e.g., "--sjdbOverhang 100 --runThreadN 16").
--chrM <STR> Mitochondrial chromosome identifier in reference genome [default: auto]
--limitram <INT> Maximum RAM (GB) allowed for index generation
--extra-args <STR> Additional STAR parameters to pass directly to STAR index generation
--noindex Skip STAR index generation step
--fasta (Required) Provide the reference genome sequence file.
Default: None
Example:
--fasta Homo_sapiens.GRCh38.dna.primary_assembly.fa
--ingtf (Required) Provide the gene structure annotation file.
The file must contain gene/transcript and exon type annotation entries.
The file must contain gene_id/gene_name and transcript_id/transcript_name attributes.
Default: None
Example:
--ingtf Homo_sapiens.GRCh38.108.gtf
Note
Dual-Species Analysis Configuration
For dual-species analysis, both --fasta and --ingtf parameters support providing file paths for two species, separated by commas.
Example: --fasta human.fa,mouse.fa --ingtf human.gtf,mouse.gtf
The file order must be strictly consistent with the order of the --species parameter, meaning each FASTA file corresponds to its respective GTF file and species name in the list.
--genomeDir (Optional) Specify the output directory for the generated reference database.
Output directory structure:
genomeDir/
├── fasta/
│   └── genome.fa                  # Processed genome sequence file
├── genes/
│   └── genes.gtf                  # Processed gene annotation file
├── star/
│   ├── SA                         # STAR index file
│   ├── SAindex                    # STAR index core file
│   ├── chrLength.txt              # Chromosome length information
│   ├── chrName.txt                # Chromosome name information
│   ├── chrNameLength.txt          # Chromosome name and length
│   ├── chrStart.txt               # Chromosome start position
│   ├── Genome                     # Genome sequence compressed file
│   ├── genomeParameters.txt       # Genome parameter configuration
│   ├── Log.out                    # STAR index construction log
│   ├── sjdbInfo.txt               # Splice junction database information
│   ├── sjdbList.fromGTF.out.tab   # Splice junctions extracted from GTF
│   ├── sjdbList.out.tab           # List of all splice junctions
│   └── mtgene.list                # List of mitochondrial genes
└── ref.json                       # Database configuration and metadata file
Default: ./ (current directory)
Example:
dnbc4tools rna mkref --fasta genome.fa --ingtf genes.gtf --genomeDir /database/scRNA/GRCh38
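--species (Optional) Specify one or more species names for the reference database.
- For multiple species, separate the names with commas (e.g., hg38,mm10).
- The order of the names must match the order of the --fasta and --ingtf files.
- In multi-species mode, the species name is used as a prefix (e.g., hg38_GENE1) and separates statistical information in the results.
Providing this parameter for specific species enables automatic downstream cell type annotation.
Species names supported for annotation: Homo_sapiens (or hg38), Mus_musculus (or mm10).
Default: undefined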
Examples:
# Single species
--species Homo_sapiens
# Dual species (human + mouse)
--species hg38,mm10
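The ordering rule for dual-species references (each FASTA file lines up with its GTF file and species name) can be enforced before calling mkref. The sketch below is a hypothetical wrapper named `build_mkref_command`; it requires a species name per genome for the sake of the check, although --species itself is optional, and it only assembles the argument list.

```python
def build_mkref_command(fasta, gtf, species, genome_dir=None):
    """Assemble an argument list for `dnbc4tools rna mkref`.

    fasta, gtf and species are lists given in the same order; position i
    in each list must describe the same genome.
    """
    if not fasta:
        raise ValueError("at least one FASTA file is required")
    if not (len(fasta) == len(gtf) == len(species)):
        raise ValueError("fasta, gtf and species must have the same length")
    cmd = ["dnbc4tools", "rna", "mkref",
           "--fasta", ",".join(fasta),
           "--ingtf", ",".join(gtf),
           "--species", ",".join(species)]
    if genome_dir is not None:
        cmd += ["--genomeDir", genome_dir]
    return cmd


cmd = build_mkref_command(
    fasta=["human.fa", "mouse.fa"],
    gtf=["human.gtf", "mouse.gtf"],
    species=["hg38", "mm10"],
)
print(" ".join(cmd))
```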
--threads (Optional) Set the number of CPU threads to be used during STAR index construction.
Default: 10
Example:
--threads 16
--chrM (Optional) Specify the name of the mitochondrial chromosome.
In auto mode, common names are detected (e.g., chrM, MT).
Default: auto
Example:
# If the mitochondrial chromosome name is "mitochondrion"
dnbc4tools rna mkref --fasta genome.fa --ingtf genes.gtf --chrM mitochondrion
--limitram (Optional) Set the maximum available memory (in GB) for the STAR genome index generation process.
Default: None
Example:
--limitram 64
--extra-args (Advanced) Pass additional command-line arguments directly to STAR index generation.
Default: None
Example:
--extra-args "--sjdbOverhang 99 --runThreadN 20"
--noindex (Flag) If this parameter is set, only the configuration file is generated and the genome index is not built.
Default: Not set
Example:
# Generate only the configuration file, do not build the index
dnbc4tools rna mkref --fasta genome.fa --ingtf genes.gtf --noindex
Tip
Database Construction Technical Notes:
- STAR index parameters involved: genomeSAindexNbases and genomeChrBinNbits.
- A ref.json file will be generated in the database directory to record all key configuration information.
Single-Species ref.json File Example:
{
"chrmt": "chrM",
"genome": "/database/scRNA/Homo_sapiens/fasta/genome.fa",
"genomeDir": "/database/scRNA/Homo_sapiens/star",
"gtf": "/database/scRNA/Homo_sapiens/genes/genes.gtf",
"input_fasta_files": [
"genome.fa"
],
"input_gtf_files": [
"genes.gtf"
],
"mtgenes": "/database/scRNA/Homo_sapiens/star/mtgene.list",
"species": "Homo_sapiens",
"version": "dnbc4tools 3.0beta"
}
Dual-Species ref.json File Example:
{
"chrmt": "hg38_chrM,mm10_chrM",
"genome": "/database/scRNA/hg38_and_mm10/fasta/genome.fa",
"genomeDir": "/database/scRNA/hg38_and_mm10/star",
"gtf": "/database/scRNA/hg38_and_mm10/genes/genes.gtf",
"input_fasta_files": [
"hg38_genome.fa",
"mm10_genome.fa"
],
"input_gtf_files": [
"hg38_genes.gtf",
"mm10_genes.gtf"
],
"mtgenes": "/database/scRNA/hg38_and_mm10/star/mtgene.list",
"species": "hg38_and_mm10",
"version": "dnbc4tools 3.0beta"
}
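A quick way to confirm which reference a run will use is to read ref.json and summarize its fields. The sketch below relies only on the keys shown in the two examples above; `summarize_ref` is a hypothetical helper, and splitting `chrmt` on commas is an inference from the dual-species example rather than a documented rule.

```python
import json


def summarize_ref(ref_text):
    """Summarize the key fields of a ref.json document given as text."""
    ref = json.loads(ref_text)
    return {
        "species": ref["species"],
        # Assumption: multiple mitochondrial names are comma-separated.
        "mito_chromosomes": ref["chrmt"].split(","),
        "n_input_genomes": len(ref["input_fasta_files"]),
        "star_index": ref["genomeDir"],
        "version": ref["version"],
    }


example = """
{
  "chrmt": "hg38_chrM,mm10_chrM",
  "genome": "/database/scRNA/hg38_and_mm10/fasta/genome.fa",
  "genomeDir": "/database/scRNA/hg38_and_mm10/star",
  "gtf": "/database/scRNA/hg38_and_mm10/genes/genes.gtf",
  "input_fasta_files": ["hg38_genome.fa", "mm10_genome.fa"],
  "input_gtf_files": ["hg38_genes.gtf", "mm10_genes.gtf"],
  "mtgenes": "/database/scRNA/hg38_and_mm10/star/mtgene.list",
  "species": "hg38_and_mm10",
  "version": "dnbc4tools 3.0beta"
}
"""
for key, value in summarize_ref(example).items():
    print(f"{key}: {value}")
```

For a file on disk, pass `Path("ref.json").read_text()` (from `pathlib`) as `ref_text`.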
Performance Optimization Recommendations:
$ dnbc4tools rna multi -h
usage: dnbc4tools rna multi [-h]
optional arguments:
-h, --help show this help message and exit
--list <LIST> Path to the sample list file. Each line should contain sample name, cDNA FASTQ paths, and oligo FASTQ paths.
--genomeDir <DATABASE>
Path to the directory containing genome files.
--outdir <OUTDIR> Output directory. [default: current directory].
--threads <CORENUM> Number of threads used for analysis. [default: 20].
--end5 Perform 5'-end single-cell transcriptome analysis.
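--list (Required) Specify the path to the list file containing information for multiple samples.
- Format: tab-separated (\t) text file, UTF-8 encoding recommended.
- Columns: sample name, cDNA FASTQ paths, oligo FASTQ paths.
- Multiple files of the same read are separated by commas (,).
- The R1 and R2 file groups are separated by a semicolon (;).
Default: None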
# Example 1: SampleA, with 1 pair of R1/R2 files for cDNA and oligo each
SampleA /path/to/A_cDNA_R1.fq.gz;/path/to/A_cDNA_R2.fq.gz /path/to/A_oligo_R1.fq.gz;/path/to/A_oligo_R2.fq.gz
# Example 2: SampleB, with 2 pairs of R1/R2 files for cDNA, and 1 pair for oligo
SampleB /path/to/B_cDNA_L01_R1.fq.gz,/path/to/B_cDNA_L02_R1.fq.gz;/path/to/B_cDNA_L01_R2.fq.gz,/path/to/B_cDNA_L02_R2.fq.gz /path/to/B_oligo_R1.fq.gz;/path/to/B_oligo_R2.fq.gz
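The three separators in the list file (tab between columns, semicolon between R1 and R2, comma between multiple files) are easy to mix up, so a pre-flight check may be worthwhile. The following is a minimal sketch of such a check inferred from the examples above; `parse_list_line` is a hypothetical helper and does not reproduce how dnbc4tools itself reads the file.

```python
def _split_reads(column):
    """Split 'r1a,r1b;r2a,r2b' into ([r1a, r1b], [r2a, r2b])."""
    groups = column.split(";")
    if len(groups) != 2:
        raise ValueError("expected '<R1 files>;<R2 files>'")
    r1, r2 = groups[0].split(","), groups[1].split(",")
    if len(r1) != len(r2):
        raise ValueError("R1 and R2 must list the same number of files")
    return r1, r2


def parse_list_line(line):
    """Parse one tab-separated line: name, cDNA paths, oligo paths."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 3:
        raise ValueError("expected 3 tab-separated columns")
    name, cdna, oligo = fields
    cdna_r1, cdna_r2 = _split_reads(cdna)
    oligo_r1, oligo_r2 = _split_reads(oligo)
    return {"name": name,
            "cdna_r1": cdna_r1, "cdna_r2": cdna_r2,
            "oligo_r1": oligo_r1, "oligo_r2": oligo_r2}


line = ("SampleB\t"
        "/path/to/B_cDNA_L01_R1.fq.gz,/path/to/B_cDNA_L02_R1.fq.gz;"
        "/path/to/B_cDNA_L01_R2.fq.gz,/path/to/B_cDNA_L02_R2.fq.gz\t"
        "/path/to/B_oligo_R1.fq.gz;/path/to/B_oligo_R2.fq.gz")
sample = parse_list_line(line)
print(sample["name"], len(sample["cdna_r1"]), "cDNA pairs,",
      len(sample["oligo_r1"]), "oligo pair")
```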
Parameter Inheritance Note
For other analysis parameter settings, please refer to the corresponding parameters of the dnbc4tools rna run command. All samples should use the same reference database.
💡 Tip
This document is continuously updated. If you find any errors or have information to add, your feedback is welcome.
Document Version: 3.0 beta | Last Updated: 2025
🧬 DNBelab C Series HT scRNA Analysis Software
High-performance single-cell transcriptome data analysis pipeline