Main Analysis Pipeline (run) • Reference Database Construction (mkref) • Multi-sample Operations (multi)
$ dnbc4tools atac run --help
Usage: dnbc4tools atac run [OPTIONS]
optional arguments:
-h, --help show this help message and exit
Input Files:
Choose ONE input method: either --fastqs (directory) OR individual FASTQ files (-1 and -2).
--fastqs <DIR> Input directory containing paired-end FASTQ files. The pipeline automatically detects Read1/Read2 files. Example: ./fastq_dir
-1, --fastq1 <FILE> [<FILE> ...]
Read1 FASTQ file(s) for the ATAC library (supports wildcards and comma-separated lists). Example: sample1_L01_R1.fastq.gz,sample1_L02_R1.fastq.gz
-2, --fastq2 <FILE> [<FILE> ...]
Read2 FASTQ file(s) for the ATAC library (supports wildcards and comma-separated lists). Must match --fastq1 order. Example: sample1_L01_R2.fastq.gz,sample1_L02_R2.fastq.gz
Basic Settings:
-n, --name <STR> Unique identifier for the sample (e.g., sample1). Used for naming output files and reports.
-g, --genomeDir <DIR>
Path to reference genome directory. Must contain the required index and annotation resources.
-o, --outdir <DIR> Output directory for results and reports [default: current directory]. Example: ./output
-t, --threads <INT> Number of CPU threads for parallel processing [default: 10].
Library Settings:
Configure sequencing library settings and dark cycles.
Auto-detection is recommended for dark cycles.
Use --customize to specify sequence structure patterns when needed.
--darkreaction <STR> Dark cycle setting for ATAC library [default: auto]. Options: auto (automatic detection), R1R2 (both reads), R1 (Read1 only), R2 (Read2 only), unset (no dark cycles).
--customize <STR> Customize read structure for barcode/sequence extraction, format: <type>,<read>:<start>-<end> separated by ';'. Types: cb (cell barcode), R1 (sequence from Read1), R2 (sequence from Read2). Example:
"cb,R1:1-10;cb,R1:11-20;R1,R1:21-70;R2,R2:1-50".
Filtering Settings:
--forcecells <INT> Force pipeline to use exactly this number of cells, overriding detection (e.g., 5000).
--frags_cutoff <INT> Minimum number of unique fragments to retain a cell [default: 1000].
--tss_cutoff <FLOAT> Minimum TSS proportion threshold to retain a cell [default: 0] (e.g., 0.2).
--jaccard_cutoff <FLOAT>
Jaccard similarity threshold for merging beads (e.g., 0.02).
--merge_cutoff <INT> Minimum number of fragments when merging beads [default: 1000].
Analysis Settings:
--need_bam Enable generation of BAM files containing aligned reads. Note: generating BAM files increases computational time and disk space usage.
--sample_read_pairs <INT>
Subsample the specified number of read pairs from the input FASTQ files (e.g., 1000000).
⚠️ Essential parameters that must be specified for a successful analysis
-n, --name (Required) Provide a unique name for this analysis run.
Default: None
Example:
--name sample_001
-g, --genomeDir (Required) Specify the path to the reference genome directory.
This directory should be one built with the dnbc4tools atac mkref command.
Default: None
Example:
--genomeDir /path/to/genome/database
Choose one input method: directory-based OR individual files
--fastqs (Method 1) Specify the path to the directory containing all FASTQ files.
Cannot be combined with --fastq1 / --fastq2.
Default: None
Example:
--fastqs ./fastq_directory
-1, --fastq1 (Method 2A) Specify one or more Read1 FASTQ files individually.
Use a wildcard (*) to match files or a comma-separated list for multiple files.
Must be used together with the --fastq2 parameter, and the file order must match exactly.
Default: None
Example:
--fastq1 sample1_L01_R1.fastq.gz,sample1_L02_R1.fastq.gz
-2, --fastq2 (Method 2B) Specify one or more Read2 FASTQ files individually.
Use a wildcard (*) to match files or a comma-separated list for multiple files.
Must be used together with the --fastq1 parameter, and the file order must match exactly.
Default: None
Example:
--fastq2 sample1_L01_R2.fastq.gz,sample1_L02_R2.fastq.gz
⚠️ Input Method Selection:
- 🔸 Method 1: Use --fastqs to specify a directory containing paired FASTQ files.
- 🔸 Method 2: Use -1, --fastq1 and -2, --fastq2 to specify R1 and R2 files respectively.
⚠️ Important Note: All files under a parameter must come from the same library, with consistent sequencing mode and dark reaction settings. Data from different libraries cannot be merged for analysis.
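Because the order of --fastq1 and --fastq2 must match, one way to avoid a mismatch is to derive the Read2 list from the Read1 list. A minimal sketch, assuming lane files follow the _R1/_R2 naming used in the examples above; the variable names are illustrative, and the command is only printed for review, not executed:

```shell
# Read1 files for one library, comma-separated as --fastq1 expects.
r1_list="sample1_L01_R1.fastq.gz,sample1_L02_R1.fastq.gz"

# Assumption: each Read2 file has the same name with _R1 replaced by _R2,
# so substituting in place keeps both lists in the same order.
r2_list=$(printf '%s' "$r1_list" | sed 's/_R1\.fastq\.gz/_R2.fastq.gz/g')

# Print the resulting command for review instead of executing it.
echo "dnbc4tools atac run --name sample1 --genomeDir ./ref --fastq1 $r1_list --fastq2 $r2_list"
```

If the file names do not follow this convention, write both lists by hand and compare them entry by entry.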
-o, --outdir (Optional) Specify the output directory for all analysis results and reports.
Default: ./ (current directory)
Example:
--outdir ./output_results
-t, --threads (Optional) Set the number of CPU threads to be used during the analysis.
Default: 10
Example:
--threads 16
--darkreaction (Optional) Configure the dark cycle settings for the ATAC library to ensure accurate cell barcode identification.
| Option | Description | Use Case |
|---|---|---|
| auto | (Default) Automatically detects the dark cycle configuration and applies the optimal setting based on library type. | Applicable to all standard ATAC sequencing data. |
| R1R2 | Both Read1 and Read2 contain dark cycle bases. | Applicable for dual-end dark cycle sequencing designs. |
| R1 | Only the Read1 end contains dark cycle bases. | Applicable for single-end dark cycle (Read1 direction) sequencing designs. |
| R2 | Only the Read2 end contains dark cycle bases. | Applicable for single-end dark cycle (Read2 direction) sequencing designs. |
| unset | The library contains no dark cycle bases; no dark cycle correction is performed. | Applicable for non-MGI platforms or sequencing designs without dark cycles. |
Examples:
# Scenario 1: Initial analysis, using auto-detection
dnbc4tools atac run --name sample1 --fastqs ./fq --genomeDir ./ref
# Scenario 2: The library is known to have dark cycles only on R1, and auto-detection fails or misidentifies them
dnbc4tools atac run --name sample2 --fastqs ./fq --genomeDir ./ref --darkreaction R1
⚠️ Important Note: Incorrect settings can lead to cell barcode identification failure or loss of sequence information. Specify manually only if you understand the library structure or if auto-detection fails.
--customize (Advanced) Precisely define the extraction structure for barcodes and effective sequences (reads) for non-standard libraries.
Use it when the standard --darkreaction options are not applicable. It overrides any --darkreaction settings.
Format: "<type>,<read>:<start>-<end>";..., with multiple segments separated by semicolons (;). Coordinates are 1-based.
| Type | Description | Example |
|---|---|---|
| cb | Cell Barcode | cb,R1:1-10 |
| R1 | Effective DNA sequence in Read1 | R1,R1:21-70 |
| R2 | Effective DNA sequence in Read2 | R2,R2:1-50 |
Examples:
# Example 1: Assume R1 structure is: Barcode 1 (10bp) -> Barcode 2 (10bp) -> Insert (50bp). R2 structure is: Insert (50bp).
--customize "cb,R1:1-10;cb,R1:11-20;R1,R1:21-70;R2,R2:1-50"
# Example 2: Assume R1 structure is: Fixed (6bp) -> Barcode 1 (10bp) -> Fixed (6bp) -> Barcode 2 (10bp) -> Fixed (33bp) -> Insert (50bp). R2 structure is: Fixed (19bp) -> Insert (50bp).
--customize "cb,R1:7-16;cb,R1:23-32;R1,R1:66-115;R2,R2:20-69"
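Since coordinates are 1-based and, judging by the examples above, inclusive at both ends, each segment spans end - start + 1 bases. The sketch below uses a hypothetical helper, customize_lengths (not part of dnbc4tools), to list the length of every segment in a --customize string so it can be compared with the expected library structure before a run:

```shell
# Hypothetical helper (not part of dnbc4tools): print "<type> <read> <length>"
# for each segment of a --customize string.
customize_lengths() {
    printf '%s\n' "$1" | tr ';' '\n' |
    while IFS=',:-' read -r seg_type seg_read seg_start seg_end; do
        # 1-based, inclusive coordinates: length = end - start + 1
        echo "$seg_type $seg_read $((seg_end - seg_start + 1))"
    done
}

# Example 2 from above: two 10 bp barcodes and two 50 bp inserts are expected.
customize_lengths "cb,R1:7-16;cb,R1:23-32;R1,R1:66-115;R2,R2:20-69"
```

A length that differs from the library design usually points to an off-by-one in a start or end coordinate.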
⚠️ Notes:
--forcecells (Optional) Force the pipeline to use an exact number of cells, overriding the software's automatic cell detection.
Default: None
Example:
# Force the output of 5000 cells for analysis
dnbc4tools atac run --name sample1 --fastqs ./fq --genomeDir ./ref --forcecells 5000
--frags_cutoff (Optional) Set the minimum number of unique fragments to retain a cell.
Default: 1000
Example:
# Lower the fragment threshold for cell filtering to 500
dnbc4tools atac run --name sample1 --fastqs ./fq --genomeDir ./ref --frags_cutoff 500
--tss_cutoff (Optional) Set the minimum proportion of fragments in TSS regions to retain a cell.
Default: 0 (no filtering)
Example:
# Filter out cells with a TSS region fragment proportion below 0.1
dnbc4tools atac run --name sample1 --fastqs ./fq --genomeDir ./ref --tss_cutoff 0.1
--jaccard_cutoff (Optional) The Jaccard similarity threshold for merging multiple barcodes (beads) that potentially belong to the same cell.
Set to auto to let the software determine the threshold automatically using the Otsu algorithm.
Default: auto
Example:
# Manually set the Jaccard similarity threshold to 0.02
dnbc4tools atac run --name sample1 --fastqs ./fq --genomeDir ./ref --jaccard_cutoff 0.02
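For orientation, the Jaccard index of two sets is the size of their intersection divided by the size of their union. The toy sketch below computes it for two made-up lists of fragment identifiers; exactly what the pipeline compares between beads is not described here, so the inputs are an assumption for illustration only:

```shell
# Toy fragment identifiers observed for two beads (illustrative values only;
# each list is assumed to contain no duplicates).
bead_a="f1 f2 f3 f4"
bead_b="f3 f4 f5"

# After sorting the combined list, identifiers present in both beads appear
# twice (intersection) and the distinct identifiers form the union.
all=$(printf '%s\n' $bead_a $bead_b | sort)
n_inter=$(printf '%s\n' "$all" | uniq -d | wc -l | tr -d ' ')
n_union=$(printf '%s\n' "$all" | uniq | wc -l | tr -d ' ')

# Jaccard index = |A intersect B| / |A union B|
jaccard=$(awk -v i="$n_inter" -v u="$n_union" 'BEGIN { printf "%.2f", i / u }')
echo "Jaccard index: $jaccard"
```

Here the two beads share 2 of 5 distinct identifiers, so the index is 0.40; a pair of beads scoring above the chosen --jaccard_cutoff would be a candidate for merging.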
--merge_cutoff (Optional) Set the minimum number of fragments required for a bead to be included in the Jaccard merging process.
Default: 500
Example:
# For a low-fragment sample, lower the threshold to 200 to include more beads for merging
dnbc4tools atac run --name sample1 --fastqs ./fq --genomeDir ./ref --merge_cutoff 200
--need_bam (Flag) Enable the generation of BAM format files.
Default: If not set, BAM files are not generated.
--sample_read_pairs (Optional) Extract a specified number of read pairs from the input FASTQ files for analysis.
Default: None (uses all data)
Example:
--sample_read_pairs 100000000
💡 Analysis Recommendation
For the initial analysis, use the default parameters, then adjust them as needed based on the results report.
$ dnbc4tools atac mkref --help
Usage: dnbc4tools atac mkref [OPTIONS]
optional arguments:
-h, --help show this help message and exit
Input files:
Input genome FASTA and gene annotation GTF files. For mixed species analysis, use comma to separate multiple files.
--fasta <FILE> Path to reference genome FASTA file. Multiple files separated by comma
--ingtf <FILE> Path to gene annotation GTF file. Multiple files separated by comma
Basic settings:
--genomeDir <DIR> Output directory for reference files [default: current directory]
--species <STR> Species identifier. For mixed species analysis, use comma separated [default: undefined]
Advanced settings:
--tag <TYPE> Select type to generate BED file [default: transcript]
--chrM <STR> Mitochondrial chromosome identifier in reference genome [default: auto]
--chloroplast <STR> Chloroplast chromosome name, particularly recommended for plants, e.g. "Pt"
--prefix <STR> Filter chromosomes by prefix or full name. Not supported for mixed species
--kmer <INT> k-mer length, this determines the size of the substrings being extracted [default: 17]
--window <INT> Window size, this defines the number of consecutive k-mers within a window [default: 7]
--noindex Only generate ref.json without building genome index
--fasta (Required) Provide the reference genome sequence file.
Default: None
Example:
--fasta Homo_sapiens.GRCh38.dna.primary_assembly.fa
--ingtf (Required) Provide the gene structure annotation file.
The GTF should contain gene and transcript type annotation entries.
Default: None
Example:
--ingtf Homo_sapiens.GRCh38.108.gtf
--genomeDir (Optional) Specify the output directory for the generated reference database.
The generated directory is laid out as follows:
<genomeDir/species>/
├── fasta/
│   ├── genome.fa
│   └── genome.index
├── genes/
│   └── genes.gtf
├── regions/
│   ├── chrom.sizes
│   ├── promoter.bed
│   └── tss.bed
└── ref.json
Default: ./ (current directory)
Example:
dnbc4tools atac mkref --fasta genome.fa --ingtf genes.gtf --genomeDir /database/scATAC/GRCh38
--species (Optional) Specify a species name for the reference database, for example Homo_sapiens.
Default: undefined
Example:
dnbc4tools atac mkref --fasta genome.fa --ingtf genes.gtf --species Homo_sapiens
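The help text states that mixed-species references take comma-separated values for --fasta, --ingtf and --species. A sketch of how such a call might be assembled, with hypothetical file names and a hypothetical count_items helper; that the three lists should have the same length and order is an assumption, not something stated above. The command is only printed, not executed:

```shell
# Hypothetical input files for a human/mouse mixed reference.
fasta_list="human.fa,mouse.fa"
gtf_list="human.gtf,mouse.gtf"
species_list="Homo_sapiens,Mus_musculus"

# Hypothetical helper: count the entries in a comma-separated list.
count_items() { printf '%s\n' "$1" | tr ',' '\n' | wc -l | tr -d ' '; }
n_fasta=$(count_items "$fasta_list")
n_gtf=$(count_items "$gtf_list")
n_species=$(count_items "$species_list")

# Assumption: one FASTA, one GTF and one species name per organism.
if [ "$n_fasta" = "$n_gtf" ] && [ "$n_gtf" = "$n_species" ]; then
    # Print the command for review instead of executing it.
    echo "dnbc4tools atac mkref --fasta $fasta_list --ingtf $gtf_list --species $species_list"
else
    echo "list lengths differ: $n_fasta FASTA, $n_gtf GTF, $n_species species" >&2
fi
```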
--tag (Optional) Select the source of information for generating the TSS (Transcription Start Site) file.
Options: gene (uses gene start sites) or transcript (uses transcript start sites).
The transcript mode can yield more accurate TSS enrichment analysis results.
Default: transcript
Example:
# Generate TSS file based on transcript start sites
dnbc4tools atac mkref --fasta genome.fa --ingtf genes.gtf --tag transcript
--chrM (Optional) Specify the name of the mitochondrial chromosome as it appears in the reference genome (e.g., chrM, MT).
Default: auto
Example:
# If the mitochondrial chromosome name is "mitochondrion"
dnbc4tools atac mkref --fasta genome.fa --ingtf genes.gtf --chrM mitochondrion
--chloroplast (Plant-specific) Specify the name of the chloroplast chromosome, recommended for plant samples.
Default: None
Example:
# Specify chloroplast chromosome name for Arabidopsis genome
dnbc4tools atac mkref --fasta TAIR10.fa --ingtf Athaliana.gtf --chloroplast Pt
--kmer (Optional) Set the k-mer length used during Chromap index construction.
Default: 17
Example:
# Lower k-mer length to reduce memory usage
dnbc4tools atac mkref --fasta genome.fa --ingtf genes.gtf --kmer 15
--window (Optional) Set the window size used during Chromap index construction.
Adjust it together with the --kmer parameter for optimal results.
Default: 7
Example:
# Adjust window size
dnbc4tools atac mkref --fasta genome.fa --ingtf genes.gtf --window 5
--noindex (Flag) Generate only the configuration file (ref.json) without building the genome index.
Default: Not set
Example:
# Generate only the configuration file, do not build the index
dnbc4tools atac mkref --fasta genome.fa --ingtf genes.gtf --noindex
Tip
Database Construction Notes:
ref.json File Example:
{
"species": "Homo_sapiens",
"input_fasta_files": [
"genome.fa"
],
"input_gtf_files": [
"genes.gtf"
],
"genome": "/database/scATAC/Homo_sapiens/fasta/genome.fa",
"index": "/database/scATAC/Homo_sapiens/fasta/genome.index",
"gtf": "/database/scATAC/Homo_sapiens/genes/genes.gtf",
"chrmt": "chrM",
"chloroplast": "None",
"chromeSize": "/database/scATAC/Homo_sapiens/regions/chrom.sizes",
"tss": "/database/scATAC/Homo_sapiens/regions/tss.bed",
"promoter": "/database/scATAC/Homo_sapiens/regions/promoter.bed",
"version": "3.0beta",
"blacklist": "None",
"genomesize": "hs"
}
Important Notes:
$ dnbc4tools atac multi
Usage: dnbc4tools atac multi [OPTIONS]
optional arguments:
-h, --help show this help message and exit
--list <STR> Path to the sample list file. Each line should contain sample name and FASTQ paths.
--outdir <DIR> Output directory. [default: current directory].
--threads <INT> Number of threads used for analysis.
--genomeDir <DIR> Path to the directory where genome files are stored.
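--list (Required) Specify the path to the list file containing information for multiple samples.
Format: a tab-delimited (\t) text file, UTF-8 encoding recommended. Each line holds the sample name, a tab, then the FASTQ paths.
Multiple files for the same read are separated by commas (,).
The Read1 and Read2 file lists are separated by a semicolon (;).
# Scenario 1: Sample A, with one pair of R1/R2 files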
SampleA /path/to/SampleA_R1.fastq.gz;/path/to/SampleA_R2.fastq.gz
# Scenario 2: Sample B, with two pairs of R1/R2 files (files for the same Read are comma-separated)
SampleB /path/to/B_L01_R1.fq.gz,/path/to/B_L02_R1.fq.gz;/path/to/B_L01_R2.fq.gz,/path/to/B_L02_R2.fq.gz
Default: None
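The list format can be hard to verify by eye because the tab separator is invisible. A minimal sketch that writes the two sample lines above with explicit tabs and then checks that every line has exactly two tab-separated fields, with a single semicolon in the second; the temporary file stands in for whatever path is later passed to --list:

```shell
# Temporary file for the sketch; in practice use any path and pass it to --list.
list_file=$(mktemp)

# printf makes the tab (\t) between sample name and FASTQ paths explicit.
{
    printf '%s\t%s;%s\n' "SampleA" \
        "/path/to/SampleA_R1.fastq.gz" "/path/to/SampleA_R2.fastq.gz"
    printf '%s\t%s;%s\n' "SampleB" \
        "/path/to/B_L01_R1.fq.gz,/path/to/B_L02_R1.fq.gz" \
        "/path/to/B_L01_R2.fq.gz,/path/to/B_L02_R2.fq.gz"
} > "$list_file"

# Count lines that do NOT have two tab-separated fields with exactly one ';'
# in the second field; 0 means the layout matches the description above.
bad_lines=$(awk -F '\t' 'NF != 2 || split($2, parts, ";") != 2 { n++ } END { print n + 0 }' "$list_file")
echo "malformed lines: $bad_lines"
```

The check covers layout only; it does not confirm that the listed FASTQ files exist or that the Read1 and Read2 lists are in matching order.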
Parameter Inheritance Note
For other analysis parameter settings, please refer to the corresponding parameters of the dnbc4tools atac run command.
💡 Tip
This document is continuously updated. If you find any errors or have information to add, your feedback is welcome.
Document Version: 3.0 beta | Last Updated: 2025
🔬 DNBelab C Series HT scATAC Analysis Software
High-performance single-cell ATAC sequencing data analysis pipeline