🧬 DNBelab C Series HT scATAC Analysis Parameters

🔬 Main Analysis Pipeline (run) • 📊 Reference Database Construction (mkref) • 📋 Multi-sample Operations (multi)

🔬 Main Analysis Pipeline (run)

📊 Usage

$ dnbc4tools atac run --help
Usage: dnbc4tools atac run [OPTIONS]

optional arguments:
  -h, --help            show this help message and exit

Input Files:
  Choose ONE input method: either --fastqs (directory) OR individual FASTQ files (-1 and -2).

  --fastqs <DIR>        Input directory containing paired-end FASTQ files. The pipeline automatically detects Read1/Read2 files. Example: ./fastq_dir
  -1, --fastq1 <FILE> [<FILE> ...]
                        Read1 FASTQ file(s) for the ATAC library (supports wildcards and comma-separated lists). Example: sample1_L01_R1.fastq.gz,sample1_L02_R1.fastq.gz
  -2, --fastq2 <FILE> [<FILE> ...]
                        Read2 FASTQ file(s) for the ATAC library (supports wildcards and comma-separated lists). Must match --fastq1 order. Example: sample1_L01_R2.fastq.gz,sample1_L02_R2.fastq.gz

Basic Settings:
  -n, --name <STR>      Unique identifier for the sample (e.g., sample1). Used for naming output files and reports.
  -g, --genomeDir <DIR>
                        Path to reference genome directory. Must contain the required index and annotation resources.
  -o, --outdir <DIR>    Output directory for results and reports [default: current directory]. Example: ./output
  -t, --threads <INT>   Number of CPU threads for parallel processing [default: 10].

Library Settings:
  Configure sequencing library settings and dark cycles.
  Auto-detection is recommended for dark cycles.
  Use --customize to specify sequence structure patterns when needed.

  --darkreaction <STR>  Dark cycle setting for ATAC library [default: auto]. Options: auto (automatic detection), R1R2 (both reads), R1 (Read1 only), R2 (Read2 only), unset (no dark cycles).
  --customize <STR>     Customize read structure for barcode/sequence extraction, format: <type>,<read>:<start>-<end> separated by ';'. Types: cb (cell barcode), R1 (sequence from Read1), R2 (sequence from Read2). Example:
                        "cb,R1:1-10;cb,R1:11-20;R1,R1:21-70;R2,R2:1-50".

Filtering Settings:
  --forcecells <INT>    Force pipeline to use exactly this number of cells, overriding detection (e.g., 5000).
  --frags_cutoff <INT>  Minimum number of unique fragments to retain a cell [default: 1000].
  --tss_cutoff <FLOAT>  Minimum TSS proportion threshold to retain a cell [default: 0] (e.g., 0.2).
  --jaccard_cutoff <FLOAT>
                        Jaccard similarity threshold for merging beads (e.g., 0.02).
  --merge_cutoff <INT>  Minimum number of fragments when merging beads [default: 1000].

Analysis Settings:
  --need_bam            Enable generation of BAM files containing aligned reads. Note: generating BAM files increases computational time and disk space usage.
  --sample_read_pairs <INT>
                        Subsample the specified number of read pairs from the input FASTQ files (e.g., 1000000).

📝 Parameter Description

🔴 Required Parameters

⚠️ Essential parameters that must be specified for a successful analysis

`-n, --name` (Required)

Provide a unique name for this analysis run.

Function: This name will be used as a prefix for all output files and the HTML report.
Display: In the final web report, this name will be shown as the Sample ID.

Default: None

Example:

--name sample_001

`-g, --genomeDir` (Required)

Specify the path to the reference genome directory.

Requirement: The directory must contain the index and annotation resources generated by the mkref command.
Content: Includes genome sequence, TSS file, alignment index, etc.

Default: None

Example:

--genomeDir /path/to/genome/database

🟢 Input File Parameters

📁 Choose one input method: Directory-based OR specify individual files

`--fastqs` (Method 1)

Specify the path to the directory containing all FASTQ files.

Function: The pipeline will automatically detect paired Read1 and Read2 files in the directory.
Note: This is a convenience option and cannot be used simultaneously with --fastq1 / --fastq2.

Default: None

Example:

--fastqs ./fastq_directory

`-1, --fastq1` (Method 2A)

Specify one or more Read1 FASTQ files individually.

Support: You can use wildcards (*) to match files or a comma-separated list for multiple files.
Requirement: Must be used in pairs with the --fastq2 parameter, and the file order must match exactly.

Default: None

Example:

--fastq1 sample1_L01_R1.fastq.gz,sample1_L02_R1.fastq.gz

`-2, --fastq2` (Method 2B)

Specify one or more Read2 FASTQ files individually.

Support: You can use wildcards (*) to match files or a comma-separated list for multiple files.
Requirement: Must be used in pairs with the --fastq1 parameter, and the file order must match exactly.

Default: None

Example:

--fastq2 sample1_L01_R2.fastq.gz,sample1_L02_R2.fastq.gz

⚠️ Input Method Selection:

🔸 Method 1: Use --fastqs to specify a directory containing paired FASTQ files.

🔸 Method 2: Use -1, --fastq1 and -2, --fastq2 to specify R1 and R2 files respectively.

⚠️ Important Note: All files under a parameter must come from the same library, with consistent sequencing mode and dark reaction settings. Data from different libraries cannot be merged for analysis.
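
The pairing requirement for Method 2 can be checked before launching a run. The following is a minimal Python sketch, not part of dnbc4tools: the helper `pair_fastqs` is hypothetical and assumes the `_R1.` / `_R2.` naming used in the examples above. It builds comma-separated values for --fastq1 and --fastq2 in matching order and fails if any file lacks a mate.

```python
def pair_fastqs(names):
    """Return (fastq1_value, fastq2_value) as comma-separated strings in matching order.

    Assumes mates differ only by the "_R1." / "_R2." token in the file name.
    """
    available = set(names)
    read1 = sorted(n for n in available if "_R1." in n)
    # Derive the expected Read2 name for every Read1 file, preserving order.
    read2 = [n.replace("_R1.", "_R2.") for n in read1]

    missing = [n for n in read2 if n not in available]
    if missing:
        raise ValueError(f"Read2 mate not found: {missing}")

    orphans = sorted(n for n in available if "_R2." in n and n not in read2)
    if orphans:
        raise ValueError(f"Read2 file without a Read1 mate: {orphans}")

    return ",".join(read1), ",".join(read2)


if __name__ == "__main__":
    fq1, fq2 = pair_fastqs([
        "sample1_L02_R2.fastq.gz",
        "sample1_L01_R1.fastq.gz",
        "sample1_L01_R2.fastq.gz",
        "sample1_L02_R1.fastq.gz",
    ])
    print("--fastq1", fq1)
    print("--fastq2", fq2)
```

Sorting the Read1 names and deriving the Read2 names from them keeps both lists in the same lane order, which is what the "file order must match exactly" requirement asks for.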

🟢 Basic Settings

`-o, --outdir` (Optional)

Specify the output directory for all analysis results and reports.

Function: All analysis results will be saved in this directory, and the pipeline will automatically create a structured subdirectory named after the sample.

Default: ./ (current directory)

Example:

--outdir ./output_results

`-t, --threads` (Optional)

Set the number of CPU threads to be used during the analysis.

Function: Increasing the number of threads can significantly speed up the analysis.
Recommendation: Adjust based on the number of available CPU cores for optimal performance.

Default: 10

Example:

--threads 16

🟢 Library Settings

`--darkreaction` (Optional)

Configure the dark cycle settings for the ATAC library to ensure accurate cell barcode identification.

Function: Guides the software to correctly parse dark reaction cycles generated by the sequencing chemistry (e.g., on MGI platforms).
Smart Detection: By default, the software automatically detects data characteristics to select the appropriate mode. Highly recommended for initial analysis.

Detailed Configuration Options

Option	Description	Use Case
`auto`	(Default) Automatically detects dark cycle configuration and applies the optimal setting based on library type.	Applicable to all standard ATAC sequencing data.
`R1R2`	Both Read1 and Read2 contain dark cycle bases.	Applicable for dual-end dark cycle sequencing designs.
`R1`	Only the Read1 end contains dark cycle bases.	Applicable for single-end dark cycle (Read1 direction) sequencing designs.
`R2`	Only the Read2 end contains dark cycle bases.	Applicable for single-end dark cycle (Read2 direction) sequencing designs.
`unset`	The library contains no dark cycle bases; no dark cycle correction is performed.	Applicable for non-MGI platforms or sequencing designs without dark cycles.

Examples:

# Scenario 1: Initial analysis, using auto-detection
dnbc4tools atac run --name sample1 --fastqs ./fq --genomeDir ./ref

# Scenario 2: The library is known to have dark cycles only on R1, and auto-detection fails or selects the wrong mode
dnbc4tools atac run --name sample2 --fastqs ./fq --genomeDir ./ref --darkreaction R1

⚠️ Important Note: Incorrect settings can lead to cell barcode identification failure or loss of sequence information. Specify manually only if you understand the library structure or if auto-detection fails.

`--customize` (Advanced)

Precisely define the extraction structure for barcodes and effective sequences (reads) for non-standard libraries.

Function: This parameter provides ultimate control when the preset modes of --darkreaction are not applicable. It will override any --darkreaction settings.
Syntax: "<type>,<read>:<start>-<end>;...", with multiple segments separated by semicolons (;) inside a single quoted string. Coordinates are 1-based.

Parameter Type Details

Type	Description	Example
`cb`	Cell Barcode	`cb,R1:1-10`
`R1`	Effective DNA sequence in Read1	`R1,R1:21-70`
`R2`	Effective DNA sequence in Read2	`R2,R2:1-50`

Examples:

# Example 1: Assume R1 structure is: Barcode 1 (10bp) -> Barcode 2 (10bp) -> Insert (50bp). R2 structure is: Insert (50bp).
--customize "cb,R1:1-10;cb,R1:11-20;R1,R1:21-70;R2,R2:1-50"

# Example 2: Assume R1 structure is: Fixed (6bp) -> Barcode 1 (10bp) -> Fixed (6bp) -> Barcode 2 (10bp) -> Fixed (33bp) -> Insert (50bp). R2 structure is: Fixed (19bp) -> Insert (50bp).
--customize "cb,R1:7-16;cb,R1:23-32;R1,R1:66-115;R2,R2:20-69"

⚠️ Notes:

Must use quotes: The entire string must be enclosed in double quotes; otherwise the shell interprets the semicolons as command separators.
Accurate coordinates: The coordinate range cannot exceed the actual read length in the FASTQ file, or it will cause a parsing failure.
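
To make the syntax and the coordinate rule concrete, here is a minimal Python sketch that parses a --customize string and rejects segments that run past the read length. It is an illustration only and not the parser used by dnbc4tools; `Segment`, `parse_customize`, and `extract` are hypothetical names, and the read lengths are supplied by the caller.

```python
import re
from typing import NamedTuple


class Segment(NamedTuple):
    kind: str   # "cb", "R1" or "R2"
    read: str   # read the segment is taken from: "R1" or "R2"
    start: int  # 1-based, inclusive
    end: int    # 1-based, inclusive


_SEGMENT = re.compile(r"^(cb|R1|R2),(R1|R2):(\d+)-(\d+)$")


def parse_customize(spec, read_lengths):
    """Parse '<type>,<read>:<start>-<end>;...' and validate against read lengths."""
    segments = []
    for part in spec.split(";"):
        match = _SEGMENT.match(part.strip())
        if match is None:
            raise ValueError(f"malformed segment: {part!r}")
        kind, read = match[1], match[2]
        start, end = int(match[3]), int(match[4])
        if start < 1 or end < start:
            raise ValueError(f"invalid coordinates in segment: {part!r}")
        if end > read_lengths[read]:
            raise ValueError(f"segment {part!r} exceeds {read} length {read_lengths[read]}")
        segments.append(Segment(kind, read, start, end))
    return segments


def extract(segment, sequence):
    """Slice a read sequence using the segment's 1-based inclusive coordinates."""
    return sequence[segment.start - 1:segment.end]


if __name__ == "__main__":
    spec = "cb,R1:7-16;cb,R1:23-32;R1,R1:66-115;R2,R2:20-69"
    for seg in parse_customize(spec, {"R1": 115, "R2": 69}):
        print(seg, "length:", seg.end - seg.start + 1)
```

Run against Example 2, the sketch reports two 10 bp barcode segments and two 50 bp inserts, matching the structure described in the comment.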

🟢 Filtering Settings

`--forcecells` (Optional)

Force the pipeline to use an exact number of cells, overriding the software's automatic cell detection.

Function: Use when you want to analyze a cell population of a known quantity.
Priority: This is the highest-priority filtering parameter.

Default: None

Example:

# Force the output of 5000 cells for analysis
dnbc4tools atac run --name sample1 --fastqs ./fq --genomeDir ./ref --forcecells 5000

`--frags_cutoff` (Optional)

Set the minimum number of unique fragments to retain a cell.

Function: This is a core cell quality control parameter. Cells below this threshold are considered to have poor data quality and will be excluded from subsequent analysis.
Recommendation: Use the default value for the initial analysis, then determine a more appropriate threshold based on the "Fragments Count Distribution" plot in the "TSS Targeting" section of the web report.

Default: 1000

Example:

# Lower the fragment threshold for cell filtering to 500
dnbc4tools atac run --name sample1 --fastqs ./fq --genomeDir ./ref --frags_cutoff 500

`--tss_cutoff` (Optional)

Set the minimum proportion of fragments in TSS regions to retain a cell.

Function: TSS enrichment is a key quality metric for ATAC-seq data. Setting this threshold can effectively exclude low-quality cells caused by technical issues like cell damage or nuclear lysis.

Default: 0 (no filtering)

Example:

# Filter out cells with a TSS region fragment proportion below 0.1
dnbc4tools atac run --name sample1 --fastqs ./fq --genomeDir ./ref --tss_cutoff 0.1
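
How the two thresholds combine can be sketched in a few lines of Python. This is an assumption-laden illustration, not the pipeline's code: `passes_qc` is a hypothetical helper, the TSS proportion is taken to be TSS-overlapping fragments divided by unique fragments, and a cell sitting exactly on a threshold is assumed to be retained.

```python
def passes_qc(unique_fragments, tss_fragments, frags_cutoff=1000, tss_cutoff=0.0):
    """Return True if a cell meets both the fragment and TSS-proportion minimums."""
    if unique_fragments < frags_cutoff:
        return False
    # Assumed definition: share of the cell's unique fragments that overlap TSS regions.
    tss_proportion = tss_fragments / unique_fragments if unique_fragments else 0.0
    return tss_proportion >= tss_cutoff


if __name__ == "__main__":
    cells = {"cellA": (1500, 300), "cellB": (800, 400), "cellC": (1500, 100)}
    for name, (frags, tss) in cells.items():
        print(name, passes_qc(frags, tss, frags_cutoff=1000, tss_cutoff=0.1))
```

With the default --tss_cutoff of 0, the second condition never rejects a cell, which is why the default is described as "no filtering".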

`--jaccard_cutoff` (Optional)

The Jaccard similarity threshold for merging multiple barcodes (beads) that potentially belong to the same cell.

Function: Corrects for "duplicate" cell barcodes arising from loading or amplification biases, based on the similarity of chromatin accessibility patterns.
Mode: Supports a manually set threshold, or auto, which lets the software determine the threshold automatically using Otsu's method.

Default: auto

Example:

# Manually set the Jaccard similarity threshold to 0.02
dnbc4tools atac run --name sample1 --fastqs ./fq --genomeDir ./ref --jaccard_cutoff 0.02

`--merge_cutoff` (Optional)

Set the minimum number of fragments required for a bead to be included in the Jaccard merging process.

Function: Only beads with fragment counts above this threshold will be included in the Jaccard similarity calculation and merging process. The merged valid cell fragments will be used for subsequent peak calling.
Effect: Filters out low-quality beads before merging, improving the accuracy and efficiency of the merge.
Recommendation: For samples with low total fragments, you can lower this value appropriately to include more beads for merging, thereby obtaining more valid fragments for subsequent analysis.

Default: 500

Example:

# For a low-fragment sample, lower the threshold to 200 to include more beads for merging
dnbc4tools atac run --name sample1 --fastqs ./fq --genomeDir ./ref --merge_cutoff 200
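
The interaction between --merge_cutoff and --jaccard_cutoff can be illustrated with a small Python sketch. It rests on several assumptions that the text above does not confirm: each bead is represented as a set of fragment identifiers, a bead with exactly the cutoff number of fragments is eligible, and a pair is a merge candidate when its Jaccard index reaches the cutoff. The function names are hypothetical and the actual pipeline may compute similarity differently.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard index |A ∩ B| / |A ∪ B| of two sets (0.0 if both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def merge_candidates(beads, merge_cutoff, jaccard_cutoff):
    """Return (bead1, bead2, score) for eligible bead pairs with score >= jaccard_cutoff.

    beads: dict mapping bead barcode -> set of fragment identifiers.
    """
    # Step 1: drop beads with too few fragments before any pairwise comparison.
    eligible = {name: frags for name, frags in beads.items() if len(frags) >= merge_cutoff}
    # Step 2: score every remaining pair and keep those above the similarity threshold.
    pairs = []
    for first, second in combinations(sorted(eligible), 2):
        score = jaccard(eligible[first], eligible[second])
        if score >= jaccard_cutoff:
            pairs.append((first, second, score))
    return pairs


if __name__ == "__main__":
    toy_beads = {
        "bead1": {1, 2, 3, 4},
        "bead2": {3, 4, 5, 6},
        "bead3": {9},
    }
    print(merge_candidates(toy_beads, merge_cutoff=2, jaccard_cutoff=0.3))
```

Because the pre-filter runs first, lowering --merge_cutoff enlarges the set of beads that are compared at all, which is the effect described in the recommendation above.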

🚩 Analysis Settings

`--need_bam` (Flag)

Enable the generation of BAM format files.

Function: Generates a BAM file containing all aligned reads with valid barcodes, which can be used for visualization in tools like IGV or for other custom analyses.
Note: Enabling this option will significantly increase computation time and disk space usage, with an expected runtime increase of 30-50%. Additionally, due to differences in how the chromap aligner generates BAM files versus directly outputting BED files, the final results may vary slightly.

Default: If not set, BAM files are not generated.

`--sample_read_pairs` (Optional)

Extract a specified number of read pairs from the input FASTQ files for analysis.

Function: Used for quick testing of large datasets before a full analysis, or for down-sampling analysis when resources are limited.

Default: None (uses all data)

Example:

--sample_read_pairs 100000000

💡 Analysis Recommendation

For the initial analysis, it is recommended to use the default parameters and then adjust them as needed based on the results report.

📊 Reference Database Construction (mkref)

📊 Usage

$ dnbc4tools atac mkref --help
Usage: dnbc4tools atac mkref [OPTIONS]
optional arguments:
  -h, --help           show this help message and exit

Input files:
  Input genome FASTA and gene annotation GTF files. For mixed species analysis, use comma to separate multiple files.

  --fasta <FILE>       Path to reference genome FASTA file. Multiple files separated by comma
  --ingtf <FILE>       Path to gene annotation GTF file. Multiple files separated by comma

Basic settings:
  --genomeDir <DIR>    Output directory for reference files [default: current directory]
  --species <STR>      Species identifier. For mixed species analysis, use comma separated [default: undefined]

Advanced settings:
  --tag <TYPE>         Select type to generate BED file [default: transcript]
  --chrM <STR>         Mitochondrial chromosome identifier in reference genome [default: auto]
  --chloroplast <STR>  Chloroplast chromosome name, particularly recommended for plants, e.g. "Pt"
  --prefix <STR>       Filter chromosomes by prefix or full name. Not supported for mixed species
  --kmer <INT>         k-mer length, this determines the size of the substrings being extracted [default: 17]
  --window <INT>       Window size, this defines the number of consecutive k-mers within a window [default: 7]
  --noindex            Only generate ref.json without building genome index

📝 Parameter Description

🔴 Required Parameters

`--fasta` (Required)

Provide the reference genome sequence file.

Requirement: Standard FASTA format, primary assembly version is recommended.
Dual-species: Supports providing two comma-separated FASTA files for mixed-species analysis.

Default: None

Example:

--fasta Homo_sapiens.GRCh38.dna.primary_assembly.fa

`--ingtf` (Required)

Provide the gene structure annotation file.

Requirement: Standard GTF format; the file must contain entries of feature type gene and transcript.
Function: Used to define TSS (Transcription Start Sites) and promoter regions.

Default: None

Example:

--ingtf Homo_sapiens.GRCh38.108.gtf

🟢 Output Settings

`--genomeDir` (Optional)

Specify the output directory for the generated reference database.

Function: All generated reference files (index, annotations, etc.) will be stored in this directory.

Example Output Directory Structure

<genomeDir/species>/
  ├── fasta/
  │   ├── genome.fa
  │   └── genome.index
  ├── genes/
  │   └── genes.gtf
  ├── regions/
  │   ├── chrom.sizes
  │   ├── promoter.bed
  │   └── tss.bed
  └── ref.json

Default: ./ (current directory)

Example:

dnbc4tools atac mkref --fasta genome.fa --ingtf genes.gtf --genomeDir /database/scATAC/GRCh38

`--species` (Optional)

Specify a species name for the reference database.

Function: This name is recorded in the configuration file for easy identification later.
Recommendation: Use a standard scientific name format, such as Homo_sapiens.

Default: undefined

Example:

dnbc4tools atac mkref --fasta genome.fa --ingtf genes.gtf --species Homo_sapiens

🟢 Genome Settings

`--tag` (Optional)

Select the source of information for generating the TSS (Transcription Start Site) file.

Options: gene (uses gene start sites) or transcript (uses transcript start sites).
Recommendation: Using transcript mode can yield more accurate TSS enrichment analysis results.

Default: transcript

Example:

# Generate TSS file based on transcript start sites
dnbc4tools atac mkref --fasta genome.fa --ingtf genes.gtf --tag transcript
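
The difference between the two modes can be sketched in Python. This is an illustration under stated assumptions, not the mkref implementation: `tss_from_gtf` is a hypothetical helper, and the BED-like output layout (chromosome, 0-based start, end, strand) is assumed rather than taken from the actual tss.bed file. What is standard is that GTF coordinates are 1-based and inclusive, and that the start site of a minus-strand feature is its end coordinate.

```python
def tss_from_gtf(lines, tag="transcript"):
    """Yield (chrom, start0, end0, strand) for each GTF feature whose type equals `tag`."""
    for line in lines:
        if line.startswith("#"):
            continue  # skip GTF header/comment lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9 or fields[2] != tag:
            continue
        chrom, strand = fields[0], fields[6]
        start, end = int(fields[3]), int(fields[4])
        # The start site is the 5' end: `start` on "+", `end` on "-".
        tss = end if strand == "-" else start
        # Convert the 1-based position to a 0-based, half-open BED interval.
        yield chrom, tss - 1, tss, strand


if __name__ == "__main__":
    gtf = [
        '#!genome-build example',
        'chr1\tsrc\tgene\t1000\t5000\t.\t+\t.\tgene_id "g1";',
        'chr1\tsrc\ttranscript\t1200\t5000\t.\t+\t.\tgene_id "g1"; transcript_id "t1";',
        'chr2\tsrc\ttranscript\t300\t900\t.\t-\t.\tgene_id "g2"; transcript_id "t2";',
    ]
    print(list(tss_from_gtf(gtf, tag="transcript")))
    print(list(tss_from_gtf(gtf, tag="gene")))
```

A gene with several transcripts contributes one site in gene mode but one site per transcript in transcript mode, which is why transcript mode captures alternative start sites.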

`--chrM` (Optional)

Specify the name of the mitochondrial chromosome.

Function: Used for cell quality control. An excess of mitochondrial fragments usually indicates poor cell quality, and including mitochondrial fragments in the analysis would skew the fragment statistics for TSS and peak regions.
Auto-detection: By default, it will automatically identify from common names (e.g., chrM, MT).

Default: auto

Example:

# If the mitochondrial chromosome name is "mitochondrion"
dnbc4tools atac mkref --fasta genome.fa --ingtf genes.gtf --chrM mitochondrion

`--chloroplast` (Plant-specific)

Specify the name of the chloroplast chromosome, recommended for plant samples.

Function: Used for quality control specific to plant samples. Including chloroplast fragments in the analysis would skew the fragment statistics for TSS and peak regions.

Default: None

Example:

# Specify chloroplast chromosome name for Arabidopsis genome
dnbc4tools atac mkref --fasta TAIR10.fa --ingtf Athaliana.gtf --chloroplast Pt

`--kmer` (Optional)

Set the k-mer length used during Chromap index construction.

Function: Affects the accuracy, speed, and memory usage of alignment.
Recommendation: For standard analysis, the default value is usually the best choice. If you encounter out-of-memory errors, you can try lowering this value.

Default: 17

Example:

# Lower k-mer length to reduce memory usage
dnbc4tools atac mkref --fasta genome.fa --ingtf genes.gtf --kmer 15

`--window` (Optional)

Set the window size used during Chromap index construction.

Function: Defines the number of consecutive k-mers within a window, affecting the sensitivity and specificity of alignment.
Recommendation: Usually adjusted in conjunction with the --kmer parameter for optimal results.

Default: 7

Example:

# Adjust window size
dnbc4tools atac mkref --fasta genome.fa --ingtf genes.gtf --window 5
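
The relationship between --kmer and --window is easier to see with a toy example. The sketch below assumes the index follows a minimizer scheme, in which one representative k-mer is kept per window of consecutive k-mers, as the help text's wording suggests. It is a simplified illustration: it picks the lexicographically smallest k-mer, whereas real aligners typically rank k-mers by a hash, and `minimizers` is a hypothetical function, not a Chromap API.

```python
def minimizers(seq, k, w):
    """Return sorted (position, kmer) pairs: the smallest k-mer of every window of w k-mers."""
    kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    picked = set()
    for start in range(len(kmers) - w + 1):
        window = kmers[start:start + w]
        # Leftmost smallest k-mer in this window (lexicographic order for simplicity).
        offset = min(range(w), key=lambda j: window[j])
        picked.add((start + offset, window[offset]))
    return sorted(picked)


if __name__ == "__main__":
    sequence = "ACGTACGT"
    print(minimizers(sequence, k=3, w=2))
    print(minimizers(sequence, k=3, w=3))
```

On this toy sequence the larger window selects fewer k-mers (three instead of four), which illustrates why the two parameters are usually tuned together: both change how many k-mers the index has to store.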

`--noindex` (Flag)

If this parameter is set, it will only generate the configuration file without building the genome index.

Function: Use this parameter to skip the time-consuming index construction step when the index files already exist.

Default: Not set

Example:

# Generate only the configuration file, do not build the index
dnbc4tools atac mkref --fasta genome.fa --ingtf genes.gtf --noindex

Tip

📋 Database Construction Notes:

Databases built with Chromap currently cannot handle extremely large genomes. For such species, index construction may only succeed after adjusting the --kmer and --window parameters, and some species may not be suitable for scATAC analysis with this software at all.
After database construction is complete, a ref.json file will be generated in the database directory to record key information.

📋 ref.json File Example:

{
    "species": "Homo_sapiens",
    "input_fasta_files": [
        "genome.fa"
    ],
    "input_gtf_files": [
        "genes.gtf"
    ],
    "genome": "/database/scATAC/Homo_sapiens/fasta/genome.fa",
    "index": "/database/scATAC/Homo_sapiens/fasta/genome.index",
    "gtf": "/database/scATAC/Homo_sapiens/genes/genes.gtf",
    "chrmt": "chrM",
    "chloroplast": "None",
    "chromeSize": "/database/scATAC/Homo_sapiens/regions/chrom.sizes",
    "tss": "/database/scATAC/Homo_sapiens/regions/tss.bed",
    "promoter": "/database/scATAC/Homo_sapiens/regions/promoter.bed",
    "version": "3.0beta",
    "blacklist": "None",
    "genomesize": "hs"
}
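
A quick sanity check of a generated ref.json can be scripted. The Python sketch below is an illustration only: the list of expected keys is an assumption derived from the example above rather than a documented schema, and `load_ref` and `optional_value` are hypothetical helpers. It also shows that unset entries appear as the string "None" rather than JSON null.

```python
import json

# Keys assumed to be needed downstream, taken from the example file above.
EXPECTED_KEYS = (
    "species", "genome", "index", "gtf", "chrmt",
    "chromeSize", "tss", "promoter", "genomesize",
)


def load_ref(text):
    """Parse ref.json content and fail early if an expected key is absent."""
    ref = json.loads(text)
    missing = [key for key in EXPECTED_KEYS if key not in ref]
    if missing:
        raise KeyError(f"ref.json is missing keys: {missing}")
    return ref


def optional_value(ref, key):
    """Return the value for key, mapping an absent key or the string "None" to None."""
    value = ref.get(key)
    return None if value in (None, "None") else value


if __name__ == "__main__":
    import sys

    with open(sys.argv[1], encoding="utf-8") as handle:
        reference = load_ref(handle.read())
    print("mitochondrial chromosome:", reference["chrmt"])
    print("chloroplast chromosome:", optional_value(reference, "chloroplast"))
```

Saved as a script (the file name is arbitrary), it takes the path to ref.json as its only argument.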

📋 Important Notes:

Only chromosomes listed in the chromeSize file are retained in the fragments.tsv.gz file for analysis; chromosomes that are not listed are excluded.
As of version 2.1.2, the blacklist parameter has been removed and a blacklist file is no longer required. A blacklist file can still be added manually if needed.
The number of fragments in blacklist regions will be recorded in the blacklist_region_fragments column of the metadata file output/singlecell.csv.
The genomesize value is used for MACS2 peak calling analysis. MACS2 has special identifiers for certain species, such as "hs" for humans.

📋 Multi-sample Operations (multi)

📊 Usage

$ dnbc4tools atac multi --help
Usage: dnbc4tools atac multi [OPTIONS]
optional arguments:
  -h, --help         show this help message and exit
  --list <STR>       Path to the sample list file. Each line should contain sample name and FASTQ paths.
  --outdir <DIR>     Output directory. [default: current directory].
  --threads <INT>    Number of threads used for analysis.
  --genomeDir <DIR>  Path to the directory where genome files are stored.

📝 Parameter Description

🔴 Required Parameters

`--list` (Required)

Specify the path to the list file containing information for multiple samples.

File Format: Tab-separated (\t) text file, UTF-8 encoding recommended.
Column Structure: The first column is the sample name, and the second column is the path to the corresponding FASTQ data for that sample.

Path Format Rules

Multiple fastq files: Use commas (,) to separate.
R1 and R2 files: Use semicolons (;) to separate.
Path Type: Both absolute and relative paths are supported.

File Content Example

# Scenario 1: Sample A, with one pair of R1/R2 files
SampleA	/path/to/SampleA_R1.fastq.gz;/path/to/SampleA_R2.fastq.gz

# Scenario 2: Sample B, with two pairs of R1/R2 files (files for the same Read are comma-separated)
SampleB	/path/to/B_L01_R1.fq.gz,/path/to/B_L02_R1.fq.gz;/path/to/B_L01_R2.fq.gz,/path/to/B_L02_R2.fq.gz

Default: None
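
The separator rules above can be checked with a short Python sketch before handing the file to multi. It is an illustration only and not the parser used by dnbc4tools: `parse_list_line` and `format_list_line` are hypothetical helpers that handle a single data line (sample name, a tab, then Read1 paths and Read2 paths separated by a semicolon).

```python
def parse_list_line(line):
    """Split one sample-list line into (sample, read1_paths, read2_paths)."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 2:
        raise ValueError("expected two tab-separated columns: sample name and FASTQ paths")
    sample, paths = fields
    parts = paths.split(";")
    if len(parts) != 2:
        raise ValueError("expected Read1 and Read2 paths separated by one ';'")
    read1, read2 = parts[0].split(","), parts[1].split(",")
    if len(read1) != len(read2):
        raise ValueError(f"{sample}: {len(read1)} Read1 files but {len(read2)} Read2 files")
    return sample, read1, read2


def format_list_line(sample, read1, read2):
    """Build a sample-list line from a sample name and matching path lists."""
    if len(read1) != len(read2):
        raise ValueError("Read1 and Read2 lists must have the same length")
    return f"{sample}\t{','.join(read1)};{','.join(read2)}"


if __name__ == "__main__":
    line = format_list_line(
        "SampleB",
        ["/path/to/B_L01_R1.fq.gz", "/path/to/B_L02_R1.fq.gz"],
        ["/path/to/B_L01_R2.fq.gz", "/path/to/B_L02_R2.fq.gz"],
    )
    print(line)
    print(parse_list_line(line))
```

Generating the lines programmatically avoids the most common mistake with this format, which is mixing up the comma (files of the same read) and the semicolon (Read1 versus Read2).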

📝 Parameter Inheritance Note

For other analysis parameter settings, please refer to the corresponding parameters of the dnbc4tools atac run command.

💡 Tip

This document is continuously updated. If you find any errors or have information to add, your feedback is welcome.

📝 Document Version: 3.0 beta | Last Updated: 2025

🔬 DNBelab C Series HT scATAC Analysis Software
High-performance single-cell ATAC sequencing data analysis pipeline
