🏠 Home β€’ δΈ­ζ–‡

🧬 DNBelab C Series HT scATAC Analysis Parameters

πŸ”¬ Main Analysis Pipeline (run) β€’ πŸ“Š Reference Database Construction (mkref) β€’ πŸ“‹ Multi-sample Operations (multi)


🔬 Main Analysis Pipeline (run)

📊 Usage

$ dnbc4tools atac run --help
Usage: dnbc4tools atac run [OPTIONS]

optional arguments:
  -h, --help            show this help message and exit

Input Files:
  Choose ONE input method: either --fastqs (directory) OR individual FASTQ files (-1 and -2).

  --fastqs <DIR>        Input directory containing paired-end FASTQ files. The pipeline automatically detects Read1/Read2 files. Example: ./fastq_dir
  -1, --fastq1 <FILE> [<FILE> ...]
                        Read1 FASTQ file(s) for the ATAC library (supports wildcards and comma-separated lists). Example: sample1_L01_R1.fastq.gz,sample1_L02_R1.fastq.gz
  -2, --fastq2 <FILE> [<FILE> ...]
                        Read2 FASTQ file(s) for the ATAC library (supports wildcards and comma-separated lists). Must match --fastq1 order. Example: sample1_L01_R2.fastq.gz,sample1_L02_R2.fastq.gz

Basic Settings:
  -n, --name <STR>      Unique identifier for the sample (e.g., sample1). Used for naming output files and reports.
  -g, --genomeDir <DIR>
                        Path to reference genome directory. Must contain the required index and annotation resources.
  -o, --outdir <DIR>    Output directory for results and reports [default: current directory]. Example: ./output
  -t, --threads <INT>   Number of CPU threads for parallel processing [default: 10].

Library Settings:
  Configure sequencing library settings and dark cycles.
  Auto-detection is recommended for dark cycles.
  Use --customize to specify sequence structure patterns when needed.

  --darkreaction <STR>  Dark cycle setting for ATAC library [default: auto]. Options: auto (automatic detection), R1R2 (both reads), R1 (Read1 only), R2 (Read2 only), unset (no dark cycles).
  --customize <STR>     Customize read structure for barcode/sequence extraction, format: <type>,<read>:<start>-<end> separated by ';'. Types: cb (cell barcode), R1 (sequence from Read1), R2 (sequence from Read2). Example:
                        "cb,R1:1-10;cb,R1:11-20;R1,R1:21-70;R2,R2:1-50".

Filtering Settings:
  --forcecells <INT>    Force pipeline to use exactly this number of cells, overriding detection (e.g., 5000).
  --frags_cutoff <INT>  Minimum number of unique fragments to retain a cell [default: 1000].
  --tss_cutoff <FLOAT>  Minimum TSS proportion threshold to retain a cell [default: 0] (e.g., 0.2).
  --jaccard_cutoff <FLOAT>
                        Jaccard similarity threshold for merging beads (e.g., 0.02).
  --merge_cutoff <INT>  Minimum number of fragments when merging beads [default: 1000].

Analysis Settings:
  --need_bam            Enable generation of BAM files containing aligned reads. Note: generating BAM files increases computational time and disk space usage.
  --sample_read_pairs <INT>
                        Subsample the specified number of read pairs from the input FASTQ files (e.g., 1000000).

📝 Parameter Description

🔴 Required Parameters

⚠️ Essential parameters that must be specified for a successful analysis

-n, --name (Required)

Provide a unique identifier for the sample. It is used to name output files and reports.

Default: None

Example:

--name sample_001

-g, --genomeDir (Required)

Specify the path to the reference genome directory.

Default: None

Example:

--genomeDir /path/to/genome/database

🟒 Input File Parameters

πŸ“ Choose one input method: Directory-based OR specify individual files

--fastqs (Method 1)

Specify the path to the directory containing all FASTQ files.

Default: None

Example:

--fastqs ./fastq_directory

-1, --fastq1 (Method 2A)

Specify one or more Read1 FASTQ files individually.

Default: None

Example:

--fastq1 sample1_L01_R1.fastq.gz,sample1_L02_R1.fastq.gz

-2, --fastq2 (Method 2B)

Specify one or more Read2 FASTQ files individually, in the same order as the --fastq1 files.

Default: None

Example:

--fastq2 sample1_L01_R2.fastq.gz,sample1_L02_R2.fastq.gz

⚠️ Input Method Selection:

⚠️ Important Note: All files under a parameter must come from the same library, with consistent sequencing mode and dark reaction settings. Data from different libraries cannot be merged for analysis.


🟒 Basic Settings

-o, --outdir (Optional)

Specify the output directory for all analysis results and reports.

Default: ./ (current directory)

Example:

--outdir ./output_results

-t, --threads (Optional)

Set the number of CPU threads to be used during the analysis.

Default: 10

Example:

--threads 16

🟒 Library Settings

--darkreaction (Optional)

Configure the dark cycle settings for the ATAC library to ensure accurate cell barcode identification.

Examples:

# Scenario 1: Initial analysis, using auto-detection
dnbc4tools atac run --name sample1 --fastqs ./fq --genomeDir ./ref
# Scenario 2: Known library has dark cycles only on R1 and auto-analysis fails or identifies incorrectly
dnbc4tools atac run --name sample2 --fastqs ./fq --genomeDir ./ref --darkreaction R1

⚠️ Important Note: Incorrect settings can lead to cell barcode identification failure or loss of sequence information. Specify manually only if you understand the library structure or if auto-detection fails.

--customize (Advanced)

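Define exactly which positions hold the cell barcodes and the usable read sequence in non-standard libraries. Each segment is written as <type>,<read>:<start>-<end>, and segments are separated by ';'. Types: cb (cell barcode), R1 (sequence from Read1), R2 (sequence from Read2).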

Examples:

# Example 1: Assume R1 structure is: Barcode 1 (10bp) -> Barcode 2 (10bp) -> Insert (50bp). R2 structure is: Insert (50bp).
--customize "cb,R1:1-10;cb,R1:11-20;R1,R1:21-70;R2,R2:1-50"
# Example 2: Assume R1 structure is: Fixed (6bp) -> Barcode 1 (10bp) -> Fixed (6bp) -> Barcode 2 (10bp) -> Fixed (33bp) -> Insert (50bp). R2 structure is: Fixed (19bp) -> Insert (50bp).
--customize "cb,R1:7-16;cb,R1:23-32;R1,R1:66-115;R2,R2:20-69"
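Before passing a --customize string to the pipeline, it can help to list each segment with its length and compare it against the library design. The sketch below assumes 1-based, inclusive coordinates, which is what the two examples above imply (1-10 covers a 10 bp barcode); `describe_customize` is a hypothetical helper, not a dnbc4tools command.

```shell
# Hypothetical helper: print every segment of a --customize string with its length,
# assuming 1-based inclusive coordinates.
describe_customize() {
    printf '%s\n' "$1" | tr ';' '\n' | while IFS=',:-' read -r kind read_name start end; do
        echo "$kind $read_name $start-$end length=$(( end - start + 1 ))"
    done
}

# Example 2 from above: two 10 bp barcodes on R1, then 50 bp inserts on R1 and R2.
describe_customize "cb,R1:7-16;cb,R1:23-32;R1,R1:66-115;R2,R2:20-69"
```

For Example 2 this lists two 10 bp cb segments and two 50 bp sequence segments, matching the structure described in the comment.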

⚠️ Notes:


🟒 Filtering Settings

--forcecells (Optional)

Force the pipeline to use an exact number of cells, overriding the software's automatic cell detection.

Default: None

Example:

# Force the output of 5000 cells for analysis
dnbc4tools atac run --name sample1 --fastqs ./fq --genomeDir ./ref --forcecells 5000

--frags_cutoff (Optional)

Set the minimum number of unique fragments to retain a cell.

Default: 1000

Example:

# Lower the fragment threshold for cell filtering to 500
dnbc4tools atac run --name sample1 --fastqs ./fq --genomeDir ./ref --frags_cutoff 500

--tss_cutoff (Optional)

Set the minimum proportion of fragments in TSS regions to retain a cell.

Default: 0 (no filtering)

Example:

# Filter out cells with a TSS region fragment proportion below 0.1
dnbc4tools atac run --name sample1 --fastqs ./fq --genomeDir ./ref --tss_cutoff 0.1
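To see how --frags_cutoff and --tss_cutoff interact, the two threshold comparisons can be sketched on a toy table. This is an illustration only: the three-column layout (barcode, unique fragments, TSS proportion) is made up for the example and is not a dnbc4tools output format, `filter_cells` is a hypothetical helper, and treating both cutoffs as inclusive minimums is an assumption. The pipeline's actual cell calling involves more than these two comparisons.

```shell
# Hypothetical helper: print barcodes that meet both minimums.
# $1 = tab-separated table, $2 = fragment cutoff, $3 = TSS proportion cutoff
filter_cells() {
    awk -F '\t' -v frags="$2" -v tss="$3" '$2 >= frags && $3 >= tss { print $1 }' "$1"
}

table=$(mktemp)
printf 'CELL_A\t2500\t0.35\nCELL_B\t800\t0.40\nCELL_C\t1500\t0.05\n' > "$table"

filter_cells "$table" 1000 0      # default cutoffs: CELL_A and CELL_C pass
filter_cells "$table" 1000 0.1    # with a TSS cutoff of 0.1: only CELL_A passes
rm -f "$table"
```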

--jaccard_cutoff (Optional)

The Jaccard similarity threshold for merging multiple barcodes (beads) that potentially belong to the same cell.

Default: auto

Example:

# Manually set the Jaccard similarity threshold to 0.02
dnbc4tools atac run --name sample1 --fastqs ./fq --genomeDir ./ref --jaccard_cutoff 0.02
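For reference, the Jaccard similarity of two beads follows the standard set definition. Here A and B stand for the sets of fragments observed on each bead; how dnbc4tools builds these sets is not described in this document, so that part is an assumption.

```latex
J(A, B) = \frac{|A \cap B|}{|A \cup B|}, \qquad 0 \le J(A, B) \le 1
```

Under this definition, a cutoff of 0.02 corresponds to the shared fragments making up about 2% of all fragments seen on either bead.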

--merge_cutoff (Optional)

Set the minimum number of fragments required for a bead to be included in the Jaccard merging process.

Default: 1000

Example:

# For a low-fragment sample, lower the threshold to 200 to include more beads for merging
dnbc4tools atac run --name sample1 --fastqs ./fq --genomeDir ./ref --merge_cutoff 200

🚩 Analysis Settings

--need_bam (Flag)

Enable the generation of BAM format files.

Default: If not set, BAM files are not generated.

--sample_read_pairs (Optional)

Extract a specified number of read pairs from the input FASTQ files for analysis.

Default: None (uses all data)

Example:

--sample_read_pairs 100000000

💡 Analysis Recommendation

For an initial analysis, use the default parameters, then adjust them as needed based on the results report.


📊 Reference Database Construction (mkref)

📊 Usage

$ dnbc4tools atac mkref --help
Usage: dnbc4tools atac mkref [OPTIONS]
optional arguments:
  -h, --help           show this help message and exit

Input files:
  Input genome FASTA and gene annotation GTF files. For mixed species analysis, use comma to separate multiple files.

  --fasta <FILE>       Path to reference genome FASTA file. Multiple files separated by comma
  --ingtf <FILE>       Path to gene annotation GTF file. Multiple files separated by comma

Basic settings:
  --genomeDir <DIR>    Output directory for reference files [default: current directory]
  --species <STR>      Species identifier. For mixed species analysis, use comma separated [default: undefined]

Advanced settings:
  --tag <TYPE>         Select type to generate BED file [default: transcript]
  --chrM <STR>         Mitochondrial chromosome identifier in reference genome [default: auto]
  --chloroplast <STR>  Chloroplast chromosome name, particularly recommended for plants, e.g. "Pt"
  --prefix <STR>       Filter chromosomes by prefix or full name. Not supported for mixed species
  --kmer <INT>         k-mer length, this determines the size of the substrings being extracted [default: 17]
  --window <INT>       Window size, this defines the number of consecutive k-mers within a window [default: 7]
  --noindex            Only generate ref.json without building genome index

📝 Parameter Description

🔴 Required Parameters

--fasta (Required)

Provide the reference genome sequence file.

Default: None

Example:

--fasta Homo_sapiens.GRCh38.dna.primary_assembly.fa

--ingtf (Required)

Provide the gene structure annotation file.

Default: None

Example:

--ingtf Homo_sapiens.GRCh38.108.gtf

🟒 Output Settings

--genomeDir (Optional)

Specify the output directory for the generated reference database.

Default: ./ (current directory)

Example:

dnbc4tools atac mkref --fasta genome.fa --ingtf genes.gtf --genomeDir /database/scATAC/GRCh38

--species (Optional)

Specify a species name for the reference database.

Default: undefined

Example:

dnbc4tools atac mkref --fasta genome.fa --ingtf genes.gtf --species Homo_sapiens

🟒 Genome Settings

--tag (Optional)

Select the source of information for generating the TSS (Transcription Start Site) file.

Default: transcript

Example:

# Generate TSS file based on transcript start sites
dnbc4tools atac mkref --fasta genome.fa --ingtf genes.gtf --tag transcript

--chrM (Optional)

Specify the name of the mitochondrial chromosome.

Default: auto

Example:

# If the mitochondrial chromosome name is "mitochondrion"
dnbc4tools atac mkref --fasta genome.fa --ingtf genes.gtf --chrM mitochondrion

--chloroplast (Plant-specific)

Specify the name of the chloroplast chromosome, recommended for plant samples.

Default: None

Example:

# Specify chloroplast chromosome name for Arabidopsis genome
dnbc4tools atac mkref --fasta TAIR10.fa --ingtf Athaliana.gtf --chloroplast Pt

--kmer (Optional)

Set the k-mer length used during Chromap index construction.

Default: 17

Example:

# Lower k-mer length to reduce memory usage
dnbc4tools atac mkref --fasta genome.fa --ingtf genes.gtf --kmer 15

--window (Optional)

Set the window size used during Chromap index construction.

Default: 7

Example:

# Adjust window size
dnbc4tools atac mkref --fasta genome.fa --ingtf genes.gtf --window 5

--noindex (Flag)

When set, only the configuration file (ref.json) is generated; the genome index is not built.

Default: Not set

Example:

# Generate only the configuration file, do not build the index
dnbc4tools atac mkref --fasta genome.fa --ingtf genes.gtf --noindex

Tip

📋 Database Construction Notes:

📋 ref.json File Example:

{
    "species": "Homo_sapiens",
    "input_fasta_files": [
        "genome.fa"
    ],
    "input_gtf_files": [
        "genes.gtf"
    ],
    "genome": "/database/scATAC/Homo_sapiens/fasta/genome.fa",
    "index": "/database/scATAC/Homo_sapiens/fasta/genome.index",
    "gtf": "/database/scATAC/Homo_sapiens/genes/genes.gtf",
    "chrmt": "chrM",
    "chloroplast": "None",
    "chromeSize": "/database/scATAC/Homo_sapiens/regions/chrom.sizes",
    "tss": "/database/scATAC/Homo_sapiens/regions/tss.bed",
    "promoter": "/database/scATAC/Homo_sapiens/regions/promoter.bed",
    "version": "3.0beta",
    "blacklist": "None",
    "genomesize": "hs"
}
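When scripting around a reference directory, it is sometimes handy to read single fields back out of ref.json, for example to confirm which mitochondrial chromosome name was recorded. The following sed-based sketch assumes one "key": "value" pair per line, as in the example above; `ref_value` is a hypothetical helper, and a real JSON parser is the safer choice for anything more complex.

```shell
# Hypothetical helper: print the string value of a top-level key.
# $1 = path to ref.json, $2 = key name
ref_value() {
    sed -n 's/^[[:space:]]*"'"$2"'"[[:space:]]*:[[:space:]]*"\(.*\)",\{0,1\}[[:space:]]*$/\1/p' "$1"
}

# Demo on a shortened copy of the example file.
ref=$(mktemp)
cat > "$ref" <<'EOF'
{
    "species": "Homo_sapiens",
    "chrmt": "chrM",
    "genomesize": "hs"
}
EOF

ref_value "$ref" chrmt       # mitochondrial chromosome name
ref_value "$ref" species     # species identifier
rm -f "$ref"
```

List-valued keys such as input_fasta_files span several lines and are deliberately out of scope for this one-liner.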

📋 Important Notes:


📋 Multi-sample Operations (multi)

📊 Usage

$ dnbc4tools atac multi --help
Usage: dnbc4tools atac multi [OPTIONS]
optional arguments:
  -h, --help         show this help message and exit
  --list <STR>       Path to the sample list file. Each line should contain sample name and FASTQ paths.
  --outdir <DIR>     Output directory. [default: current directory].
  --threads <INT>    Number of threads used for analysis.
  --genomeDir <DIR>  Path to the directory where genome files are stored.

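📝 Parameter Description

🔴 Required Parameters

--list (Required)

Specify the path to the list file describing the samples. Each line holds a sample name and its FASTQ paths, separated by a tab. Read1 and Read2 paths are separated by a semicolon; multiple files for the same read are comma-separated.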

File Content Example
# Scenario 1: Sample A, with one pair of R1/R2 files
SampleA	/path/to/SampleA_R1.fastq.gz;/path/to/SampleA_R2.fastq.gz
# Scenario 2: Sample B, with two pairs of R1/R2 files (files for the same Read are comma-separated)
SampleB	/path/to/B_L01_R1.fq.gz,/path/to/B_L02_R1.fq.gz;/path/to/B_L01_R2.fq.gz,/path/to/B_L02_R2.fq.gz

Default: None
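A malformed list file is easy to produce by hand, so a small structural check before launching multi can save a failed batch. The sketch below is an assumption-laden helper, not part of dnbc4tools: `check_list` is a hypothetical name, it expects the tab and semicolon layout shown above, and it skips blank lines and lines starting with '#' purely as its own convention.

```shell
# Hypothetical helper: verify <name><TAB><R1 files>;<R2 files> on every line
# and that Read1 and Read2 list the same number of files.
check_list() {
    awk -F '\t' '
        /^#/ || NF == 0 { next }    # helper convention: ignore comments and blank lines
        {
            n = split($2, reads, ";")
            if (NF != 2 || n != 2) {
                print "line " NR ": expected <name><TAB><R1>;<R2>"; bad = 1; next
            }
            if (split(reads[1], a, ",") != split(reads[2], b, ",")) {
                print "line " NR ": Read1/Read2 file counts differ"; bad = 1
            }
        }
        END { exit bad }
    ' "$1"
}

# Build a two-sample list with printf so the tab separator is explicit.
list=$(mktemp)
printf '%s\t%s;%s\n' SampleA /path/to/SampleA_R1.fastq.gz /path/to/SampleA_R2.fastq.gz > "$list"
printf '%s\t%s;%s\n' SampleB \
    /path/to/B_L01_R1.fq.gz,/path/to/B_L02_R1.fq.gz \
    /path/to/B_L01_R2.fq.gz,/path/to/B_L02_R2.fq.gz >> "$list"

check_list "$list" && echo "list file looks consistent"
rm -f "$list"
```

The helper returns a non-zero exit status when any line fails, so it can gate the actual multi call in a wrapper script.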


📝 Parameter Inheritance Note

For all other analysis parameters, refer to the corresponding parameters of the dnbc4tools atac run command.


💡 Tip

This document is continuously updated. If you find any errors or have information to add, your feedback is welcome.

📝 Document Version: 3.0 beta | Last Updated: 2025


🔬 DNBelab C Series HT scATAC Analysis Software
High-performance single-cell ATAC sequencing data analysis pipeline