🧬 DNBelab C Series HT Tool-based Analysis Parameters

🛠️ GTF File Operations (mkgtf)

🧬 Core Functionality

A tool for GTF file operations: gene type statistics, gene type filtering, and file format validation. It produces standardized gene annotation data for single-cell analysis.

📊 Usage

$ dnbc4tools tools mkgtf -h

optional arguments:
  -h, --help            show this help message and exit

Basic Settings:
  --action <STR>        Select action type: 'mkgtf'(filter), 'stat'(statistics) or 'check'(validation) [default: mkgtf]
  --ingtf <FILE>        Path to input GTF annotation file
  --output <FILE>       Path to output file

Filter Settings:
  GTF file format requirements:
                  RNA analysis requires "gene"/"transcript" and "exon" types, plus gene_id/name and transcript_id/name attributes.

  --include <STR>       Set filter parameters in 'mkgtf' mode, multiple filters separated by commas. Default includes: protein_coding, lncRNA, lincRNA, antisense, IG_*/TR_* genes
  --type <STR>          Set according to gene type tag in GTF attributes [default: gene_biotype]
  --feature <STR>       Select information from feature column. If no 'gene' rows, select 'transcript' [default: gene]

📝 Parameter Description

🔴 Required Parameters

`--ingtf` (Required)

Specify the path to the input GTF gene annotation file.

Format Requirement: Standard GTF format. GFF or GFF3 formats are not supported.

Default: None

Example:

--ingtf Homo_sapiens.GRCh38.108.gtf

`--output` (Required)

Specify the output file for the processing results.

Function: Generates different types of output files depending on the operation mode.
Auto-creation: The specified output directory will be created automatically if it does not exist.

Default: None

Examples:

# When action is 'mkgtf' (filter)
--output ./filtered_genes.gtf

# When action is 'stat' (statistics)
--output ./gene_statistics.txt

# When action is 'check' (validation)
--output ./corrected.gtf

🟢 Optional Parameters

`--action` (Optional)

Select the type of operation to perform.

mkgtf: (Default) Filter the GTF file based on gene types.
stat: Count the gene types in the GTF file.
check: Validate and fix the GTF file format.

Default: mkgtf

Example:

--action stat

`--include` (Optional)

In mkgtf mode, specify the gene types to keep, separated by commas.

Function: Used to precisely filter for the gene sets you are interested in.

Default: protein_coding,lncRNA,lincRNA,antisense,IG_*,TR_*

Example:

--include protein_coding,lncRNA
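
The default list contains wildcard entries (`IG_*`, `TR_*`). As a minimal sketch of how such a comma-separated list could be matched against a gene type, the snippet below uses shell-style globbing. Glob semantics and the helper name `is_included` are assumptions made for illustration; this is not the tool's own code.

```python
from fnmatch import fnmatchcase

# Default value of --include as documented above.
DEFAULT_INCLUDE = "protein_coding,lncRNA,lincRNA,antisense,IG_*,TR_*"

def is_included(gene_type, include=DEFAULT_INCLUDE):
    """Return True if gene_type matches any comma-separated pattern.

    Assumption: '*' behaves like a shell-style wildcard.
    """
    patterns = [p.strip() for p in include.split(",") if p.strip()]
    return any(fnmatchcase(gene_type, pattern) for pattern in patterns)

for gene_type in ("protein_coding", "IG_V_gene", "TR_C_gene", "miRNA"):
    print(gene_type, is_included(gene_type))
```

Under this assumption a type such as `IG_V_pseudogene` would also match `IG_*`. Run `--action stat` first to see which gene types your GTF file actually contains.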

`--type` (Optional)

Specify the tag in the GTF attributes used to identify the gene type.

Function: Adapts to the annotation style of GTF files from different sources.

Default: gene_biotype

Example:

--type gene_type
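
To make the role of the tag concrete, here is a rough sketch that counts the values of one attribute tag over the rows of one feature type, which is the idea behind `--action stat`. It is an illustration only, not the tool's implementation; the function name `count_gene_types` and the sample rows are invented for the example.

```python
import re
from collections import Counter

# GTF attributes are written as: key "value";
ATTR_RE = re.compile(r'(\S+) "([^"]*)"')

def count_gene_types(lines, type_tag="gene_biotype", feature="gene"):
    """Count values of `type_tag` over rows whose feature column equals `feature`."""
    counts = Counter()
    for line in lines:
        if line.startswith("#"):
            continue  # skip header/comment lines
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9 or cols[2] != feature:
            continue  # not a 9-column row of the requested feature type
        attrs = dict(ATTR_RE.findall(cols[8]))
        if type_tag in attrs:
            counts[attrs[type_tag]] += 1
    return counts

example = [
    '1\thavana\tgene\t1\t100\t.\t+\t.\tgene_id "G1"; gene_biotype "protein_coding";',
    '1\thavana\tgene\t200\t300\t.\t-\t.\tgene_id "G2"; gene_biotype "lncRNA";',
    '1\thavana\texon\t1\t50\t.\t+\t.\tgene_id "G1"; gene_biotype "protein_coding";',
]
print(count_gene_types(example))
```

If the count comes back empty with the default tag, the file probably uses a different tag name (for example `gene_type`), which is the case `--type` is meant to handle.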

`--feature` (Optional)

Specify which feature type (the value in the feature column, the third column of the GTF file) to extract information from.

Function: Typically used to specify whether the operation is at the gene or transcript level.
Alternative: If the GTF file has no `gene` rows, select `transcript`.

Default: gene

Example:

--feature transcript

💡 Usage Examples

Count gene types:

dnbc4tools tools mkgtf --action stat --ingtf genes.gtf --output gtfstat.txt --type gene_biotype

Filter gene types:

dnbc4tools tools mkgtf --action mkgtf --ingtf genes.gtf --output genes.filter.gtf --type gene_biotype

Validate and fix GTF file:

dnbc4tools tools mkgtf --action check --ingtf genes.gtf --output corrected.gtf

📄 BAM to FASTQ (bam2fastq)

📄 Professional Conversion Tool

An efficient BAM file manipulation tool specialized for converting C4 RNA BAM files into FASTQ format. It supports multi-threaded parallel processing and flexible output configuration.

📊 Usage

$ bam2fastq --help
BAM to FASTQ Converter for C4 Single Cell RNA seq Data

Usage: bam2fastq [OPTIONS] <BAM> <OUTPUT>

Arguments:
  <BAM>     Path to the input BAM file
  <OUTPUT>  Directory where FASTQ files will be written

Options:
  -t, --threads <THREADS>        Number of CPU threads for parallel processing [default: 4]
  -r, --locus <REGION>           Process reads from a specific genomic region (format: chr1:1000-2000)
  -n, --reads-per-fastq <READS>  Maximum number of reads per FASTQ file. All reads go to a single file if not specified.
      --max-memory <MEMORY>      Maximum memory to use in MB. If not specified, will be automatically determined based on system resources.
      --no-compress              Disable gzip compression for output FASTQ files
  -h, --help                     Print help
  -V, --version                  Print version

📝 Parameter Description

🔴 Required Parameters

`<BAM>` (Required)

Specify the path to the input BAM file.

Format Requirement: Must be a valid C4 RNA BAM file. Both single-end and paired-end data are supported.
Index Requirement: The BAM file must be indexed (a corresponding .bai file must exist alongside it).
Paired-end Note: For paired-end data, sort the BAM file by read name with `samtools sort -n` before processing.

Default: None

Example:

/path/to/your.bam

`<OUTPUT>` (Required)

Specify the directory for the output FASTQ files.

Function: All converted FASTQ files will be saved in this directory.
Auto-creation: The directory will be created automatically if it does not exist.

Default: None

Example:

/path/to/output_dir

🟢 Optional Parameters

`-t, --threads` (Optional)

Set the number of CPU threads for parallel processing.

Performance Note: Because the tool must keep the output order consistent with the input, increasing the number of threads does not significantly improve overall speed.
Recommendation: Use the default of 4 threads.

Default: 4

Example:

-t 8

`-r, --locus` (Optional)

Process only reads from a specific genomic region.

Format: Standard genomic coordinate format (chromosome:start-end).
Application: For targeted analysis of specific genes or chromosomal regions.

Default: None

Example:

-r chr1:1000-2000
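
If you generate region strings in a script, it can help to validate them before launching a long run. The following is a minimal sketch of a checker for the `chromosome:start-end` form; the helper name `parse_locus` is hypothetical, and the code only splits the string, making no claim about how the tool interprets the coordinates.

```python
import re

# chromosome name, a colon, then start-end as non-negative integers
LOCUS_RE = re.compile(r"^(?P<chrom>[^:]+):(?P<start>\d+)-(?P<end>\d+)$")

def parse_locus(text):
    """Split 'chr1:1000-2000' into ('chr1', 1000, 2000) or raise ValueError."""
    match = LOCUS_RE.match(text)
    if match is None:
        raise ValueError(f"not in chromosome:start-end form: {text!r}")
    start, end = int(match["start"]), int(match["end"])
    if start > end:
        raise ValueError(f"start is greater than end: {text!r}")
    return match["chrom"], start, end

print(parse_locus("chr1:1000-2000"))
```

Note that this simple pattern rejects chromosome names that themselves contain a colon.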

`-n, --reads-per-fastq` (Optional)

Set the maximum number of reads per output FASTQ file.

Splitting Strategy: Automatically splits large files into smaller ones for easier downstream processing.
Default Behavior: If not specified, all reads will be written to a single file.

Default: None

Example:

-n 10000000

`--max-memory <MEMORY>` (Optional)

Set the maximum memory the tool can use (in MB).

Function: Controls the tool's memory consumption to prevent failures due to insufficient memory.
Auto-determination: If not specified, the tool will automatically allocate memory based on available system resources.

Default: Auto-determined

Example:

--max-memory 8192

`--no-compress` (Flag)

Disable gzip compression for the output FASTQ files.

Performance Bottleneck: Writing compressed files is the program's main speed bottleneck.
Strongly Recommended: Setting this flag significantly improves overall analysis speed.
Trade-off: Uncompressed files occupy more disk space, so ensure you have sufficient storage.

Default: Not set

💡 Usage Examples

Basic conversion:

bam2fastq input.bam ./output_dir --no-compress

Multi-threaded conversion:

bam2fastq -t 8 input.bam ./output_dir --no-compress

Region-specific conversion:

bam2fastq -r chr1:1000000-2000000 -t 4 input.bam ./output_dir --no-compress

Large file splitting conversion:

bam2fastq -n 5000000 -t 4 input.bam ./output_dir --no-compress

🧬 Chromosome Splitting (chromsplit)

🧬 Core Functionality

A genome sequence splitting tool that chooses split points so that gene annotations stay intact. It is primarily used in ATAC library construction to keep chromosome lengths within the 2^29-1 (536,870,911) bp limit.

📊 Usage

$ chromsplit --help

Usage: chromsplit [OPTIONS] --fasta <FA> --prefix <PREFIX>

Options:
  -f, --fasta <FA>           Input genome sequence file in FASTA format
  -g, --gtf <GTF>            Optional GTF/GFF annotation file for the genome
  -o, --prefix <PREFIX>      Prefix for output files
  --min_length <MIN_LENGTH>  Minimum length of output scaffold fragments [default: 300000000]
  --max_length <MAX_LENGTH>  Maximum length of output scaffold fragments [default: 500000000]
  --cut_site <CUT_SITE>      Optional cut site file containing predefined split positions
  -h, --help                 Print help
  -V, --version              Print version

📝 Parameter Description

🔴 Required Parameters

`-f, --fasta <FA>` (Required)

Specify the input genome sequence file.

Format Requirement: Standard FASTA format (.fa, .fasta, .fna).
Content: Must contain complete chromosome or scaffold sequences.

Default: None

Example:

--fasta genome.fasta

`-o, --prefix <PREFIX>` (Required)

Specify the prefix for the output files.

Output Files: The tool generates files named after the prefix, such as `<prefix>.fa` and `<prefix>.cutsite.tsv`.
File Management: A distinct prefix per run keeps results from different runs easy to tell apart in batch processing.

Default: None

Example:

--prefix split_genome

🟢 Optional Parameters

`-g, --gtf <GTF>` (Optional)

Specify the gene annotation file (GTF/GFF format).

Intelligent Splitting: Providing an annotation file ensures that split points are located in intergenic regions, protecting gene integrity.
Annotation Sync: The tool automatically adjusts and outputs a new annotation file with synchronized coordinates.

Default: None

Example:

--gtf annotation.gtf
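
To illustrate what "split points in intergenic regions" means, the sketch below picks a cut position inside a target window so that no gene is split. It is a simplified model under stated assumptions (1-based inclusive gene coordinates, a cut placed after base `p`), not chromsplit's actual algorithm; the name `pick_cut` is hypothetical.

```python
def pick_cut(genes, lo, hi):
    """Return the largest p in [lo, hi] such that cutting after base p splits no gene.

    genes: iterable of (start, end) pairs, 1-based and inclusive (assumed).
    A gene is split when start <= p < end. Returns None if no such p exists.
    """
    p = hi
    # Visit genes from the rightmost end to the leftmost; whenever a cut after p
    # would fall inside a gene, move p to just before that gene's start.
    for start, end in sorted(genes, key=lambda gene: gene[1], reverse=True):
        if start <= p < end:
            p = start - 1
    return p if p >= lo else None

genes = [(10, 20), (15, 30), (50, 60)]
print(pick_cut(genes, lo=1, hi=25))
print(pick_cut(genes, lo=12, hi=18))
```

In this model the first call moves left past both overlapping genes and returns 9, while the second returns `None` because every position in the window lies inside a gene.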

`--min_length <MIN_LENGTH>` (Optional)

Set the minimum length of the output fragments (unit: bp).

Function: Ensures that the split fragments are not too small, which could affect subsequent analysis.

Default: 300000000

Example:

--min_length 300000000

`--max_length <MAX_LENGTH>` (Optional)

Set the maximum length of the output fragments (unit: bp).

Technical Limitation: Keeps fragment lengths within the limit required by downstream analyses such as ATAC library construction (usually below 2^29-1 = 536,870,911 bp).

Default: 500000000

Example:

--max_length 500000000
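
To check whether a genome needs splitting at all, you can compare each sequence length with the 2^29-1 limit mentioned above. The following is a minimal sketch using only the standard library; the function names are illustrative and the tiny FASTA is made-up data.

```python
LIMIT = 2**29 - 1  # 536,870,911 bp; the default --max_length (500000000) is below this

def sequence_lengths(fasta_lines):
    """Return {sequence_name: length} for FASTA text given as an iterable of lines."""
    lengths = {}
    name = None
    for line in fasta_lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            parts = line[1:].split()
            name = parts[0] if parts else ""  # name is the first word of the header
            lengths[name] = 0
        elif name is not None:
            lengths[name] += len(line)
    return lengths

def over_limit(lengths, limit=LIMIT):
    """Names of sequences longer than `limit`."""
    return [name for name, length in lengths.items() if length > limit]

demo = [">chr1 demo sequence", "ACGT", "AC", ">chr2", "GG"]
print(sequence_lengths(demo))
print(over_limit({"chrA": 2**29, "chrB": 2**29 - 1}))
```

For a real genome, pass an open file handle instead of the `demo` list.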

`--cut_site <CUT_SITE>` (Optional)

Provide a text file containing predefined split positions.

Precise Control: Prioritizes splitting at the specified positions in the file, allowing for precise control over split locations.

Default: None

Example:

--cut_site predefined_cuts.txt

💡 Usage Examples

Basic splitting:

chromsplit --fasta genome.fasta --prefix split_result

Intelligent splitting with annotation file:

chromsplit --fasta genome.fasta --gtf annotation.gtf --prefix split_genome

Custom length splitting:

chromsplit --fasta genome.fasta --prefix custom_split --min_length 300000000 --max_length 500000000

Using predefined split positions:

chromsplit --fasta genome.fasta --gtf annotation.gtf --prefix precise_split --cut_site custom_cuts.txt

📝 FASTQ Subsetting (fqsubC4)

📝 Core Functionality

A tool for extracting regions from FASTQ sequences by sequence position. It is mainly used to reconcile data format inconsistencies between sequencing runs so that C4 sequencing data can be processed uniformly.

📊 Usage

$ fqsubC4 --help

Usage: fqsubC4 [OPTIONS] --input <FILE> --output <FILE> --regions <REGIONS>

Options:
  -i, --input <FILE>           Path to input FASTQ file
  -o, --output <FILE>          Path to output FASTQ file
  -r, --regions <REGIONS>      Comma-separated regions in format start:end (e.g., 7:16,23:32,38:47)
  -b, --batch-size <BATCH_SIZE>  Batch size for processing [default: 100000]
  --buffer-size <BUFFER_SIZE>  Buffer size for channel between reader and writer [default: 500]
  -h, --help                   Print help
  -V, --version                Print version

📝 Parameter Description

🔴 Required Parameters

`-i, --input <FILE>` (Required)

Specify the path to the input FASTQ file.

Format Support: Supports both uncompressed (.fq, .fastq) and gzipped (.fq.gz, .fastq.gz) formats.
Auto-detection: The tool automatically determines the compression format based on the file extension.

Default: None

Example:

--input sample_R1.fastq.gz
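
Extension-based detection can be pictured with the short sketch below, which opens a file through gzip only when its name ends with `.gz`. This is an illustration of the idea, not the tool's code; `open_fastq` is a hypothetical helper and the record is made-up data.

```python
import gzip
import os
import tempfile

def open_fastq(path, mode="r"):
    """Open a FASTQ file as text, using gzip when the name ends with .gz."""
    if path.endswith(".gz"):
        return gzip.open(path, mode + "t")
    return open(path, mode)

# Round-trip one record through a gzipped file and a plain file.
record = "@read1\nACGT\n+\nFFFF\n"
results = []
with tempfile.TemporaryDirectory() as tmp:
    for name in ("demo.fastq.gz", "demo.fastq"):
        path = os.path.join(tmp, name)
        with open_fastq(path, "w") as handle:
            handle.write(record)
        with open_fastq(path) as handle:
            results.append(handle.read())
print(results[0] == results[1] == record)
```

A consequence of relying on the extension is that a gzipped file without a `.gz` suffix would be read as plain text in this sketch, so keep file names and contents consistent.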

`-o, --output <FILE>` (Required)

Specify the path for the output FASTQ file.

Auto-compression: The output file will be automatically compressed if the filename ends with .gz.
Performance Note: GZIP compression will significantly slow down the processing speed.

Default: None

Example:

--output extracted_R1.fastq.gz

`-r, --regions <REGIONS>` (Required)

Specify the regions to be extracted from the sequences.

Format Specification: Use start:end format, with multiple regions separated by commas.
Coordinate System: Coordinates are 1-based (the first base of the sequence is position 1).
Application: Used for extracting Barcodes, UMIs, or for trimming sequences.

Default: None

Example:

--regions 7:16,23:32,38:47
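
As a worked illustration of the region syntax, the sketch below parses a `--regions` string and applies it to the sequence and quality lines of one FASTQ record. It assumes the end coordinate is inclusive, so `7:16` selects 10 characters; the source only states that coordinates are 1-based, so treat that as an assumption. The function names are hypothetical.

```python
def parse_regions(spec):
    """Parse 'start:end,start:end' into a list of (start, end) integer pairs."""
    regions = []
    for part in spec.split(","):
        start, end = (int(value) for value in part.split(":"))
        if start < 1 or end < start:
            raise ValueError(f"invalid region: {part!r}")
        regions.append((start, end))
    return regions

def extract_regions(text, regions):
    """Concatenate the selected regions (1-based start, end assumed inclusive)."""
    return "".join(text[start - 1:end] for start, end in regions)

def subset_record(record, regions):
    """Apply the same regions to the sequence and quality lines of a 4-line record."""
    header, seq, plus, qual = record
    return (header, extract_regions(seq, regions), plus, extract_regions(qual, regions))

regions = parse_regions("7:16,23:32")
text = "ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789"
print(extract_regions(text, regions))
```

The sequence and quality lines must be clipped with the same regions, otherwise the record stops being valid FASTQ.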

🟢 Optional Parameters

`-b, --batch-size <BATCH_SIZE>` (Optional)

Set the number of records per batch for processing (i.e., the number of FASTQ records read into memory at one time).

Performance Impact: Higher values use more memory but may improve throughput, so choose a value that balances memory usage against processing speed.

Default: 100000

Example:

--batch-size 200000
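
Batch processing means records are handled in fixed-size groups rather than one at a time or all at once. A minimal sketch of that pattern is shown below; it is a generic illustration, not the tool's internals, and `batches` is a hypothetical helper.

```python
from itertools import islice

def batches(records, batch_size=100000):
    """Yield lists of up to `batch_size` records from any iterable."""
    iterator = iter(records)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch

# Only one batch is held in memory at a time.
for batch in batches(range(5), batch_size=2):
    print(batch)
```

Memory use in this pattern grows with the batch size, which is why larger values trade memory for fewer, larger processing steps.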

`--buffer-size <BUFFER_SIZE>` (Optional)

Set the buffer size for the channel between the reader and writer.

Throughput Optimization: Adjust this parameter for better throughput when processing large files.

Default: 500

Example:

--buffer-size 1000

💡 Usage Example

Basic region extraction:

fqsubC4 --input sample.fastq.gz --output extracted.fastq --regions "7:16,23:32"

💡 Tip

This document is continuously updated. If you find any errors or have information to add, your feedback is welcome.

📝 Document Version: 3.0 beta | Last Updated: 2025
