GTF File Operations (mkgtf) • BAM to FASTQ (bam2fastq) • Chromosome Splitting (chromsplit) • FASTQ Subsetting (fqsubC4)
Core Functionality
A comprehensive tool for GTF file operations, supporting gene type statistics, intelligent filtering, and file format validation. It provides high-quality, standardized gene annotation data for single-cell analysis.
$ dnbc4tools tools mkgtf -h
optional arguments:
-h, --help show this help message and exit
Basic Settings:
--action <STR> Select action type: 'mkgtf'(filter), 'stat'(statistics) or 'check'(validation) [default: mkgtf]
--ingtf <FILE> Path to input GTF annotation file
--output <FILE> Path to output file
Filter Settings:
GTF file format requirements:
RNA analysis requires "gene"/"transcript" and "exon" types, plus gene_id/name and transcript_id/name attributes.
--include <STR> Set filter parameters in 'mkgtf' mode, multiple filters separated by commas. Default includes: protein_coding, lncRNA, lincRNA, antisense, IG_*/TR_* genes
--type <STR> Set according to gene type tag in GTF attributes [default: gene_biotype]
--feature <STR> Select information from feature column. If no 'gene' rows, select 'transcript' [default: gene]
--ingtf (Required) Specify the path to the input GTF gene annotation file.
Default: None
Example:
--ingtf Homo_sapiens.GRCh38.108.gtf
--output (Required) Specify the output file for the processing results.
Default: None
Examples:
# When action is 'mkgtf' (filter)
--output ./filtered_genes.gtf
# When action is 'stat' (statistics)
--output ./gene_statistics.txt
# When action is 'check' (validation)
--output ./corrected.gtf
--action (Optional) Select the type of operation to perform.
mkgtf: (Default) Filter the GTF file based on gene types.
stat: Count the gene types in the GTF file.
check: Validate and fix the GTF file format.
Default: mkgtf
Example:
--action stat
--include (Optional) In mkgtf mode, specify the gene types to keep, separated by commas.
Default: protein_coding,lncRNA,lincRNA,antisense,IG_*,TR_*
Example:
--include protein_coding,lncRNA
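The default list mixes exact names with wildcard patterns such as IG_* and TR_*. As a rough illustration of how such a comma-separated list could be matched against gene type values, here is a minimal Python sketch. It assumes that '*' behaves like a shell wildcard; this is an assumption about the matching semantics, not mkgtf's actual implementation, and the function name keep_gene_type is hypothetical.

```python
from fnmatch import fnmatchcase

# Default --include value as documented above.
DEFAULT_INCLUDE = "protein_coding,lncRNA,lincRNA,antisense,IG_*,TR_*"


def keep_gene_type(gene_type: str, include: str = DEFAULT_INCLUDE) -> bool:
    """Return True if gene_type matches any comma-separated pattern.

    Assumption: '*' is a shell-style wildcard; the real tool may differ.
    """
    patterns = [p.strip() for p in include.split(",") if p.strip()]
    return any(fnmatchcase(gene_type, pattern) for pattern in patterns)


if __name__ == "__main__":
    for value in ("protein_coding", "IG_V_gene", "TR_C_gene", "miRNA"):
        print(value, keep_gene_type(value))
```

Under this assumption, a narrower list such as protein_coding,lncRNA would drop the IG_*/TR_* gene types that the default keeps.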
--type (Optional) Specify the tag in the GTF attributes used to identify the gene type.
Default: gene_biotype
Example:
--type gene_type
--feature (Optional) Specify which feature type (the value in the GTF feature column) to extract information from. If the file has no 'gene' rows, select 'transcript'.
Default: gene
Example:
--feature transcript
Note
dnbc4tools tools mkgtf --action stat --ingtf genes.gtf --output gtfstat.txt --type gene_biotype
dnbc4tools tools mkgtf --action mkgtf --ingtf genes.gtf --output genes.filter.gtf --type gene_biotype
dnbc4tools tools mkgtf --action check --ingtf genes.gtf --output corrected.gtf
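To make the roles of --type and --feature more concrete, the following Python sketch counts gene type values the way a statistics pass might: it keeps only rows whose feature column matches and reads the chosen tag from the attributes column. It is an illustration under assumptions, not the code behind --action stat; the function name count_gene_types and the sample rows are hypothetical.

```python
import re
from collections import Counter


def count_gene_types(gtf_lines, type_tag="gene_biotype", feature="gene"):
    """Count values of type_tag among rows whose feature column equals feature."""
    # Matches e.g.  gene_biotype "protein_coding"  inside the attributes column.
    tag_re = re.compile(r'(?:^|;)\s*' + re.escape(type_tag) + r'\s+"([^"]*)"')
    counts = Counter()
    for line in gtf_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and header comments
        fields = line.rstrip("\n").split("\t")
        # GTF rows have nine tab-separated columns; column 3 is the feature.
        if len(fields) < 9 or fields[2] != feature:
            continue
        match = tag_re.search(fields[8])
        if match:
            counts[match.group(1)] += 1
    return counts


# Hypothetical sample rows, for illustration only.
example = [
    '#!genome-build GRCh38\n',
    'chr1\thavana\tgene\t11869\t14409\t.\t+\t.\tgene_id "G1"; gene_name "A"; gene_biotype "lncRNA";\n',
    'chr1\thavana\ttranscript\t11869\t14409\t.\t+\t.\tgene_id "G1"; transcript_id "T1"; gene_biotype "lncRNA";\n',
    'chr1\thavana\tgene\t65419\t71585\t.\t+\t.\tgene_id "G2"; gene_name "B"; gene_biotype "protein_coding";\n',
]

if __name__ == "__main__":
    print(count_gene_types(example))
```

If the annotation uses a different tag (for example gene_type), passing the wrong tag name yields no counts in this sketch, which is one reason to check the tag before filtering.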
Professional Conversion Tool
An efficient BAM file manipulation tool specialized for converting C4 RNA BAM files into FASTQ format. It supports multi-threaded parallel processing and flexible output configuration.
$ bam2fastq --help
BAM to FASTQ Converter for C4 Single Cell RNA seq Data
Usage: bam2fastq [OPTIONS] <BAM> <OUTPUT>
Arguments:
<BAM> Path to the input BAM file
<OUTPUT> Directory where FASTQ files will be written
Options:
-t, --threads <THREADS> Number of CPU threads for parallel processing [default: 4]
-r, --locus <REGION> Process reads from a specific genomic region (format: chr1:1000-2000)
-n, --reads-per-fastq <READS> Maximum number of reads per FASTQ file. All reads go to a single file if not specified.
--max-memory <MEMORY> Maximum memory to use in MB. If not specified, will be automatically determined based on system resources.
--no-compress Disable gzip compression for output FASTQ files
-h, --help Print help
-V, --version Print version
<BAM> (Required) Specify the path to the input BAM file.
Sort the BAM by read name with samtools sort -n before processing.
Default: None
Example:
/path/to/your.bam
<OUTPUT> (Required) Specify the directory for the output FASTQ files.
Default: None
Example:
/path/to/output_dir
-t, --threads (Optional) Set the number of CPU threads for parallel processing.
Default: 4
Example:
-t 8
-r, --locus (Optional) Process only reads from a specific genomic region (format: chromosome:start-end).
Default: None
Example:
-r chr1:1000-2000
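When region strings are generated by a script, it can help to validate them before launching a long conversion. The sketch below parses the chromosome:start-end form shown above; it is a standalone helper, not part of bam2fastq, and the name parse_locus is hypothetical. Whether the tool itself accepts other spellings (for example a bare chromosome name) is not stated here, so this sketch rejects them.

```python
import re

# chromosome:start-end; the greedy prefix tolerates ':' inside contig names.
LOCUS_RE = re.compile(r"^(?P<chrom>.+):(?P<start>\d+)-(?P<end>\d+)$")


def parse_locus(region: str):
    """Split 'chr1:1000-2000' into ('chr1', 1000, 2000) or raise ValueError."""
    match = LOCUS_RE.match(region)
    if match is None:
        raise ValueError(f"expected chromosome:start-end, got {region!r}")
    start, end = int(match["start"]), int(match["end"])
    if start > end:
        raise ValueError(f"start is greater than end in {region!r}")
    return match["chrom"], start, end


if __name__ == "__main__":
    print(parse_locus("chr1:1000-2000"))
```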
-n, --reads-per-fastq (Optional) Set the maximum number of reads per output FASTQ file.
Default: None
Example:
-n 10000000
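As a quick way to anticipate how many output chunks a given -n value implies, the arithmetic is a ceiling division. The helper below is a hypothetical planning aid, assuming the tool fills each file up to the limit before starting the next; how bam2fastq actually names or pairs the files is not covered here.

```python
from typing import Optional


def fastq_chunk_count(total_reads: int, reads_per_fastq: Optional[int] = None) -> int:
    """Estimate the number of FASTQ chunks for a read count and a per-file cap."""
    if reads_per_fastq is None:
        return 1  # documented behaviour: all reads go to a single file
    if reads_per_fastq <= 0:
        raise ValueError("reads_per_fastq must be positive")
    # Ceiling division using integers only.
    return -(-total_reads // reads_per_fastq)


if __name__ == "__main__":
    print(fastq_chunk_count(23_000_000, 10_000_000))
```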
--max-memory <MEMORY> (Optional) Set the maximum memory the tool can use (in MB).
Default: Auto-determined
Example:
--max-memory 8192
--no-compress (Flag) Disable gzip compression for the output FASTQ files. This significantly speeds up conversion, at the cost of larger output files.
Default: Not set
Note
bam2fastq input.bam ./output_dir --no-compress
bam2fastq -t 8 input.bam ./output_dir --no-compress
bam2fastq -r chr1:1000000-2000000 -t 4 input.bam ./output_dir --no-compress
bam2fastq -n 5000000 -t 4 input.bam ./output_dir --no-compress
Core Functionality
A genome sequence splitting tool that chooses split points so that gene annotations stay intact. It is primarily used in ATAC library construction to ensure chromosome lengths do not exceed the 2^29-1 (536,870,911 bp) limit.
$ chromsplit --help
Usage: chromsplit [OPTIONS] --fasta <FA> --prefix <PREFIX>
Options:
-f, --fasta <FA> Input genome sequence file in FASTA format
-g, --gtf <GTF> Optional GTF/GFF annotation file for the genome
-o, --prefix <PREFIX> Prefix for output files
--min_length <MIN_LENGTH> Minimum length of output scaffold fragments [default: 300000000]
--max_length <MAX_LENGTH> Maximum length of output scaffold fragments [default: 500000000]
--cut_site <CUT_SITE> Optional cut site file containing predefined split positions
-h, --help Print help
-V, --version Print version
-f, --fasta <FA> (Required) Specify the input genome sequence file.
Default: None
Example:
--fasta genome.fasta
-o, --prefix <PREFIX> (Required) Specify the prefix for the output files.
Output files are named <prefix>.fa, <prefix>.cutsite.tsv, etc.
Default: None
Example:
--prefix split_genome
-g, --gtf <GTF> (Optional) Specify the gene annotation file (GTF/GFF format).
Default: None
Example:
--gtf annotation.gtf
--min_length <MIN_LENGTH> (Optional) Set the minimum length of the output fragments (unit: bp).
Default: 300000000
Example:
--min_length 300000000
--max_length <MAX_LENGTH> (Optional) Set the maximum length of the output fragments (unit: bp).
Default: 500000000
Example:
--max_length 500000000
--cut_site <CUT_SITE> (Optional) Provide a text file containing predefined split positions.
Default: None
Example:
--cut_site predefined_cuts.txt
Note
chromsplit --fasta genome.fasta --prefix split_result
chromsplit --fasta genome.fasta --gtf annotation.gtf --prefix split_genome
chromsplit --fasta genome.fasta --prefix custom_split --min_length 300000000 --max_length 500000000
chromsplit --fasta genome.fasta --gtf annotation.gtf --prefix precise_split --cut_site custom_cuts.txt
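Before running chromsplit, it may be useful to check whether any sequence in the genome actually exceeds the 2^29-1 limit mentioned above. The Python sketch below computes sequence lengths from FASTA lines and lists the names that are too long. It is a standalone pre-check, not part of chromsplit, and the function names are hypothetical.

```python
MAX_CHROM_LENGTH = 2**29 - 1  # 536,870,911 bp, the limit cited above


def fasta_lengths(lines):
    """Return {sequence_name: length} from an iterable of FASTA lines."""
    lengths = {}
    name = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Use the first whitespace-delimited token of the header as the name.
            parts = line[1:].split()
            name = parts[0] if parts else ""
            lengths[name] = 0
        elif name is not None:
            lengths[name] += len(line)
    return lengths


def needs_splitting(lengths, limit=MAX_CHROM_LENGTH):
    """Return the sorted names of sequences longer than limit."""
    return sorted(name for name, length in lengths.items() if length > limit)


if __name__ == "__main__":
    demo = [">chr1 demo\n", "ACGT\n", "AC\n", ">chr2\n", "GG\n"]
    print(fasta_lengths(demo))
    print(needs_splitting({"chrA": 536_870_911, "chrB": 536_870_912}))
```

Note that the default --max_length of 500000000 is below this limit, so fragments produced with the default settings stay within it.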
Core Functionality
A professional tool for extracting regions from FASTQ sequences, supporting precise sequence position clipping. It is mainly used to resolve data format inconsistencies from multiple sequencing runs, ensuring standardized processing of C4 sequencing data.
$ fqsubC4 --help
Usage: fqsubC4 [OPTIONS] --input <FILE> --output <FILE> --regions <REGIONS>
Options:
-i, --input <FILE> Path to input FASTQ file
-o, --output <FILE> Path to output FASTQ file
-r, --regions <REGIONS> Comma-separated regions in format start:end (e.g., 7:16,23:32,38:47)
-b, --batch-size <BATCH_SIZE> Batch size for processing [default: 100000]
--buffer-size <BUFFER_SIZE> Buffer size for channel between reader and writer [default: 500]
-h, --help Print help
-V, --version Print version
-i, --input <FILE> (Required) Specify the path to the input FASTQ file.
Default: None
Example:
--input sample_R1.fastq.gz
-o, --output <FILE> (Required) Specify the path for the output FASTQ file.
.gz.
Default: None
Example:
--output extracted_R1.fastq.gz
-r, --regions <REGIONS> (Required) Specify the regions to be extracted from the sequences.
Use start:end format, with multiple regions separated by commas.
Default: None
Example:
--regions 7:16,23:32,38:47
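The effect of a region list on a single read can be sketched in Python as follows. Two assumptions are made and are not confirmed by this document: coordinates are treated as 1-based and inclusive (so 7:16 covers ten bases), and the extracted pieces are concatenated in the order given. The function names are hypothetical, and this is an illustration rather than fqsubC4's implementation.

```python
def parse_regions(spec: str):
    """Parse '7:16,23:32' into [(7, 16), (23, 32)]."""
    regions = []
    for part in spec.split(","):
        start_text, end_text = part.strip().split(":")
        start, end = int(start_text), int(end_text)
        if start < 1 or end < start:
            raise ValueError(f"invalid region: {part!r}")
        regions.append((start, end))
    return regions


def extract_regions(sequence: str, quality: str, spec: str):
    """Cut the same regions from the sequence and quality strings of one read."""
    regions = parse_regions(spec)
    # Assumption: 1-based inclusive coordinates, so start:end -> slice [start-1:end].
    sub_seq = "".join(sequence[s - 1:e] for s, e in regions)
    sub_qual = "".join(quality[s - 1:e] for s, e in regions)
    return sub_seq, sub_qual


if __name__ == "__main__":
    # Synthetic 47-base read: three 10-base blocks separated by N spacers.
    seq = "N" * 6 + "A" * 10 + "N" * 6 + "C" * 10 + "N" * 5 + "G" * 10
    qual = "I" * len(seq)
    print(extract_regions(seq, qual, "7:16,23:32,38:47"))
```

The sequence and quality lines must be trimmed identically, otherwise the resulting FASTQ record is invalid.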
-b, --batch-size <BATCH_SIZE> (Optional) Set the number of records per batch for processing (i.e., the number of FASTQ records read into memory at one time).
Default: 100000
Example:
--batch-size 200000
--buffer-size <BUFFER_SIZE> (Optional) Set the buffer size for the channel between the reader and writer.
Default: 500
Example:
--buffer-size 1000
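The interplay between batch size and buffer size can be pictured as a reader thread that hands batches to a writer thread through a bounded channel. The Python sketch below models that pattern with a standard-library queue. It assumes the buffer size counts batches rather than individual records, which is not stated in this document, and it is a conceptual model only, not fqsubC4's code; run_pipeline is a hypothetical name.

```python
import queue
import threading


def run_pipeline(records, batch_size=100000, buffer_size=500):
    """Pass records from a reader thread to a writer thread in batches."""
    # Bounded channel: holds at most buffer_size batches (assumed unit).
    channel = queue.Queue(maxsize=buffer_size)
    written = []

    def reader():
        batch = []
        for record in records:
            batch.append(record)
            if len(batch) == batch_size:
                channel.put(batch)  # blocks while the channel is full
                batch = []
        if batch:
            channel.put(batch)  # final, possibly smaller batch
        channel.put(None)  # sentinel: no more batches

    def writer():
        while True:
            batch = channel.get()
            if batch is None:
                break
            written.extend(batch)  # stand-in for writing to the output file

    threads = [threading.Thread(target=reader), threading.Thread(target=writer)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return written


if __name__ == "__main__":
    print(run_pipeline(list(range(10)), batch_size=3, buffer_size=2))
```

In this model, larger values of either setting let the reader run further ahead of the writer at the cost of holding more records in memory.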
Note
fqsubC4 --input sample.fastq.gz --output extracted.fastq --regions "7:16,23:32"
Tip
This document is continuously updated. If you find any errors or have information to add, your feedback is welcome.
Document Version: 3.0 beta | Last Updated: 2025
DNBelab C Series HT Tool-based Analysis Parameters
A parameter configuration guide for high-performance single-cell data analysis tools