Skip to content

Create Dataset Command

Convert BAM files into a Zarr dataset for efficient analysis.

Usage

quantnado create-dataset [OPTIONS] BAM_FILES...

Description

The create-dataset command processes one or more BAM files and creates a Zarr-backed dataset for efficient genomic signal analysis. This is the first step in most QuantNado workflows.

Arguments

BAM Files (Required)

Paths to BAM files to process (one or more):

quantnado create-dataset sample1.bam sample2.bam sample3.bam --output dataset.zarr

BAM files must be: - Coordinate-sorted - Indexed (.bai file in same directory) - Valid BAM format

Options

Output (--output, -o)

Required. Path where the Zarr dataset will be saved:

quantnado create-dataset *.bam --output dataset.zarr

Creates a directory dataset.zarr/ containing the processed data.

Chromsizes (--chromsizes)

Path to chromsizes file or auto-detect from first BAM:

# Specify explicit chromsizes
quantnado create-dataset *.bam --output dataset.zarr --chromsizes hg38.chrom.sizes

# Auto-detect from BAM (default)
quantnado create-dataset *.bam --output dataset.zarr

Chromsizes file format (tab-separated):

chr1    248956422
chr2    242193529
...

Metadata (--metadata)

Path to metadata CSV file:

quantnado create-dataset *.bam --output dataset.zarr --metadata samples.csv

CSV format with sample_id column:

sample_id,condition,replicate,quality
sample1,control,1,high
sample2,control,2,high
sample3,treatment,1,high

Max Workers (--max-workers)

Number of parallel worker threads (default: 1):

# Use more workers for faster processing (uses more memory)
quantnado create-dataset *.bam --output dataset.zarr --max-workers 8

# Use single worker for memory efficiency
quantnado create-dataset *.bam --output dataset.zarr --max-workers 1

Chunk Length (--chunk-len)

Override the position-axis Zarr chunk length. If omitted, QuantNado derives a filesystem-aware default from the output path so network filesystems like CephFS use much larger write units than local disks.

# Let QuantNado auto-select a chunk length from the target filesystem
quantnado create-dataset *.bam --output dataset.zarr

# Pin an explicit chunk length for benchmarking or reproducibility
quantnado create-dataset *.bam --output dataset.zarr --chunk-len 131072

Construction Compression (--construction-compression)

Control build-time compression separately from the on-disk dataset layout. This is useful when benchmarking CephFS write throughput, where lower compression overhead or fully uncompressed construction may outperform the default profile.

# Current default profile
quantnado create-dataset *.bam --output dataset.zarr --construction-compression default

# Lower zstd compression overhead
quantnado create-dataset *.bam --output dataset.zarr --construction-compression fast

# Uncompressed construction writes
quantnado create-dataset *.bam --output dataset.zarr --construction-compression none

Local Staging (--local-staging, --staging-dir)

Build the dataset under local scratch storage and only publish to the final output path after construction succeeds. This is useful on CephFS-backed clusters because it converts many incremental writes into one finalize step.

# Use TMPDIR-backed scratch staging
quantnado create-dataset *.bam \
  --output /ceph/project/dataset.zarr \
  --local-staging \
  --staging-dir "$TMPDIR"

# Let QuantNado choose the system temporary directory when staging is enabled
quantnado create-dataset *.bam \
  --output /ceph/project/dataset.zarr \
  --local-staging

--local-staging is opt-in. --staging-dir can also be supplied directly to pick a specific scratch filesystem.

Overwrite (--overwrite)

Overwrite existing dataset at same path (default: no overwrite):

quantnado create-dataset *.bam --output dataset.zarr --overwrite

Log File (--log-file)

Path to save processing logs (default: quantnado_processing.log):

quantnado create-dataset *.bam --output dataset.zarr --log-file processing.log

Verbose (--verbose, -v)

Enable debug logging:

quantnado create-dataset *.bam --output dataset.zarr --verbose

Examples

Basic Usage

Create dataset from BAM files:

quantnado create-dataset sample1.bam sample2.bam sample3.bam \
  --output my_dataset.zarr

With Chromsizes

Specify explicit chromsizes file:

quantnado create-dataset *.bam \
  --output dataset.zarr \
  --chromsizes /reference/hg38.chrom.sizes

With Metadata

Include sample metadata:

quantnado create-dataset *.bam \
  --output dataset.zarr \
  --metadata samples.csv

Parallel Processing

Use multiple workers for faster processing:

quantnado create-dataset *.bam \
  --output dataset.zarr \
  --max-workers 8 \
  --verbose

Resume Processing

Resume interrupted dataset creation by passing --resume. Without this flag, re-running the command will fail if the store already exists (use --overwrite to replace it instead):

# Resume from where processing left off (skips completed samples)
quantnado create-dataset *.bam --output dataset.zarr --resume

# Or start fresh, overwriting the existing store
quantnado create-dataset *.bam --output dataset.zarr --overwrite

Performance

Typical Run Times

Creation time depends on sequencing depth:

Read Count Time (single sample)
5M reads 1-2 minutes
50M reads 5-10 minutes
100M+ reads 15-30 minutes

Output Size

Zarr dataset size approximates BAM file size:

Read Depth Size
5M reads 500 MB
50M reads 5 GB
100M+ reads 10+ GB

Troubleshooting

BAM file errors

Problem: FileNotFoundError: BAM file not found

Solution: Verify path and ensure .bai index exists:

ls -l sample1.bam sample1.bam.bai
samtools index sample1.bam

Chromsizes errors

Problem: ValueError: chromsizes_dict appears empty

Solution: Provide explicit chromsizes file:

quantnado create-dataset *.bam --output dataset.zarr --chromsizes hg38.chrom.sizes

Memory issues

Problem: Out of memory during processing

Solution: Use fewer workers:

quantnado create-dataset *.bam --output dataset.zarr --max-workers 1

Slow processing

Problem: Dataset creation is very slow

Solution: Use more workers (if memory available):

quantnado create-dataset *.bam --output dataset.zarr --max-workers 8

Python Equivalent

The CLI create-dataset command is equivalent to:

from quantnado import QuantNado
import pandas as pd

qn = QuantNado.from_bam_files(
    bam_files=["sample1.bam", "sample2.bam", "sample3.bam"],
    store_path="dataset.zarr",
    chromsizes="hg38.chrom.sizes",
    metadata="samples.csv",
    overwrite=True
)

See Dataset Creation Guide for Python API details.

See Also