GEO/SRA Data Download
This document describes how to use SeqNado to download data from GEO/SRA.
Quick Start
1. Get metadata from GEO/ENA
First, obtain the metadata TSV file from GEO or ENA. For example:
- Go to ENA Browser
- Search for your project (e.g., PRJNA1234567)
- Download the "TSV" report with run information
Required columns:
- run_accession (e.g., SRR123456)
- sample_title (sample name)
- library_name (e.g., GSM identifier)
- library_layout (PAIRED or SINGLE) - Required for download
The library_layout column is essential for determining whether to create one or two FASTQ files per sample.
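Before starting a download, it can help to confirm that the TSV contains the columns listed above. The following is a minimal sketch rather than part of SeqNado: the function name check_metadata is hypothetical, and it assumes layout values must match PAIRED or SINGLE exactly, in upper case.

```python
import csv
import io

REQUIRED_COLUMNS = ("run_accession", "sample_title", "library_name", "library_layout")
VALID_LAYOUTS = {"PAIRED", "SINGLE"}


def check_metadata(handle):
    """Return a list of human-readable problems found in a metadata TSV."""
    reader = csv.DictReader(handle, delimiter="\t")
    header = reader.fieldnames or []
    missing = [col for col in REQUIRED_COLUMNS if col not in header]
    if missing:
        # Without the required columns the per-row checks are meaningless.
        return [f"missing column: {col}" for col in missing]

    problems = []
    for line_no, row in enumerate(reader, start=2):  # line 1 is the header
        layout = (row["library_layout"] or "").strip()
        if layout not in VALID_LAYOUTS:
            problems.append(f"line {line_no}: invalid library_layout {layout!r}")
    return problems


# Small in-memory example; the second data row uses a lower-case layout.
example = io.StringIO(
    "run_accession\tsample_title\tlibrary_name\tlibrary_layout\n"
    "SRR123456\tWT_rep1\tGSM12345\tPAIRED\n"
    "SRR123457\tWT_rep2\tGSM12346\tsingle\n"
)
print(check_metadata(example))
```

For a real file, pass an open file handle instead of the in-memory example, e.g. open("filereport_read_run_PRJNA1234567.tsv", newline="").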
Before downloading, edit the sample_title column so that it follows SeqNado naming conventions and contains no spaces.
See Configuration for more details.
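As a rough illustration of the space removal only, the sketch below replaces whitespace in a sample title with underscores. The helper name clean_sample_title and the choice of underscore as the replacement are assumptions; the full naming conventions are described in Configuration and are not reproduced here.

```python
import re


def clean_sample_title(title):
    """Trim the ends and collapse each run of whitespace to one underscore."""
    return re.sub(r"\s+", "_", title.strip())


print(clean_sample_title("WT rep1"))       # WT_rep1
print(clean_sample_title("  KO  rep 2 "))  # KO_rep_2
```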
2. Download FASTQs
Run the download command:
This will:
- Parse the metadata TSV
- Download all FASTQ files using prefetch/fasterq-dump
- Retry failed downloads with scaled resources on each attempt
- Compress and rename files to {GSM}-{sample}_R1.fastq.gz and {GSM}-{sample}_R2.fastq.gz for paired-end data, or {GSM}-{sample}.fastq.gz for single-end data
3. Generate design file (optional)
To also generate a SeqNado design file:
seqnado download filereport_read_run_PRJNA1234567.tsv \
--outdir fastqs \
--assay rna \
--design-output metadata_rna.csv \
--cores 8
Or run separately after download:
Command Reference
Required Arguments
METADATA_TSV: Path to TSV file from GEO/ENA with run information
Options
- -o, --outdir PATH: Output directory for FASTQ files (default: fastqs)
- -a, --assay TEXT: Assay type for design file generation (rna, atac, chip, etc.)
- -d, --design-output PATH: Output path for design CSV
- -c, --cores INT: Number of parallel downloads (default: 4)
- --preset TEXT: Snakemake profile preset (le/lsf/ss) (default: le)
- -n, --dry-run: Show what would be downloaded without downloading
- -v, --verbose: Increase logging verbosity
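When the command is launched from a script, the options above can be assembled programmatically. The sketch below only builds and prints the argument list; it does not run anything. The helper build_download_command is hypothetical, and it uses only the long flags and defaults listed in this reference.

```python
import shlex


def build_download_command(metadata_tsv, outdir="fastqs", assay=None,
                           design_output=None, cores=4, dry_run=False):
    """Assemble a `seqnado download` argument list from the documented options."""
    cmd = ["seqnado", "download", metadata_tsv,
           "--outdir", outdir, "--cores", str(cores)]
    if assay is not None:
        cmd += ["--assay", assay]
    if design_output is not None:
        cmd += ["--design-output", design_output]
    if dry_run:
        cmd.append("--dry-run")
    return cmd


cmd = build_download_command("filereport.tsv", assay="chip",
                             design_output="metadata_chip.csv", dry_run=True)
print(shlex.join(cmd))  # shell-quoted form of the argument list
```

Passing the list to subprocess.run(cmd, check=True) would execute it, assuming seqnado is installed and on the PATH.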
Examples
Download only
Download + generate RNA-seq design
Download ChIP-seq data + design
seqnado download filereport.tsv \
--outdir fastqs \
--assay chip \
--design-output metadata_chip.csv
Dry run to see what would happen
File Naming
Downloaded files are automatically named based on library layout:
Paired-End Data
Single-End Data
For example:
- Paired: GSM12345-WT_rep1_R1.fastq.gz, GSM12345-WT_rep1_R2.fastq.gz
- Single: GSM12345-WT_rep1.fastq.gz
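The naming scheme can be expressed as a small function, which is handy for checking that a download produced the files you expect. This is a sketch inferred from the examples above, not SeqNado's own code; expected_fastq_names is a hypothetical helper.

```python
def expected_fastq_names(gsm, sample, layout):
    """Return the FASTQ file names expected for one sample."""
    stem = f"{gsm}-{sample}"
    if layout == "PAIRED":
        return [f"{stem}_R1.fastq.gz", f"{stem}_R2.fastq.gz"]
    if layout == "SINGLE":
        return [f"{stem}.fastq.gz"]  # no R1/R2 suffix for single-end data
    raise ValueError(f"unknown library_layout: {layout!r}")


print(expected_fastq_names("GSM12345", "WT_rep1", "PAIRED"))
print(expected_fastq_names("GSM12345", "WT_rep1", "SINGLE"))
```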
Library Layout Detection
The download command uses the library_layout column from your TSV to determine how to process each sample:
- PAIRED: Uses the geo_download_paired rule → creates _R1.fastq.gz and _R2.fastq.gz
- SINGLE: Uses the geo_download_single rule → creates .fastq.gz (no R1/R2 suffix)
This approach ensures proper file structure without creating empty placeholder files.
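To preview which runs will be handled by which rule, the layout-to-rule mapping can be applied to the metadata rows directly. The sketch below illustrates the dispatch idea only and is not how SeqNado implements it internally; the rule names are taken from the list above, while runs_by_rule and the row dictionaries are hypothetical.

```python
from collections import defaultdict

RULE_FOR_LAYOUT = {
    "PAIRED": "geo_download_paired",
    "SINGLE": "geo_download_single",
}


def runs_by_rule(rows):
    """Group run accessions by the download rule their layout selects."""
    grouped = defaultdict(list)
    for row in rows:
        # An unrecognised layout raises KeyError rather than being guessed.
        rule = RULE_FOR_LAYOUT[row["library_layout"]]
        grouped[rule].append(row["run_accession"])
    return dict(grouped)


rows = [
    {"run_accession": "SRR123456", "library_layout": "PAIRED"},
    {"run_accession": "SRR123457", "library_layout": "SINGLE"},
    {"run_accession": "SRR123458", "library_layout": "PAIRED"},
]
print(runs_by_rule(rows))
```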
Troubleshooting
Download fails
The download rule retries automatically using Snakemake's resource scaling: memory and time are doubled on each retry attempt. Check the logs for details:
- Full logging in logs/geo_download/{sample}.log
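The doubling behaviour can be pictured with a short calculation. Snakemake passes an attempt counter (starting at 1) to callable resource definitions; the sketch below assumes the first attempt uses the base value and each retry doubles it. The base values are hypothetical placeholders, not SeqNado's actual settings.

```python
def scaled_resource(base, attempt):
    """Double a base resource value on every retry (attempt starts at 1)."""
    return base * 2 ** (attempt - 1)


# Hypothetical base values; the real ones are defined by the SeqNado rules.
BASE_MEM_MB = 4000
BASE_RUNTIME_MIN = 60

for attempt in (1, 2, 3):
    mem = scaled_resource(BASE_MEM_MB, attempt)
    runtime = scaled_resource(BASE_RUNTIME_MIN, attempt)
    print(f"attempt {attempt}: mem_mb={mem}, runtime_min={runtime}")
```

In a Snakefile, the same idea is usually written as a callable resource such as mem_mb=lambda wildcards, attempt: scaled_resource(BASE_MEM_MB, attempt).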
Missing columns in TSV
Make sure your TSV has these required columns:
- run_accession
- sample_title
- library_name
- library_layout (with values 'PAIRED' or 'SINGLE')
The library_layout column is typically included in TSV downloads from ENA. If it's missing from your source:
1. Check the SRA/GEO database for the layout information
2. Add it manually to your TSV file
3. Or download the full metadata from ENA, which includes this column
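If you choose to add the column manually, a short script avoids hand-editing a large TSV. The sketch below assumes you have already looked up the layout for each run; add_library_layout and the layout_by_run mapping are hypothetical names, and existing non-empty library_layout values are left untouched.

```python
import csv
import io


def add_library_layout(src, dst, layout_by_run):
    """Copy a metadata TSV, filling in library_layout from a run -> layout map."""
    reader = csv.DictReader(src, delimiter="\t")
    fieldnames = list(reader.fieldnames or [])
    if "library_layout" not in fieldnames:
        fieldnames.append("library_layout")
    writer = csv.DictWriter(dst, fieldnames=fieldnames,
                            delimiter="\t", lineterminator="\n")
    writer.writeheader()
    for row in reader:
        if not row.get("library_layout"):
            # KeyError here means a run is missing from the lookup table.
            row["library_layout"] = layout_by_run[row["run_accession"]]
        writer.writerow(row)


src = io.StringIO(
    "run_accession\tsample_title\tlibrary_name\n"
    "SRR123456\tWT_rep1\tGSM12345\n"
)
dst = io.StringIO()
add_library_layout(src, dst, {"SRR123456": "PAIRED"})
print(dst.getvalue(), end="")
```

For real files, open the input and output with newline="" and pass the handles in place of the in-memory buffers.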
Memory issues
If downloads run out of memory, adjust the resources in the Snakemake profile or reduce the number of parallel downloads with -c.
Integration with SeqNado Pipeline
After downloading and generating a design file, you can run the full SeqNado pipeline: