Downloads And Schema

UMDB exposes its database as browser-friendly pages and machine-readable public artifacts.

This page documents where the current release lives, what the main files contain, and how collaborators can reuse the public records without reverse-engineering the website.

Primary Public Files

All links below are versioned with the repository and intended to support direct download, inspection, and local scripting.

Schema Guide

Accession anchors: SRR, BioProject, and BioSample identifiers function as the main join points across records and external repository lookups.

Run metadata: The nested runinfo_row object carries SRA-derived values such as run date, library strategy, center, and platform.

Location metadata: The nested geo object stores parsed country and city labels and any raw string retained during enrichment.

Assay metadata: The nested assay object stores UMDB’s higher-level assay classification for filtering and analytics.

AI curation metadata: The optional nested ai_curation object stores AI-reviewed sufficiency, urban-origin decisions, and corrected final annotations alongside the original metadata.
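
As a rough illustration of the record layout described above, the Python sketch below reports which nested blocks a record carries. Only the nested object names (runinfo_row, geo, assay, ai_curation) come from this page. The accession key names srr, bioproject, and biosample, and the sample record itself, are assumptions made for illustration and should be checked against a real record.

```python
from typing import Any

# Nested object names taken from the schema guide on this page.
NESTED_BLOCKS = ("runinfo_row", "geo", "assay", "ai_curation")

# ASSUMPTION: these accession key names are hypothetical; check a real record.
ACCESSION_KEYS = ("srr", "bioproject", "biosample")


def describe_record(record: dict[str, Any]) -> dict[str, Any]:
    """Return the accessions plus a flag for each nested block that is present."""
    summary: dict[str, Any] = {key: record.get(key) for key in ACCESSION_KEYS}
    for block in NESTED_BLOCKS:
        # ai_curation is optional, so a missing block is not treated as an error.
        summary[f"has_{block}"] = isinstance(record.get(block), dict)
    return summary


# Made-up record used only to exercise the function.
example = {
    "srr": "SRR0000001",
    "bioproject": "PRJNA000001",
    "biosample": "SAMN00000001",
    "runinfo_row": {},
    "geo": {},
    "assay": {},
}
print(describe_record(example))
```

Checking for the presence of each block before reading from it keeps downstream scripts from failing on records that lack the optional ai_curation object.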

Reuse Guidance

Prefer manifest-driven loading: Scripts should read the manifest first so they can accommodate future chunk-count changes without hardcoding file names.

Expect imperfect location data: Downstream analyses that depend on geographic precision should explicitly filter or manually verify unresolved or ambiguous entries.

Retain accession provenance: When exporting subsets, keep the original accessions so any study can be traced back to its source repository pages.
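
The three recommendations above could be combined along the following lines. This is a minimal sketch, not UMDB's own loader: the manifest layout (a "chunks" list of file names relative to the manifest), the assumption that each chunk is a JSON array of records, and the "country" key inside geo are all hypothetical and need to be adapted to the published files.

```python
import json
from pathlib import Path
from typing import Any, Iterable, Iterator


def iter_records(manifest_path: Path) -> Iterator[dict[str, Any]]:
    """Yield records from every chunk the manifest lists."""
    manifest = json.loads(manifest_path.read_text(encoding="utf-8"))
    # ASSUMPTION: the manifest has a "chunks" list of file names relative to
    # the manifest, and each chunk file is a JSON array of records.
    for chunk_name in manifest["chunks"]:
        chunk_path = manifest_path.parent / chunk_name
        yield from json.loads(chunk_path.read_text(encoding="utf-8"))


def has_resolved_location(record: dict[str, Any]) -> bool:
    """True when the geo block carries a non-empty country label."""
    geo = record.get("geo") or {}
    # ASSUMPTION: "country" is a hypothetical key name inside geo.
    return bool(geo.get("country"))


def export_subset(records: Iterable[dict[str, Any]], out_path: Path) -> int:
    """Write location-resolved records unchanged, so accessions stay attached."""
    kept = [r for r in records if has_resolved_location(r)]
    out_path.write_text(json.dumps(kept, indent=2), encoding="utf-8")
    return len(kept)
```

A call such as export_subset(iter_records(Path("manifest.json")), Path("subset.json")), with both file names as placeholders, would then pick up any change in chunk count automatically. Because whole records are written out rather than selected fields, the original accessions travel with every exported entry.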

Downloading Raw FASTQ Files

UMDB now supports search-derived dataset bundling so filtered results can be converted into raw-read download inputs.

From the main explorer

Use the filters in the main database explorer, then export one of the dataset bundle options: Bundle matched runs (JSON), SRR accession list, or FASTQ download script.

FASTQ script output

The FASTQ script is generated from the current matched search results and uses the SRA Toolkit commands prefetch and fasterq-dump to retrieve raw reads for the exported SRR accessions.
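
The exact contents of the UMDB-generated script are not reproduced on this page. As a hedged approximation, the Python sketch below builds one prefetch and one fasterq-dump command per accession. The function name, the output directory, and the placeholder accessions are assumptions for illustration; the generated script may use different options.

```python
import shlex


def fastq_commands(accessions: list[str], outdir: str = "fastq") -> list[str]:
    """Build one prefetch and one fasterq-dump command line per SRR accession."""
    lines = []
    for acc in accessions:
        quoted = shlex.quote(acc)
        # prefetch downloads the run's SRA data locally.
        lines.append(f"prefetch {quoted}")
        # fasterq-dump converts the run into FASTQ files written to outdir.
        lines.append(f"fasterq-dump {quoted} --outdir {shlex.quote(outdir)}")
    return lines


# Placeholder accessions, not real UMDB results.
print("\n".join(fastq_commands(["SRR0000001", "SRR0000002"])))
```

The printed lines can be pasted into a shell or saved to a file on a machine where SRA Toolkit is on the PATH. Quoting each value with shlex.quote guards against unexpected characters in an accession list that was edited by hand.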

Bundle reuse

The JSON bundle preserves the search filters, project summaries, and run-level accessions so a dataset cohort can be archived, shared with collaborators, or re-used later without reconstructing the same search by hand.
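
To show how an archived bundle might be reused, the sketch below pulls the run-level accessions back out of a bundle file. The bundle's key names are not documented here, so "runs" and "srr" are hypothetical and should be replaced with the names found in an exported bundle.

```python
import json
from pathlib import Path


def accessions_from_bundle(bundle_path: Path) -> list[str]:
    """Return the de-duplicated SRR accessions stored in a bundle, in file order."""
    bundle = json.loads(bundle_path.read_text(encoding="utf-8"))
    # ASSUMPTION: "runs" and "srr" are hypothetical key names.
    seen: dict[str, None] = {}
    for run in bundle.get("runs", []):
        acc = run.get("srr")
        if acc:
            # A dict keeps insertion order while dropping repeated accessions.
            seen.setdefault(acc, None)
    return list(seen)
```

The returned list can be written out one accession per line to recreate an SRR accession list without repeating the original search.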

Typical Workflow

Search for a cohort in the explorer, export the matched runs as a bundle together with the FASTQ download script, and then run the script on a machine with SRA Toolkit installed. This workflow gives UMDB users a practical path from searchable metadata to raw FASTQ retrieval.

Suggested Citation And Use Statement

The final manuscript citation is not yet available. Until it is, the statement below serves as a placeholder.

Users of UMDB should cite both the UMDB resource and the original repositories or studies associated with any reused BioProject, BioSample, or SRR accessions. UMDB reorganizes public metadata for discovery and comparison; it does not replace the need to cite primary data generators.