UMDB — Methodology

Published Database Structure

Database overview

The explorer loads a manifest and one or more part files containing SRR-level records. Those records are aggregated into one row per BioProject and can be expanded to reveal underlying SRR and SRA run detail.

Manifest: docs/db/srr_records_manifest.json stores the file list, generation timestamp, and total record count.
Parts: the docs/db/srr_records_partXXX.json files each store an array of SRR records built from RunInfo fields plus derived annotations.
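As a rough illustration of how a consumer might read this layout outside the browser, the sketch below loads the manifest and concatenates the part files it lists. The manifest key names "files" and "total_records" are assumptions made for the example; the actual key names should be checked against the published manifest.

```python
import json
from pathlib import Path


def load_release(db_dir):
    """Read the manifest, then concatenate the record arrays from every listed part."""
    db_dir = Path(db_dir)
    manifest_path = db_dir / "srr_records_manifest.json"
    manifest = json.loads(manifest_path.read_text(encoding="utf-8"))

    records = []
    # "files" is an assumed key name for the manifest's file list.
    for name in manifest["files"]:
        part = json.loads((db_dir / name).read_text(encoding="utf-8"))
        records.extend(part)

    # "total_records" is an assumed key name for the manifest's record count.
    expected = manifest.get("total_records")
    if expected is not None and expected != len(records):
        raise ValueError(f"manifest lists {expected} records, parts contain {len(records)}")
    return manifest, records
```

Reading the part names from the manifest, rather than guessing them, keeps the loader independent of how many parts a given release contains.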

What is in each record?

Core fields are sourced from SRA RunInfo and joined with optional enrichments:

runinfo_row: SRR, BioProject, BioSample, dates, platform, and center metadata from the public run record.

bioproject: Project accession and title when those enrichments are available.

geo: Country and city fields derived from BioSample attributes and parsed location text.

assay: Assay class labels such as WGS, RNA-seq, or 16S/ITS for filtering and summary analytics.

Geo-resolved browsing

The explorer includes Require Country/City and Exclude (unknown) options so users can restrict browsing to records with more complete geographic resolution when location-aware discovery is the priority.
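The two options can be pictured as a simple record predicate. The sketch below is written in Python for illustration and is not the explorer's own JavaScript; the nested "geo" keys "country" and "city", and the exact meaning given to each option, are assumptions.

```python
def passes_geo_filter(record, require_country_city=False, exclude_unknown=False):
    """Return True if the record survives the selected geographic filters."""
    geo = record.get("geo") or {}
    country = (geo.get("country") or "").strip()
    city = (geo.get("city") or "").strip()

    # Assumed meaning of "Require Country/City": both fields must be non-empty.
    if require_country_city and not (country and city):
        return False

    # Assumed meaning of "Exclude (unknown)": drop records where either field
    # carries the literal placeholder "(unknown)".
    if exclude_unknown and "(unknown)" in (country, city):
        return False

    return True
```

A list of records can then be narrowed with a comprehension such as [r for r in records if passes_geo_filter(r, True, True)].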

Workflow Summary

1. Discovery

Harvest candidate records from NCBI-linked resources

The pipeline queries public NCBI resources and tracks associated SRA identifiers, BioProjects, and BioSamples that match the project’s urban environmental collection strategy.

2. Normalization

Preserve SRR-level metadata with stable keys

RunInfo-derived fields such as accession, dates, sequencing center, library strategy, platform, and download path are stored alongside the original SRR accession so the record remains auditable.

3. Enrichment

Attach BioProject and geographic context

Linked project metadata and parsed location fields are added when available, allowing the interface to summarize records by study, geography, and assay class.

Technical Documentation

Acquisition Layer

How records enter the database

Search strategy: UMDB queries public NCBI-linked resources with one or more urban-environment query profiles rather than relying on a single keyword expression. This increases recall for studies described with different vocabulary.

Primary execution commands: The harvester is driven through python3 -m scripts.UMDB_harvester.cli with the main commands crawl, daily, backfill-year, and curate-ai.

Run-level accumulation: Newly discovered records are appended into year-scoped JSONL catalogs under data/, which function as the internal accumulation layer before the web release is rebuilt.

Deduplication: The pipeline tracks seen SRA UIDs and SRR accessions so repeated crawl runs do not re-emit the same accession into the release catalogs.
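The accumulation and deduplication behaviour described above can be sketched as follows. This is a minimal illustration, not the harvester's code: the record keys "uid", "srr", and "release_date" and the file name pattern catalog_<year>.jsonl are hypothetical.

```python
import json
from pathlib import Path


def append_new_records(records, data_dir, seen):
    """Append unseen records to per-year JSONL files; return how many were written.

    `seen` is a set holding every SRA UID and SRR accession emitted so far.
    """
    data_dir = Path(data_dir)
    data_dir.mkdir(parents=True, exist_ok=True)
    written = 0
    for rec in records:
        ids = {rec["uid"], rec["srr"]}  # hypothetical key names
        if ids & seen:
            continue  # already emitted by this or an earlier crawl
        year = rec["release_date"][:4]  # assumes an ISO-style date string
        path = data_dir / f"catalog_{year}.jsonl"  # hypothetical file name
        with path.open("a", encoding="utf-8") as fh:
            fh.write(json.dumps(rec) + "\n")  # one JSON object per line
        seen.update(ids)
        written += 1
    return written
```

Because JSONL files are append-only line streams, a repeated crawl only adds lines for accessions that were not seen before and never rewrites earlier entries.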

Enrichment Layer

How metadata are normalized and joined

RunInfo backbone: Each record starts from SRA RunInfo-style fields such as SRR accession, BioProject, BioSample, sequencing center, library strategy, release date, and run download path.

BioProject joins: When enabled, linked BioProject records are cached and joined to add project accession, title, description, and related project-level context.

BioSample joins: BioSample metadata are inspected for structured attributes, free-text descriptions, and location-related fields. If latitude and longitude are present, they are propagated into the record and used by the map views.

Derived annotations: Geographic harmonization and assay-class labeling are computed from the original public metadata and added as derived fields rather than replacing the source record.
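To make the "derived field, not replacement" idea concrete, the sketch below attaches an assay label next to the untouched source row. The mapping rules, the "Other" fallback label, the "assay_class" key, and the use of the RunInfo column name LibraryStrategy inside runinfo_row are all assumptions for illustration; the pipeline's actual heuristics may differ.

```python
def classify_assay(library_strategy):
    """Map a library strategy string to a coarse assay class (illustrative rules)."""
    strategy = (library_strategy or "").strip().upper()
    if strategy == "WGS":
        return "WGS"
    if strategy == "RNA-SEQ":
        return "RNA-seq"
    if strategy == "AMPLICON":
        return "16S/ITS"  # assumption: amplicon runs are grouped under 16S/ITS
    return "Other"  # hypothetical fallback label


def annotate(record):
    """Return a copy of the record with a derived 'assay' block added."""
    row = record["runinfo_row"]
    annotated = dict(record)  # shallow copy; the source row is left untouched
    annotated["assay"] = {"assay_class": classify_assay(row.get("LibraryStrategy"))}
    return annotated
```

Keeping the label in its own block means a reader can always compare the summary class against the original library strategy string.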

AI Review Layer

How AI-assisted curation is applied

Purpose: AI review is used as a metadata curation layer, not as the source of record. It evaluates whether the available evidence is sufficient, whether a sample appears to come from an urban context, and whether country, city, or assay annotations likely need correction.

Evidence inspected: The curation step considers the harvested record, linked BioSample metadata, linked BioProject metadata, titles, descriptions, and other available repository text together before writing an ai_curation object.

Non-destructive output: Original metadata are preserved. AI-reviewed values are stored alongside them as final annotations such as final_country, final_city, and final_assay_class, plus review flags such as metadata_sufficient, urban_origin, and ai_fixed.

Caching: AI reviews are cached so repeated runs do not need to re-curate unchanged records unless an overwrite pass is explicitly requested.
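One way to picture the caching and non-destructive behaviour is a cache keyed by a hash of the evidence. In the sketch below, review_fn is a hypothetical stand-in for whatever performs the review, and keying the cache by a SHA-256 digest of the record is an assumption; the source does not state how the real cache is keyed.

```python
import hashlib
import json


def evidence_key(record):
    """Hash the record's evidence (without any earlier review) into a stable key."""
    evidence = {k: v for k, v in record.items() if k != "ai_curation"}
    payload = json.dumps(evidence, sort_keys=True, ensure_ascii=False)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def curate_with_cache(record, cache, review_fn, overwrite=False):
    """Attach an ai_curation object, reusing the cached review for unchanged evidence."""
    key = evidence_key(record)
    if overwrite or key not in cache:
        cache[key] = review_fn(record)  # only called for new or changed records
    merged = dict(record)  # original fields are copied, never edited in place
    merged["ai_curation"] = cache[key]
    return merged
```

With this arrangement an unchanged record costs one hash per run, while any edit to its evidence produces a new key and therefore a fresh review.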

Publication Layer

How the website is generated

Export rebuild: The exporter reconstructs the public release from internal catalogs and caches, then writes browser-consumable JSON into docs/db/.

Chunked delivery: Large SRR collections are split into a manifest file plus one or more part files so the browser can stream the release in manageable pieces instead of loading one monolithic document.

Client-side aggregation: The website loads SRR-level data in JavaScript and aggregates it into BioProject-level views, analytics panels, charts, maps, and export bundles directly in the browser.

Release traceability: Each rebuild records a generation timestamp in the public manifest so readers can tell when the current release snapshot was produced.
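The chunked export with a timestamped manifest can be sketched as below. The part size, the zero-padded three-digit part numbering, and the manifest key names "files", "generated_at", and "total_records" are assumptions chosen for the example rather than the exporter's confirmed behaviour.

```python
import json
from datetime import datetime, timezone
from pathlib import Path


def write_release(records, out_dir, part_size=5000):
    """Split records into part files and write a manifest that describes them."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)

    files = []
    for start in range(0, len(records), part_size):
        # Assumed naming: parts are numbered from 001 in order.
        name = f"srr_records_part{start // part_size + 1:03d}.json"
        chunk = records[start:start + part_size]
        (out_dir / name).write_text(json.dumps(chunk), encoding="utf-8")
        files.append(name)

    manifest = {
        "files": files,
        "generated_at": datetime.now(timezone.utc).isoformat(),
        "total_records": len(records),
    }
    manifest_path = out_dir / "srr_records_manifest.json"
    manifest_path.write_text(json.dumps(manifest, indent=2), encoding="utf-8")
    return manifest
```

Writing the manifest last means a reader that finds a manifest can expect every part it lists to already exist.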

Operational Flow

1. Discovery crawl: A crawl, daily refresh, or year backfill queries the target search profiles and resolves matching SRA-linked records.

2. Record ingestion: Matching UIDs are normalized into SRR-level records, deduplicated, and appended into the internal year-scoped catalogs under data/.

3. Linked enrichment: Optional BioProject and BioSample fetches populate caches that support project descriptions, location harmonization, and sample-coordinate recovery.

4. AI curation: If enabled, the curation layer reviews existing or newly harvested records and writes AI review objects into the curation cache.

5. Public export rebuild: Export generation merges source metadata, derived annotations, and AI review fields into the public JSON release in docs/db/.

6. Browser delivery: The explorer, analytics, charts, downloads, and global map all consume those published JSON artifacts directly in JavaScript.
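Because step 5 publishes source metadata, derived annotations, and AI review fields side by side, a consumer has to decide which value to display. The sketch below shows one plausible precedence, preferring an AI-reviewed final value and falling back to the derived one. This precedence and the nested key "assay_class" are assumptions; the field names final_country, final_city, and final_assay_class come from the AI review description.

```python
def resolved_annotations(record):
    """Pick display values: AI-reviewed finals first, derived annotations second."""
    geo = record.get("geo") or {}
    assay = record.get("assay") or {}
    ai = record.get("ai_curation") or {}
    return {
        "country": ai.get("final_country") or geo.get("country") or "",
        "city": ai.get("final_city") or geo.get("city") or "",
        # "assay_class" is a hypothetical key inside the derived assay block.
        "assay_class": ai.get("final_assay_class") or assay.get("assay_class") or "",
    }
```

The function only reads from the record, so the original and derived values remain available for audit whichever value is shown.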

Automation And Reproducibility

Local execution

The same Python CLI used in development can be run locally for deep searches, targeted backfills, and full AI curation passes. This keeps the acquisition and publication logic aligned across local and hosted runs.

GitHub Actions execution

Repository workflows can run scheduled refreshes as well as manual full-dataset AI curation. The full-dataset workflow invokes curate-ai, rebuilds exported artifacts, and commits updated release files back into version control.

Versioned public release

Because the published database lives in tracked repository artifacts, each site update is inspectable as a versioned release snapshot rather than an opaque database mutation.

Included Fields

runinfo_row: Core SRA RunInfo metadata including SRR accession, BioProject, BioSample, release date, platform, center, and run download path.

bioproject: Project-level metadata such as title, description, accession, data type, and project URLs when available from NCBI.

geo: Parsed location fields including country, city, raw source string, and optional latitude or longitude values when present.

assay: Heuristic assay classifications derived from run metadata for high-level filtering and summary analytics.

ai_curation: Optional AI-reviewed sufficiency, urban-origin, and corrected final annotations stored alongside the original metadata fields.
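A small helper can report which of these blocks a given record actually carries. In the sketch below the five block names are taken from the list above; treating runinfo_row as the only required block and the other four as optional is an assumption based on the "when available" wording.

```python
REQUIRED_BLOCKS = ("runinfo_row",)
OPTIONAL_BLOCKS = ("bioproject", "geo", "assay", "ai_curation")


def describe_blocks(record):
    """Report missing required blocks and which optional blocks are populated."""
    return {
        "missing_required": [b for b in REQUIRED_BLOCKS if not record.get(b)],
        "present_optional": [b for b in OPTIONAL_BLOCKS if record.get(b)],
    }
```

Running this over a release gives a quick picture of how much enrichment each record received before relying on a particular field.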

Important Caveats

Metadata inherit repository ambiguity: If the original public submission is incomplete or inconsistent, UMDB can expose that inconsistency but cannot fully repair it automatically.

Geographic resolution is best-effort: Country and city values may originate from structured BioSample fields or parsed location text and should be interpreted as practical, not absolute, harmonization.

Assay classes are summary labels: The displayed assay grouping is useful for browsing and counts, but exact downstream analytical suitability should still be confirmed against the original records.

Quality Notes

Static publication model

The public site is distributed as static artifacts under docs/, which reduces infrastructure fragility and makes each published snapshot inspectable in version control.

Chunked database delivery

Large SRR collections are published as a manifest plus part files, enabling browser-based access without requiring a dedicated server-side database.

Expandable provenance

The main explorer aggregates by BioProject for readability but preserves underlying SRR-level rows so users can move from summary view to source-derived detail.
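The grouping behind the summary view can be sketched as follows. The example is in Python for illustration, whereas the site performs this step in JavaScript; it also assumes the project accession is stored under the RunInfo column name BioProject inside runinfo_row, and the "(unknown)" bucket for records without a project is a choice made for the example.

```python
from collections import defaultdict


def aggregate_by_bioproject(records):
    """Group SRR-level records into one summary row per BioProject."""
    groups = defaultdict(list)
    for rec in records:
        project = (rec.get("runinfo_row") or {}).get("BioProject") or "(unknown)"
        groups[project].append(rec)

    rows = [
        {
            "bioproject": project,
            "run_count": len(runs),
            "runs": runs,  # SRR-level rows stay attached so the row can be expanded
        }
        for project, runs in groups.items()
    ]
    # Largest projects first; ties keep their first-seen order (stable sort).
    rows.sort(key=lambda row: row["run_count"], reverse=True)
    return rows
```

Each summary row keeps references to its underlying records, so expanding a row exposes the same SRR-level data that was loaded rather than a second copy.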

Transparent derivation is part of the database, not an afterthought.