feat(dwc): export backend — 25 API endpoints, cache engine, EML, RSS#7938

Closed
foozleface wants to merge 1 commit into specify:main from calacademy-research:cas/dwc-backend

@foozleface
Collaborator

Contributed by @foozleface

Implements the full server-side DwC export pipeline: 21 new API endpoints (25 total with the 4 existing), a cache engine for pre-built query results, two archive generation paths, EML/RSS support, and a 217-term DwC vocabulary. Builds on the schema models from the DwC schema PR.

Implementation

  • 25 API endpoints (specifyweb/backend/export/urls.py) covering mapping CRUD (create_mapping, update_mapping, delete_mapping, save_mapping_fields, clone_mapping, create_mapping_from_query), dataset CRUD (create_dataset, update_dataset, delete_dataset, clone_dataset), archive generation (generate_dwca, build_cache, cache_status), queries (list_queries, list_mappings, list_export_datasets), schema terms (schema_terms), OccurrenceID validation (validate_occurrence_ids), EML preview (preview_eml), RSS feed (rss_feed, download_feed), and feed updates (force_update, force_update_packages).
  • Direct query execution path (dwca_from_mapping.py) — executes the backing SpQuery live, writes CSV into a DwCA ZIP. Supports core + extension mappings with coreid linking.
  • Cache-first path (dwca_from_cache.py) — reads from pre-built cache tables for faster archive generation when data has not changed.
  • Cache engine (cache.py) — creates/drops/rebuilds MySQL cache tables using stream-and-batch-insert. Tracks build status (idle/building/error) in CacheTableMeta. Uses build_query() from the stored queries engine via SQLAlchemy.
  • Shared utilities (dwca_utils.py) — build_meta_xml() and build_eml_xml() generate standards-compliant meta.xml and eml.xml for the archive. Term name sanitization for safe column/file names.
  • Field adapter (field_adapter.py) — bridges Django Spqueryfield (lowercase attrs) to the EphemeralField interface (camelCase) expected by QueryField.from_spqueryfield() in the stored queries engine.
  • Schema terms vocabulary (schema_terms.json) — 217 DwC terms across 14 groups (Occurrence, Event, Location, Taxon, Identification, GeologicalContext, Record-level, Organism, MeasurementOrFact, ResourceRelationship, MaterialEntity, MaterialSample, Media, Record), each with suggested Specify mapping paths.
  • Default mappings (default_mappings.py) — pre-configured field sets for common DwC profiles.
  • OccurrenceID uniqueness validation — endpoint checks for duplicate GUIDs before archive generation.
  • Path traversal protection on download_feed: rejects requested filenames that would escape the download directory.
  • RSS feed rewrite — now driven by ExportDataSet.isrss flag instead of XML config files, with IPT-compatible <ipt:eml> and <ipt:dwca> elements.
  • Test suite (tests/) — unit tests for models, cache operations, archive generation, attachment URLs, and feed output.
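
To make the descriptor step above concrete, here is a minimal sketch of how a Darwin Core Archive meta.xml for a single core file might be produced, together with a simple term-name sanitizer. It is not the PR's build_meta_xml(): the function names build_meta_xml_sketch and sanitize_term, the CSV dialect attributes, and the assumption that column 0 holds the record id are all illustrative.

```python
import re
import xml.etree.ElementTree as ET

DWC_TEXT_NS = "http://rs.tdwg.org/dwc/text/"
DWC_TERMS = "http://rs.tdwg.org/dwc/terms/"


def sanitize_term(term: str) -> str:
    """Reduce a term IRI or prefixed name to a safe column/file name (hypothetical helper)."""
    local = re.split(r"[/#:]", term.rstrip("/"))[-1]
    return re.sub(r"[^A-Za-z0-9_]", "_", local)


def build_meta_xml_sketch(core_file: str, row_type: str, terms: list[str]) -> str:
    """Return a meta.xml string describing one core CSV file.

    Assumes column 0 is the record id and that the CSV has a header row.
    """
    # Serialize the DwC text namespace as the default namespace.
    # Note: register_namespace changes module-level state in ElementTree.
    ET.register_namespace("", DWC_TEXT_NS)
    ns = f"{{{DWC_TEXT_NS}}}"

    archive = ET.Element(ns + "archive", {"metadata": "eml.xml"})
    core = ET.SubElement(
        archive,
        ns + "core",
        {
            "encoding": "UTF-8",
            "fieldsTerminatedBy": ",",
            "linesTerminatedBy": "\\n",  # written literally as \n in meta.xml
            "fieldsEnclosedBy": '"',
            "ignoreHeaderLines": "1",
            "rowType": row_type,
        },
    )
    files = ET.SubElement(core, ns + "files")
    ET.SubElement(files, ns + "location").text = core_file
    ET.SubElement(core, ns + "id", {"index": "0"})
    for index, term in enumerate(terms):
        ET.SubElement(core, ns + "field", {"index": str(index), "term": term})
    return ET.tostring(archive, encoding="unicode")


example_meta = build_meta_xml_sketch(
    "occurrence.csv",
    DWC_TERMS + "Occurrence",
    [DWC_TERMS + "occurrenceID", DWC_TERMS + "catalogNumber"],
)
```

An extension file would add an `extension` element with a `coreid` index instead of `id`, which is how extension rows link back to the core record.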

Note: Depends on the DwC schema PR being merged first.
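
For readers unfamiliar with the stream-and-batch-insert approach mentioned for the cache engine, the sketch below illustrates the general idea. It uses the standard-library sqlite3 module as a stand-in for MySQL, and the names batched and rebuild_cache_table are hypothetical; the actual cache.py works with SQLAlchemy queries and tracks status in CacheTableMeta, which this example omits.

```python
import sqlite3
from itertools import islice
from typing import Iterable, Iterator, Sequence


def batched(rows: Iterable[Sequence], size: int) -> Iterator[list]:
    """Yield lists of at most `size` rows without materializing the whole input."""
    iterator = iter(rows)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch


def rebuild_cache_table(
    conn: sqlite3.Connection,
    table: str,
    columns: Sequence[str],
    rows: Iterable[Sequence],
    batch_size: int = 1000,
) -> int:
    """Drop, recreate, and refill a cache table; return the number of rows written.

    Table and column names are interpolated into SQL, so they must already be
    sanitized (for example, reduced to letters, digits, and underscores).
    """
    quoted = ", ".join(f'"{name}"' for name in columns)
    placeholders = ", ".join("?" for _ in columns)
    conn.execute(f'DROP TABLE IF EXISTS "{table}"')
    conn.execute(f'CREATE TABLE "{table}" ({quoted})')
    insert_sql = f'INSERT INTO "{table}" ({quoted}) VALUES ({placeholders})'

    total = 0
    for batch in batched(rows, batch_size):
        conn.executemany(insert_sql, batch)
        conn.commit()  # commit per batch to keep transactions small
        total += len(batch)
    return total
```

Because rows are consumed from an iterator, memory use is bounded by the batch size rather than by the size of the query result.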

This is part of the DwC export pipeline addressing issues #7709-#7748 (40 GitHub issues for Darwin Core Archive support).
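
The path traversal protection listed under Implementation can be illustrated with a common pattern: resolve the requested name against the base directory and refuse anything that lands outside it. This is a sketch of that pattern, not the code in the PR; resolve_feed_file and its parameters are hypothetical names.

```python
from pathlib import Path
from typing import Optional


def resolve_feed_file(feed_dir: Path, requested: str) -> Optional[Path]:
    """Return the resolved file path, or None if it escapes feed_dir or does not exist."""
    base = Path(feed_dir).resolve()
    # Resolving collapses ".." segments and follows symlinks; an absolute
    # `requested` value replaces the base entirely and is caught below.
    candidate = (base / requested).resolve()
    if base not in candidate.parents:
        return None
    return candidate if candidate.is_file() else None
```

A view using this helper would return 404 whenever the result is None, so a traversal attempt is indistinguishable from a missing file.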

Testing instructions

  • Apply the DwC schema migrations first (from the schema PR)
  • Run the test suite: python manage.py test specifyweb.backend.export
  • Against a real Specify database: create a mapping via POST /export/create_mapping/, assign DwC terms via POST /export/save_mapping_fields/<id>/, then generate an archive via POST /export/generate_dwca/<id>/
  • Verify the generated ZIP contains valid meta.xml, eml.xml, and occurrence.csv
  • Test GET /export/schema_terms/ returns the full vocabulary
  • Test GET /export/validate_occurrence_ids/<id>/ catches duplicate GUIDs
  • Test GET /export/rss_feed/ returns valid RSS XML for datasets with isRss=true
  • Verify GET /export/download_feed/../etc/passwd returns 404 (path traversal protection)
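
The archive checks in the steps above can be scripted. The following is a rough sketch that inspects a downloaded archive held in memory: it reports which of the three expected members are missing and which identifiers appear more than once in occurrence.csv. The function name check_archive and the assumption that the identifier column header is occurrenceID are illustrative, not taken from the PR.

```python
import csv
import io
import zipfile
from collections import Counter

REQUIRED_MEMBERS = {"meta.xml", "eml.xml", "occurrence.csv"}


def check_archive(data: bytes, id_column: str = "occurrenceID") -> dict:
    """Report missing archive members and duplicated ids in occurrence.csv."""
    with zipfile.ZipFile(io.BytesIO(data)) as archive:
        names = set(archive.namelist())
        missing = sorted(REQUIRED_MEMBERS - names)
        duplicates: list[str] = []
        if "occurrence.csv" in names:
            with archive.open("occurrence.csv") as raw:
                text = io.TextIOWrapper(raw, encoding="utf-8", newline="")
                reader = csv.DictReader(text)
                # Rows with an empty or absent id column are skipped.
                counts = Counter(
                    row[id_column] for row in reader if row.get(id_column)
                )
            duplicates = sorted(value for value, n in counts.items() if n > 1)
    return {"missing": missing, "duplicate_ids": duplicates}
```

This only checks presence and id uniqueness; validating meta.xml and eml.xml against their schemas would need a separate step.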

@foozleface
Collaborator Author

Superseded by the 7-PR atomic stack on issue-cas-* branches:

  1. issue-cas-7746: extensions join table + vocabulary on Schemamapping (#8032)
  2. issue-cas-7737: CacheTableMeta model + cache table infrastructure (#8033)
  3. issue-cas-7714: DwC schema terms vocabulary + permissions (#8034)
  4. issue-cas-7709: Schema Mapper UI shell + Schema Config DwC section (#8035)
  5. issue-cas-7712: Clone endpoint + list APIs + Export Packages shell (#8036)
  6. issue-cas-7710: Mapping UI features (NewMappingDialog, autoMap, toolbar) (#8037)
  7. issue-cas-7733: DwC export pipeline Phases 4-6 (DwCA, EML, GBIF, RSS) (#8038)

The new stack:

@foozleface closed this on Apr 27, 2026
@github-project-automation (bot) moved this from 📋 Back Log to ✅ Done in General Tester Board on Apr 27, 2026