Problem
Data extracted into PEcAn workflows often contains errors that are only caught after model runs fail or after incorrect values have already entered BETYdb:
- Site coordinates outside valid bounds (lat > 90, lon > 180)
- Harvest dates recorded before planting dates
- Trait mean values outside physically possible ranges
- Missing required fields (mean, access_level, variable)
- access_level set to values outside the valid range of 1–4
- Inconsistent or missing units across records
There is currently no unified, programmatic validation layer in PEcAn that catches these issues before upload. Each pipeline handles or skips validation independently, leading to silent data corruption and hard-to-debug model failures downstream.
Context
This gap becomes increasingly critical as BETYdb data intake moves beyond manual web UI entry toward automated pipelines, for example LLM-assisted extraction systems that ingest PDFs and produce structured, upload-ready data.
In these workflows, a shared, authoritative validation layer is essential for:
- Catching errors at the source before they propagate
- Providing traceable, per-field error messages for human review
- Enforcing BETYdb's own model-level rules outside of Rails
As a reference point: BETYdb's site.rb, trait.rb, and cultivar.rb already encode these rules as ActiveRecord validations, but there is no equivalent for use in external R-based pipelines.
Proposed Solution
A lightweight, standalone R file, validate_bety_upload.R, providing four core validation functions:
# 1. Validate site coordinates
validate_coordinates(lat, lon)
# lat ∈ [-90, 90], lon ∈ [-180, 180]
# flags (0, 0) as a likely placeholder
# 2. Validate temporal logic
validate_temporal(planting_date, harvest_date)
# harvest must be strictly after planting
# 3. Validate trait fields
validate_trait(mean, access_level, variable_name)
# mean must be numeric and non-null
# access_level must be in 1..4
# variable_name must be non-empty
# 4. Row-level validation report
validate_experiment_row(row)
# runs all three checks above
# returns: list of pass/fail per field + specific error messages
Each function returns a structured result rather than a bare TRUE/FALSE, so callers get actionable per-field feedback instead of a binary rejection.
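To make the intended behavior concrete, here is a minimal sketch of how the four functions might look. Several details are assumptions for illustration and not part of the proposal as stated: the result shape `list(valid, errors)`, the helpers `new_result`, `is_scalar_number`, and `parse_date`, the field names read from `row`, the choice to treat (0, 0) as an error instead of a warning, and the expectation that dates arrive as ISO 8601 strings or `Date` objects.

```r
# Sketch only: list(valid, errors) is an assumed result shape, not an agreed API.
new_result <- function(errors = character(0)) {
  list(valid = length(errors) == 0, errors = errors)
}

# TRUE only for a single non-missing numeric value (NULL and NA both fail).
is_scalar_number <- function(x) is.numeric(x) && length(x) == 1 && !is.na(x)

# Parse ISO 8601 strings or Date objects; anything unparseable becomes NA.
parse_date <- function(x) {
  d <- tryCatch(as.Date(x), error = function(e) as.Date(NA))
  if (length(d) == 1) d else as.Date(NA)
}

validate_coordinates <- function(lat, lon) {
  errors <- character(0)
  if (!is_scalar_number(lat) || lat < -90 || lat > 90) {
    errors["lat"] <- "lat must be a number in [-90, 90]"
  }
  if (!is_scalar_number(lon) || lon < -180 || lon > 180) {
    errors["lon"] <- "lon must be a number in [-180, 180]"
  }
  # Assumption: (0, 0) is reported as an error, not a softer warning.
  if (length(errors) == 0 && lat == 0 && lon == 0) {
    errors["lat_lon"] <- "(0, 0) looks like a placeholder coordinate"
  }
  new_result(errors)
}

validate_temporal <- function(planting_date, harvest_date) {
  errors <- character(0)
  planting <- parse_date(planting_date)
  harvest <- parse_date(harvest_date)
  if (is.na(planting)) errors["planting_date"] <- "planting_date is missing or not a date"
  if (is.na(harvest)) errors["harvest_date"] <- "harvest_date is missing or not a date"
  if (length(errors) == 0 && harvest <= planting) {
    errors["harvest_date"] <- "harvest_date must be strictly after planting_date"
  }
  new_result(errors)
}

validate_trait <- function(mean, access_level, variable_name) {
  errors <- character(0)
  if (!is_scalar_number(mean)) {
    errors["mean"] <- "mean must be numeric and non-null"
  }
  if (!is_scalar_number(access_level) || !(access_level %in% 1:4)) {
    errors["access_level"] <- "access_level must be one of 1, 2, 3, 4"
  }
  if (!is.character(variable_name) || length(variable_name) != 1 ||
      is.na(variable_name) || !nzchar(trimws(variable_name))) {
    errors["variable_name"] <- "variable_name must be a non-empty string"
  }
  new_result(errors)
}

# `row` is assumed to be a named list or one-row data frame with these fields.
validate_experiment_row <- function(row) {
  checks <- list(
    coordinates = validate_coordinates(row$lat, row$lon),
    temporal = validate_temporal(row$planting_date, row$harvest_date),
    trait = validate_trait(row$mean, row$access_level, row$variable_name)
  )
  list(
    valid = all(vapply(checks, function(r) r$valid, logical(1))),
    checks = checks,
    errors = unlist(lapply(checks, function(r) r$errors))
  )
}

# Example: one row with three independent problems.
row <- list(lat = 91, lon = 10,
            planting_date = "2023-05-01", harvest_date = "2023-04-01",
            mean = 12.3, access_level = 5, variable_name = "SLA")
report <- validate_experiment_row(row)
print(report$errors)
```

In this sketch each error is keyed by the check and the field that failed (for example `coordinates.lat`), which is one way to provide the traceable, per-field messages described under Context.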
Deliverable
An initial PR with:
- validate_bety_upload.R: the four functions above, documented with roxygen2
- tests/testthat/test_validate_bety_upload.R: full test coverage
- A short README section documenting usage with examples
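As an illustration of what the test file might contain, the following testthat sketch checks one passing and one failing case. The `validate_coordinates` defined here is a simplified stand-in so the snippet runs on its own; in the actual PR the function would come from validate_bety_upload.R, and the `list(valid, errors)` result shape is an assumption.

```r
library(testthat)

# Stand-in for illustration only; the real function would live in
# validate_bety_upload.R. The list(valid, errors) shape is assumed.
validate_coordinates <- function(lat, lon) {
  errors <- character(0)
  if (!is.numeric(lat) || is.na(lat) || abs(lat) > 90) {
    errors["lat"] <- "lat must be in [-90, 90]"
  }
  if (!is.numeric(lon) || is.na(lon) || abs(lon) > 180) {
    errors["lon"] <- "lon must be in [-180, 180]"
  }
  list(valid = length(errors) == 0, errors = errors)
}

test_that("in-range coordinates pass", {
  result <- validate_coordinates(40.1, -88.2)
  expect_true(result$valid)
  expect_length(result$errors, 0)
})

test_that("out-of-range latitude is reported for the lat field only", {
  result <- validate_coordinates(91, 10)
  expect_false(result$valid)
  expect_named(result$errors, "lat")
})
```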
I am ready to submit this PR immediately and can extend the scope based on feedback.