Implementing product thinking for satellite imagery datasets
Transitioning from centralized raster warehouses to domain-scoped data products shifts the primary failure vector from storage capacity to metadata contract enforcement. This guide isolates a single, high-impact implementation step: configuring a versioned, domain-bound SpatioTemporal Asset Catalog (STAC) ingestion pipeline that treats multispectral satellite imagery as a consumable product rather than a raw archive. The workflow operationalizes Product Thinking for GIS Datasets by enforcing strict spatial scoping rules, automated schema validation, and deterministic rollback paths.
Architectural Baseline: Domain-Scoped Ingestion
In legacy monolithic systems, raster data is ingested into shared buckets with implicit coordinate reference system (CRS) assumptions and deferred cataloging. The Geospatial Data Mesh Fundamentals paradigm inverts this by treating each spatial domain as an autonomous product boundary. Data Mesh vs Traditional GIS Architecture distinctions manifest in explicit contract enforcement at the ingestion edge rather than post-hoc ETL reconciliation.
Spatial Domain Boundary Design dictates that every tile must be validated against a deterministic polygon envelope before entering the product catalog. This prevents silent CRS drift and ensures downstream consumers receive spatially consistent assets. The pipeline must apply Scoping Rules for Spatial Products at the ingestion layer, rejecting tiles that fall outside the declared boundary_wkt or lack mandatory projection extensions.
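The scoping rule above can be sketched as a pre-flight check. This is a minimal, standard-library-only illustration that assumes the declared boundary is a single-ring polygon and compares only axis-aligned envelopes; the function names `parse_polygon_envelope` and `tile_within_boundary` are hypothetical, and a geometry library such as shapely would be the more likely choice for non-rectangular domains.

```python
import re


def parse_polygon_envelope(wkt: str) -> tuple:
    """Return (minx, miny, maxx, maxy) for a single-ring WKT POLYGON."""
    match = re.fullmatch(r"\s*POLYGON\s*\(\(\s*(.+?)\s*\)\)\s*", wkt, flags=re.IGNORECASE)
    if match is None:
        raise ValueError("this sketch supports single-ring POLYGON WKT only")
    # Each vertex is "x y"; vertices are separated by commas.
    points = [tuple(float(v) for v in pair.split()) for pair in match.group(1).split(",")]
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    return min(xs), min(ys), max(xs), max(ys)


def tile_within_boundary(tile_bbox: tuple, boundary_wkt: str) -> bool:
    """True when the tile bbox lies entirely inside the boundary envelope."""
    minx, miny, maxx, maxy = parse_polygon_envelope(boundary_wkt)
    t_minx, t_miny, t_maxx, t_maxy = tile_bbox
    return minx <= t_minx and miny <= t_miny and t_maxx <= maxx and t_maxy <= maxy


BOUNDARY_WKT = "POLYGON((-122.5 37.7, -122.3 37.7, -122.3 37.9, -122.5 37.9, -122.5 37.7))"
```

A tile whose bbox is `(-122.45, 37.75, -122.40, 37.80)` passes this check, while one extending west of -122.5 is rejected before it reaches the catalog.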
Declarative Pipeline Configuration
The tactical objective is to deploy a declarative configuration that binds Sentinel-2 L2A tiles to a specific spatial domain while publishing them as versioned data products. This requires coupling rio-tiler tiling parameters with pystac validation hooks.
# product-config.yaml
spatial_domain:
  crs: "EPSG:4326"
  boundary_wkt: "POLYGON((-122.5 37.7, -122.3 37.7, -122.3 37.9, -122.5 37.9, -122.5 37.7))"
product_metadata:
  product_id: "sentinel2-msi-l2a-v2.1"
  version_strategy: "semantic"
  lifecycle_stage: "production"
ingestion:
  tile_size: 256
  resampling: "bilinear"
  output_format: "cloud-optimized-geotiff"
  checksum_algorithm: "sha256"
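Before the pipeline acts on this configuration, it helps to reject malformed files early. The sketch below assumes the YAML has already been parsed into a plain dictionary (for example with a YAML parser's safe loader); `validate_product_config` and `REQUIRED_KEYS` are hypothetical names, and the specific value checks are illustrative rather than a complete contract.

```python
import hashlib
import re

# Sections and keys taken from product-config.yaml above.
REQUIRED_KEYS = {
    "spatial_domain": {"crs", "boundary_wkt"},
    "product_metadata": {"product_id", "version_strategy", "lifecycle_stage"},
    "ingestion": {"tile_size", "resampling", "output_format", "checksum_algorithm"},
}


def validate_product_config(config: dict) -> list:
    """Return a list of human-readable problems; an empty list means the config passed."""
    errors = []
    for section, keys in REQUIRED_KEYS.items():
        block = config.get(section)
        if not isinstance(block, dict):
            errors.append(f"missing section: {section}")
            continue
        for key in sorted(keys - set(block)):
            errors.append(f"missing key: {section}.{key}")
    if errors:
        return errors  # structural problems first; value checks need all keys present

    if not re.fullmatch(r"EPSG:\d+", str(config["spatial_domain"]["crs"])):
        errors.append("spatial_domain.crs must look like 'EPSG:<code>'")
    tile_size = config["ingestion"]["tile_size"]
    if isinstance(tile_size, bool) or not isinstance(tile_size, int) or tile_size <= 0:
        errors.append("ingestion.tile_size must be a positive integer")
    if config["ingestion"]["checksum_algorithm"] not in hashlib.algorithms_guaranteed:
        errors.append("ingestion.checksum_algorithm is not supported by hashlib")
    return errors


EXAMPLE_CONFIG = {
    "spatial_domain": {
        "crs": "EPSG:4326",
        "boundary_wkt": "POLYGON((-122.5 37.7, -122.3 37.7, -122.3 37.9, -122.5 37.9, -122.5 37.7))",
    },
    "product_metadata": {
        "product_id": "sentinel2-msi-l2a-v2.1",
        "version_strategy": "semantic",
        "lifecycle_stage": "production",
    },
    "ingestion": {
        "tile_size": 256,
        "resampling": "bilinear",
        "output_format": "cloud-optimized-geotiff",
        "checksum_algorithm": "sha256",
    },
}
```

Returning a list of problems instead of raising on the first one lets a dry run report every contract violation in a single pass.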
Idempotent Execution & Atomic Commit
Idempotency is enforced through pre-flight validation, content-addressable storage, and atomic staging. The pipeline must never mutate existing catalog entries without explicit version increments.
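As a rough illustration of content-addressable storage combined with atomic staging, the sketch below uses a local directory as a stand-in for the staging bucket. The function name `stage_tile` and the `.tif`/`.part` naming are assumptions made for this example; an object store would need its own equivalent of the atomic rename.

```python
import hashlib
import os
import tempfile
from pathlib import Path


def stage_tile(tile_bytes: bytes, store_dir: Path) -> tuple:
    """Write tile bytes under a name derived from their SHA-256 digest.

    Returns (path, created). Re-staging identical content is a no-op,
    which is what makes repeated pipeline runs idempotent.
    """
    digest = hashlib.sha256(tile_bytes).hexdigest()
    target = store_dir / f"{digest}.tif"
    if target.exists():
        return target, False  # same content already staged

    store_dir.mkdir(parents=True, exist_ok=True)
    # Write to a temporary file in the same directory, then rename it into place,
    # so readers never observe a partially written tile.
    fd, tmp_name = tempfile.mkstemp(dir=store_dir, suffix=".part")
    try:
        with os.fdopen(fd, "wb") as handle:
            handle.write(tile_bytes)
        os.replace(tmp_name, target)  # atomic when source and target share a filesystem
    except BaseException:
        if os.path.exists(tmp_name):
            os.unlink(tmp_name)
        raise
    return target, True
```

Because the file name is the digest, a changed tile can never overwrite an existing entry; it simply lands under a new key.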
The recommended approach uses pystac for item creation and validation, and rio-cogeo (`rio cogeo validate`) for COG validation. The following pattern shows how to validate and ingest a tile with explicit dry-run logic:
import hashlib
from pathlib import Path

import pystac
from pystac.validation import validate_dict
from shapely.geometry import box, mapping

BOUNDARY = box(-122.5, 37.7, -122.3, 37.9)
PRODUCT_ID = "sentinel2-msi-l2a-v2.1"


def build_stac_item(tile_path: Path, dry_run: bool = True) -> pystac.Item:
    """Build and validate a STAC item for a COG tile; assign its catalog href unless dry_run."""
    tile_hash = hashlib.sha256(tile_path.read_bytes()).hexdigest()
    item = pystac.Item(
        id=f"{PRODUCT_ID}-{tile_hash[:8]}",
        # Simplification: the domain envelope stands in for the tile footprint.
        geometry=mapping(BOUNDARY),
        bbox=list(BOUNDARY.bounds),
        datetime=None,  # a date range is supplied via start/end_datetime instead
        properties={
            "start_datetime": "2024-01-01T00:00:00Z",
            "end_datetime": "2024-03-31T23:59:59Z",
            "platform": "sentinel-2",
            "instruments": ["msi"],
            # Custom property name, not a registered STAC extension field.
            "checksum:sha256": tile_hash,
        },
    )
    item.add_asset(
        "data",
        pystac.Asset(
            href=str(tile_path),
            media_type=pystac.MediaType.COG,
            roles=["data"],
        ),
    )
    # Validate against the STAC spec before committing (requires jsonschema).
    validate_dict(item.to_dict())
    if dry_run:
        print(f"DRY RUN: Item {item.id} validated successfully")
        return item
    # Commit step, reached only after validation passes. normalize_hrefs() is a
    # Catalog method, so a standalone item gets its self href set directly.
    # Persisting to S3 additionally needs item.save_object() with an
    # S3-capable StacIO, which is not shown here.
    item.set_self_href(f"s3://geo-mesh-prod/sentinel2/{PRODUCT_ID}/{item.id}.json")
    print(f"Committed: {item.id}")
    return item
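Run with dry_run=True first to exercise schema validation without writing to the target bucket. Note that build_stac_item validates the item schema only; the spatial envelope intersection check against boundary_wkt has to run as a separate pre-flight step. Set dry_run=False only after both checks pass.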
Diagnostic Workflow: Metadata Contract Drift
When pystac throws STACValidationError: Missing required field 'proj:geometry', the root cause typically traces to upstream gdalwarp operations stripping projection metadata during domain clipping. Traditional architectures mask this by relying on implicit CRS assumptions; domain-driven ingestion requires explicit contract validation. Metadata Cataloging for Raster/Vector workflows must intercept this drift before catalog publication.
Execute the following diagnostic sequence to isolate the drift:
# 1. Inspect raw tile metadata for projection extensions
rio info s3://raw-bucket/tile_123.tif | jq '.crs'
# 2. Validate STAC item schema against strict spec
#    (open() cannot read s3:// URLs, so stream the item through the AWS CLI; assumes it is installed)
aws s3 cp s3://geo-mesh-prod/sentinel2/items/tile_123.json - | python3 -c "
import json, sys
import pystac
pystac.read_dict(json.load(sys.stdin)).validate()
print('VALID')
"
# 3. Trace clipping command history for missing creation options
grep "gdalwarp" /var/log/ingest/pipeline.log | tail -n 5
If crs returns null, the clipped output was written without an embedded CRS. Remediate by making the target CRS and the creation options explicit in the warp call. Refer to the official GDAL Warp Documentation for creation option specifications:
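# GDAL reads S3 objects through the /vsis3/ virtual filesystem. The output goes to a
# local file because a /vsimem/ target would not outlive the gdalwarp process.
gdalwarp \
  -of GTiff \
  -t_srs EPSG:4326 \
  -te -122.5 37.7 -122.3 37.9 \
  -co "TILED=YES" \
  -co "COMPRESS=DEFLATE" \
  -co "BIGTIFF=IF_SAFER" \
  /vsis3/raw-bucket/tile_123.tif tile_123_clipped.tif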
Creation options such as -co "PROFILE=GeoTIFF" are not what embeds the CRS in GDAL; what matters is that -t_srs is set and the output driver is GeoTIFF. Re-run validation after the warp. If the proj:geometry field of the STAC projection extension is still missing, populate it explicitly using pyproj and pystac's ProjectionExtension (pystac.extensions.projection) before catalog submission.
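To make the last step concrete, here is a minimal sketch that fills in projection fields on a plain item dictionary, without depending on pystac. It assumes the projection extension v1.1.0 field names (`proj:epsg`, `proj:geometry`); v2.0.0 of the extension replaces `proj:epsg` with `proj:code`, so adjust to whichever version the catalog declares. The helper name `add_projection_fields` is hypothetical, and the native-CRS footprint is assumed to have been computed already (for an EPSG:4326 product it coincides with the item geometry).

```python
import copy

# Schema URL for the projection extension version assumed in this sketch.
PROJECTION_SCHEMA = "https://stac-extensions.github.io/projection/v1.1.0/schema.json"


def add_projection_fields(item: dict, epsg: int, native_geometry: dict) -> dict:
    """Return a copy of a STAC item dict with projection extension fields set."""
    updated = copy.deepcopy(item)  # never mutate the caller's item in place
    properties = updated.setdefault("properties", {})
    properties["proj:epsg"] = epsg
    properties["proj:geometry"] = native_geometry  # footprint in the asset's native CRS
    extensions = updated.setdefault("stac_extensions", [])
    if PROJECTION_SCHEMA not in extensions:
        extensions.append(PROJECTION_SCHEMA)  # declare the extension exactly once
    return updated
```

Working on a copy keeps the step safe to repeat: applying it twice yields the same item, and the extension URL is never duplicated.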
Spatial Product Versioning & Lifecycle Enforcement
Spatial Product Lifecycle Management dictates that assets transition through staging → validation → production states. Promotion to production requires:
- Successful pystac item validation against the STAC specification and any declared extensions
- SHA-256 checksum match between source tile and staged COG
- Explicit domain owner sign-off via governance webhook
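The three promotion requirements can be expressed as a small gate. The sketch below is illustrative only: `PromotionEvidence`, `promotion_blockers`, and `promote` are hypothetical names, and how each piece of evidence is collected (validation run, checksum comparison, webhook response) is left outside the example.

```python
from dataclasses import dataclass

STAGES = ("staging", "validation", "production")


@dataclass(frozen=True)
class PromotionEvidence:
    stac_valid: bool        # item validated against the spec and declared extensions
    source_sha256: str      # checksum recorded for the source tile
    staged_sha256: str      # checksum recorded for the staged COG
    owner_signed_off: bool  # result of the governance webhook


def promotion_blockers(evidence: PromotionEvidence) -> list:
    """List every unmet requirement for promotion to production."""
    blockers = []
    if not evidence.stac_valid:
        blockers.append("STAC validation failed")
    if evidence.source_sha256 != evidence.staged_sha256:
        blockers.append("checksum mismatch")
    if not evidence.owner_signed_off:
        blockers.append("missing domain owner sign-off")
    return blockers


def promote(stage: str, evidence: PromotionEvidence) -> str:
    """Advance one lifecycle stage; entering production requires a clean gate."""
    if stage not in STAGES:
        raise ValueError(f"unknown stage: {stage}")
    if stage == "production":
        raise ValueError("already in production")
    next_stage = STAGES[STAGES.index(stage) + 1]
    if next_stage == "production":
        blockers = promotion_blockers(evidence)
        if blockers:
            raise RuntimeError("promotion blocked: " + "; ".join(blockers))
    return next_stage
```

Collecting all blockers before raising gives the domain owner a complete picture of what still has to be fixed.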
Semantic versioning (MAJOR.MINOR.PATCH) governs deterministic consumer upgrades. Schema-breaking changes (e.g., CRS migration) require a major increment. Minor increments denote additive metadata or processing improvements. Patch increments cover reprocessing of corrupted tiles without altering the spatial contract.
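These rules map directly onto a version bump helper. The sketch below assumes a full three-part version string and uses hypothetical change labels (`"schema_breaking"`, `"additive"`, `"reprocess"`) chosen for this example.

```python
def bump_version(version: str, change: str) -> str:
    """Return the next MAJOR.MINOR.PATCH version for a given kind of change."""
    major, minor, patch = (int(part) for part in version.split("."))
    if change == "schema_breaking":  # e.g. CRS migration
        return f"{major + 1}.0.0"
    if change == "additive":         # new metadata or processing improvements
        return f"{major}.{minor + 1}.0"
    if change == "reprocess":        # corrupted tiles rebuilt, contract unchanged
        return f"{major}.{minor}.{patch + 1}"
    raise ValueError(f"unknown change type: {change}")
```

For example, `bump_version("2.1.0", "schema_breaking")` yields `"3.0.0"`, while `bump_version("2.1.0", "reprocess")` yields `"2.1.1"`.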
The STAC specification mandates strict adherence to extension schemas. Consult the official SpatioTemporal Asset Catalog Specification for required fields and extension compliance matrices.
Cross-Team Governance & Escalation Paths
Automated routing of validation failures to the appropriate ownership tier ensures consistent incident response:
| Severity | Trigger Pattern | Automated Action | Escalation Path |
|---|---|---|---|
| P3 | STACValidationError: Missing optional field | Log warning, proceed | Domain Data Steward (Slack #geo-catalog) |
| P2 | Null CRS or mismatched projection | Halt pipeline, quarantine tile in s3://geo-mesh-quarantine/ | Platform Engineer + GIS Steward (Jira GEO-INGEST) |
| P1 | boundary_wkt intersection failure or checksum mismatch | Rollback staging, invalidate cache, trigger re-clipping | Architecture Review Board + Incident Commander |
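One way to automate this routing is a first-match lookup over the trigger patterns. The sketch below is an assumption-laden illustration: the regular expressions approximate the trigger wording in the table, and `Route`, `ROUTES`, and `route_failure` are hypothetical names. The most severe patterns are checked first so that a message matching several rows escalates to the highest tier.

```python
import re
from typing import NamedTuple, Optional


class Route(NamedTuple):
    severity: str
    action: str
    escalation: str


# Ordered from most to least severe; the first matching pattern wins.
ROUTES = [
    (
        re.compile(r"boundary_wkt intersection failure|checksum mismatch", re.IGNORECASE),
        Route("P1", "Rollback staging, invalidate cache, trigger re-clipping",
              "Architecture Review Board + Incident Commander"),
    ),
    (
        re.compile(r"null crs|mismatched projection", re.IGNORECASE),
        Route("P2", "Halt pipeline, quarantine tile",
              "Platform Engineer + GIS Steward (Jira GEO-INGEST)"),
    ),
    (
        re.compile(r"missing optional field", re.IGNORECASE),
        Route("P3", "Log warning, proceed",
              "Domain Data Steward (Slack #geo-catalog)"),
    ),
]


def route_failure(message: str) -> Optional[Route]:
    """Return the route for a validation failure message, or None if unrecognized."""
    for pattern, route in ROUTES:
        if pattern.search(message):
            return route
    return None
```

Unrecognized messages return `None` here; a real pipeline would likely want an explicit default tier for them.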
Idempotent rollback removes the failed version from the catalog index and restores the previous manifest:
import pystac

# Load the catalog and remove the failed item version.
# Reading and writing s3:// hrefs requires an S3-capable StacIO to be registered.
catalog = pystac.Catalog.from_file("s3://geo-mesh-prod/sentinel2/catalog.json")
failed_item_id = "sentinel2-msi-l2a-v2.1-abc12345"
item = catalog.get_item(failed_item_id, recursive=True)
if item:
    item.get_parent().remove_item(failed_item_id)
    catalog.normalize_and_save(
        root_href="s3://geo-mesh-prod/sentinel2/",
        catalog_type=pystac.CatalogType.SELF_CONTAINED,
    )
    # Report completion only when an item was actually removed; if the item is
    # already absent, nothing happens, so re-running the rollback is safe.
    print(f"PRODUCT_ROLLBACK_COMPLETE: {failed_item_id} removed from catalog")
All pipeline executions must be logged with structured JSON output for audit trails and drift analysis.
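A minimal way to get structured JSON logs with the standard library is a custom formatter. In this sketch the field names (`ts`, `level`, `event`) and the convention of passing extra fields under a `context` key are assumptions for illustration, not an established schema.

```python
import json
import logging
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object per line."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "ts": datetime.fromtimestamp(record.created, tz=timezone.utc).isoformat(),
            "level": record.levelname,
            "event": record.getMessage(),
        }
        # Fields passed via extra={"context": {...}} are merged into the payload.
        payload.update(getattr(record, "context", {}))
        return json.dumps(payload, sort_keys=True)


if __name__ == "__main__":
    logger = logging.getLogger("ingest")
    handler = logging.StreamHandler()
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    logger.info(
        "tile_committed",
        extra={"context": {"item_id": "sentinel2-msi-l2a-v2.1-abc12345", "dry_run": False}},
    )
```

One JSON object per line keeps the audit trail easy to filter with tools such as jq and to load into a drift analysis job.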