Implementing product thinking for satellite imagery datasets

Transitioning from centralized raster warehouses to domain-scoped data products shifts the primary failure vector from storage capacity to metadata contract enforcement. This guide isolates a single, high-impact implementation step: configuring a versioned, domain-bound SpatioTemporal Asset Catalog (STAC) ingestion pipeline that treats multispectral satellite imagery as a consumable product rather than a raw archive. The workflow operationalizes Product Thinking for GIS Datasets by enforcing strict spatial scoping rules, automated schema validation, and deterministic rollback paths.

Architectural Baseline: Domain-Scoped Ingestion

In legacy monolithic systems, raster data is ingested into shared buckets with implicit coordinate reference system (CRS) assumptions and deferred cataloging. The Geospatial Data Mesh Fundamentals paradigm inverts this by treating each spatial domain as an autonomous product boundary. Data Mesh vs Traditional GIS Architecture distinctions manifest in explicit contract enforcement at the ingestion edge rather than post-hoc ETL reconciliation.

Spatial Domain Boundary Design dictates that every tile must be validated against a deterministic polygon envelope before entering the product catalog. This prevents silent CRS drift and ensures downstream consumers receive spatially consistent assets. The pipeline must apply Scoping Rules for Spatial Products at the ingestion layer, rejecting tiles that fall outside the declared boundary_wkt or lack mandatory projection extensions.
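
A minimal sketch of the scoping rule above, assuming the domain boundary is an axis-aligned rectangle (as in the envelope used later in this guide), so a bounding-box comparison is enough. The names `BBox`, `DOMAIN`, and `check_scope` are hypothetical and not part of any library; a non-rectangular boundary_wkt would need a real geometry library such as shapely.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class BBox:
    """Axis-aligned bounding box in the domain CRS (EPSG:4326 degrees here)."""
    min_x: float
    min_y: float
    max_x: float
    max_y: float

    def contains(self, other: "BBox") -> bool:
        # True when `other` lies entirely inside (or on the edge of) this box.
        return (
            self.min_x <= other.min_x
            and self.min_y <= other.min_y
            and self.max_x >= other.max_x
            and self.max_y >= other.max_y
        )


# Assumption: envelope of the rectangular boundary_wkt used in this guide.
DOMAIN = BBox(-122.5, 37.7, -122.3, 37.9)


def check_scope(
    tile_bbox: BBox, tile_crs: Optional[str], domain_crs: str = "EPSG:4326"
) -> List[str]:
    """Return scoping violations; an empty list means the tile is accepted."""
    violations = []
    if tile_crs is None:
        violations.append("missing CRS")
    elif tile_crs != domain_crs:
        violations.append(f"CRS mismatch: {tile_crs} != {domain_crs}")
    if not DOMAIN.contains(tile_bbox):
        violations.append("tile extends outside domain boundary")
    return violations
```

Returning a list of violations rather than raising on the first one lets the ingestion edge report every contract breach for a tile in a single pass.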

Declarative Pipeline Configuration

The tactical objective is to deploy a declarative configuration that binds Sentinel-2 L2A tiles to a specific spatial domain while publishing them as versioned data products. This requires coupling rio-tiler tiling parameters with pystac validation hooks.

yaml
# product-config.yaml
spatial_domain:
  crs: "EPSG:4326"
  boundary_wkt: "POLYGON((-122.5 37.7, -122.3 37.7, -122.3 37.9, -122.5 37.9, -122.5 37.7))"
product_metadata:
  product_id: "sentinel2-msi-l2a-v2.1"
  version_strategy: "semantic"
  lifecycle_stage: "production"
ingestion:
  tile_size: 256
  resampling: "bilinear"
  output_format: "cloud-optimized-geotiff"
  checksum_algorithm: "sha256"
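
Before the pipeline acts on this file, it may help to check the configuration itself. The sketch below is one possible pre-flight check, assuming the YAML has already been loaded into a plain dict (for example with a YAML parser of your choice). `REQUIRED`, `parse_ring`, and `validate_config` are hypothetical names, and the rules shown (closed ring, `EPSG:<code>` form, known hash algorithm) are illustrative rather than exhaustive.

```python
import hashlib
import re
from typing import List, Tuple

# Keys each section of product-config.yaml is expected to carry.
REQUIRED = {
    "spatial_domain": ("crs", "boundary_wkt"),
    "product_metadata": ("product_id", "version_strategy", "lifecycle_stage"),
    "ingestion": ("tile_size", "resampling", "output_format", "checksum_algorithm"),
}


def parse_ring(wkt: str) -> List[Tuple[float, ...]]:
    """Extract the exterior ring of a simple 'POLYGON((x y, ...))' string."""
    match = re.fullmatch(r"\s*POLYGON\s*\(\(\s*(.+?)\s*\)\)\s*", wkt)
    if match is None:
        raise ValueError("boundary_wkt is not a single-ring POLYGON")
    return [tuple(float(v) for v in pair.split()) for pair in match.group(1).split(",")]


def validate_config(config: dict) -> List[str]:
    """Return a list of configuration errors; empty means the config is usable."""
    errors = [
        f"missing {section}.{key}"
        for section, keys in REQUIRED.items()
        for key in keys
        if key not in config.get(section, {})
    ]
    if errors:
        return errors
    domain = config["spatial_domain"]
    if not re.fullmatch(r"EPSG:\d+", domain["crs"]):
        errors.append("spatial_domain.crs must look like 'EPSG:<code>'")
    try:
        ring = parse_ring(domain["boundary_wkt"])
        if len(ring) < 4 or ring[0] != ring[-1]:
            errors.append("boundary_wkt ring must be closed with at least 4 points")
    except ValueError as exc:
        errors.append(str(exc))
    if config["ingestion"]["checksum_algorithm"] not in hashlib.algorithms_available:
        errors.append("unsupported checksum_algorithm")
    return errors


# The configuration above, expressed as the dict a YAML loader would produce.
CONFIG = {
    "spatial_domain": {
        "crs": "EPSG:4326",
        "boundary_wkt": "POLYGON((-122.5 37.7, -122.3 37.7, -122.3 37.9, -122.5 37.9, -122.5 37.7))",
    },
    "product_metadata": {
        "product_id": "sentinel2-msi-l2a-v2.1",
        "version_strategy": "semantic",
        "lifecycle_stage": "production",
    },
    "ingestion": {
        "tile_size": 256,
        "resampling": "bilinear",
        "output_format": "cloud-optimized-geotiff",
        "checksum_algorithm": "sha256",
    },
}
```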

Idempotent Execution & Atomic Commit

Idempotency is enforced through pre-flight validation, content-addressable storage, and atomic staging. The pipeline must never mutate existing catalog entries without explicit version increments.
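
The content-addressable and atomic-staging ideas can be illustrated with a small sketch. It assumes a local directory standing in for the staging bucket (object stores such as S3 have different write semantics), and `stage_tile` is a hypothetical helper name. The file is stored under its SHA-256 digest, so re-ingesting identical bytes is a no-op, and the write goes through a temporary file that is renamed into place.

```python
import hashlib
import os
import shutil
import tempfile
from pathlib import Path
from typing import Tuple


def stage_tile(tile_path: Path, store_dir: Path) -> Tuple[Path, bool]:
    """Copy a tile into a content-addressed store.

    Returns (stored_path, was_written). was_written is False when identical
    content was already staged, which is what makes re-runs idempotent.
    """
    digest = hashlib.sha256(tile_path.read_bytes()).hexdigest()
    target = store_dir / digest[:2] / f"{digest}{tile_path.suffix}"
    if target.exists():
        return target, False

    target.parent.mkdir(parents=True, exist_ok=True)
    # Write to a temp file in the same directory, then rename it into place,
    # so readers never observe a partially written tile.
    fd, tmp_name = tempfile.mkstemp(dir=target.parent, suffix=".partial")
    os.close(fd)
    try:
        shutil.copyfile(tile_path, tmp_name)
        os.replace(tmp_name, target)
    except BaseException:
        if os.path.exists(tmp_name):
            os.unlink(tmp_name)
        raise
    return target, True
```

Because the storage key is derived from the bytes, an existing entry is never overwritten with different content; changed content always lands under a new key.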

The recommended approach uses pystac for item creation and validation, and rio-cogeo for COG validation. The following pattern shows how to build and validate a STAC item for a tile with explicit dry-run logic:

python
import hashlib
from pathlib import Path

import pystac
from pystac.validation import validate_dict
from shapely.geometry import box, mapping

BOUNDARY = box(-122.5, 37.7, -122.3, 37.9)
PRODUCT_ID = "sentinel2-msi-l2a-v2.1"


def build_stac_item(tile_path: Path, dry_run: bool = True) -> pystac.Item:
    """Build and optionally publish a STAC item for a COG tile."""
    tile_hash = hashlib.sha256(tile_path.read_bytes()).hexdigest()

    item = pystac.Item(
        id=f"{PRODUCT_ID}-{tile_hash[:8]}",
        geometry=mapping(BOUNDARY),
        bbox=list(BOUNDARY.bounds),
        datetime=None,
        properties={
            "start_datetime": "2024-01-01T00:00:00Z",
            "end_datetime": "2024-03-31T23:59:59Z",
            "platform": "sentinel-2",
            "instruments": ["msi"],
            "checksum:sha256": tile_hash,
        },
    )
    item.add_asset(
        "data",
        pystac.Asset(
            href=str(tile_path),
            media_type=pystac.MediaType.COG,
            roles=["data"],
        ),
    )

    # Validate against STAC spec before committing
    validate_dict(item.to_dict())

    if dry_run:
        print(f"DRY RUN: Item {item.id} validated successfully")
        return item

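    # Commit step: assign the final catalog location only after validation passes.
    # pystac.Item has no normalize_hrefs() (that is a Catalog method), so set the
    # self href explicitly. Persisting it with item.save_object() additionally
    # requires an S3-capable StacIO implementation.
    item.set_self_href(f"s3://geo-mesh-prod/sentinel2/{PRODUCT_ID}/{item.id}.json")
    print(f"Staged for commit: {item.id}")
    return item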

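Run with dry_run=True first to trigger schema validation without touching the target bucket. Note that build_stac_item only validates the schema; the spatial envelope check against boundary_wkt has to run as a separate pre-flight step. Set dry_run=False only after the dry run passes.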

Diagnostic Workflow: Metadata Contract Drift

When pystac throws STACValidationError: Missing required field 'proj:geometry', the root cause typically traces to upstream gdalwarp operations stripping projection metadata during domain clipping. Traditional architectures mask this by relying on implicit CRS assumptions; domain-driven ingestion requires explicit contract validation. Metadata Cataloging for Raster/Vector workflows must intercept this drift before catalog publication.

Execute the following diagnostic sequence to isolate the drift:

bash
# 1. Inspect the raw tile for an embedded CRS
rio info s3://raw-bucket/tile_123.tif | jq '.crs'

# 2. Validate the STAC item against the core spec and its declared extensions
#    (open() cannot read s3:// URLs, so stream the object through the AWS CLI)
aws s3 cp s3://geo-mesh-prod/sentinel2/items/tile_123.json - | python3 -c "
import json, sys
import pystac
pystac.read_dict(json.load(sys.stdin)).validate()
print('VALID')
"

# 3. Trace clipping command history for missing creation options
grep "gdalwarp" /var/log/ingest/pipeline.log | tail -n 5

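If crs returns null, the tile carries no embedded CRS, typically because an earlier clipping step dropped it. Remediate by setting the target CRS explicitly in the warp call (add -s_srs as well if the source itself has no CRS). The GDAL command-line tools read S3 objects through the /vsis3/ prefix rather than s3:// URLs, and a /vsimem/ output does not survive the process, so write to a real path. Refer to the official GDAL Warp Documentation for creation option specifications:

bash
gdalwarp \
  -t_srs EPSG:4326 \
  -te -122.5 37.7 -122.3 37.9 \
  -co "TILED=YES" \
  -co "COMPRESS=DEFLATE" \
  -co "BIGTIFF=IF_SAFER" \
  /vsis3/raw-bucket/tile_123.tif tile_123_clipped.tif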

Adding -co "PROFILE=GeoTIFF" is not needed to embed the CRS: the GTiff driver writes georeferencing tags by default, and only PROFILE=BASELINE suppresses them. What matters is that -t_srs is set and the output driver is GeoTIFF. Re-run validation after the warp. If the proj:geometry STAC extension field is still missing, populate it explicitly using pyproj and pystac's projection extension (pystac.extensions.projection) before catalog submission.
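
As a rough illustration of populating the projection fields by hand, the sketch below works on a plain STAC item dict. It assumes the asset's native CRS is EPSG:4326, so the native-CRS footprint equals the item's GeoJSON geometry; for any other CRS the geometry and bbox would first have to be reprojected (for example with pyproj). It also assumes version 1.1.0 of the projection extension, which uses `proj:epsg`. `add_projection_fields` is a hypothetical helper; with pystac itself, `ProjectionExtension.ext(item, add_if_missing=True)` is the usual entry point.

```python
import copy

# Assumption: projection extension v1.1.0 (newer versions use proj:code instead).
PROJ_SCHEMA = "https://stac-extensions.github.io/projection/v1.1.0/schema.json"


def add_projection_fields(item: dict, epsg: int) -> dict:
    """Return a copy of a STAC item dict with projection fields populated."""
    out = copy.deepcopy(item)

    # Declare the extension so validators check the proj:* fields.
    extensions = out.setdefault("stac_extensions", [])
    if PROJ_SCHEMA not in extensions:
        extensions.append(PROJ_SCHEMA)

    props = out.setdefault("properties", {})
    props["proj:epsg"] = epsg
    # Only valid because the native CRS is assumed to be EPSG:4326 here;
    # otherwise these must be expressed in the asset's own CRS.
    props["proj:geometry"] = copy.deepcopy(out["geometry"])
    props["proj:bbox"] = list(out["bbox"])
    return out
```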

Spatial Product Versioning & Lifecycle Enforcement

Spatial Product Lifecycle Management dictates that assets transition through staging → validation → production states. Promotion to production requires:

  1. Successful pystac item validation against the STAC specification and any declared extensions
  2. SHA-256 checksum match between source tile and staged COG
  3. Explicit domain owner sign-off via governance webhook
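
The three gates above can be sketched as a small promotion check. This is an illustrative model only: `PromotionEvidence`, `production_blockers`, and `promote` are hypothetical names, and the sign-off flag stands in for whatever the governance webhook actually returns.

```python
from dataclasses import dataclass
from typing import List

STAGES = ("staging", "validation", "production")


@dataclass(frozen=True)
class PromotionEvidence:
    stac_valid: bool      # gate 1: item validated against spec and extensions
    source_sha256: str    # gate 2: checksum of the source tile
    staged_sha256: str    #         checksum of the staged COG
    owner_approved: bool  # gate 3: domain owner sign-off (assumed boolean)


def production_blockers(evidence: PromotionEvidence) -> List[str]:
    """List every unmet requirement for promotion to production."""
    blockers = []
    if not evidence.stac_valid:
        blockers.append("STAC validation failed")
    if evidence.source_sha256 != evidence.staged_sha256:
        blockers.append("checksum mismatch")
    if not evidence.owner_approved:
        blockers.append("missing domain owner sign-off")
    return blockers


def promote(current: str, evidence: PromotionEvidence) -> str:
    """Advance one stage; entering production requires every gate to pass."""
    index = STAGES.index(current)  # raises ValueError for an unknown stage
    if index == len(STAGES) - 1:
        return current  # already in production
    target = STAGES[index + 1]
    if target == "production":
        blockers = production_blockers(evidence)
        if blockers:
            raise PermissionError("; ".join(blockers))
    return target
```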

Semantic versioning (MAJOR.MINOR.PATCH) governs deterministic consumer upgrades. Schema-breaking changes (e.g., a CRS migration) require a major increment. Minor increments denote additive metadata or processing improvements. Patch increments cover reprocessing of corrupted tiles without altering the spatial contract.
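
A short sketch of how those rules translate into a version bump. The change categories `breaking`, `additive`, and `reprocess` are labels assumed here for illustration; the mapping itself follows the paragraph above.

```python
def bump(version: str, change: str) -> str:
    """Return the next MAJOR.MINOR.PATCH version for a given kind of change."""
    major, minor, patch = (int(part) for part in version.split("."))
    if change == "breaking":    # e.g. CRS migration
        return f"{major + 1}.0.0"
    if change == "additive":    # new metadata or processing improvements
        return f"{major}.{minor + 1}.0"
    if change == "reprocess":   # corrupted tiles redone, contract unchanged
        return f"{major}.{minor}.{patch + 1}"
    raise ValueError(f"unknown change type: {change}")
```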

The STAC specification mandates strict adherence to extension schemas. Consult the official SpatioTemporal Asset Catalog Specification for required fields and extension compliance matrices.

Cross-Team Governance & Escalation Paths

Automated routing of validation failures to the appropriate ownership tier ensures consistent incident response:

| Severity | Trigger Pattern | Automated Action | Escalation Path |
|---|---|---|---|
| P3 | STACValidationError: Missing optional field | Log warning, proceed | Domain Data Steward (Slack #geo-catalog) |
| P2 | Null CRS or mismatched projection | Halt pipeline, quarantine tile in s3://geo-mesh-quarantine/ | Platform Engineer + GIS Steward (Jira GEO-INGEST) |
| P1 | boundary_wkt intersection failure or checksum mismatch | Rollback staging, invalidate cache, trigger re-clipping | Architecture Review Board + Incident Commander |
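
One way to encode this routing is a small ordered rule list, checked from most to least severe. The regular expressions below are assumptions about how the trigger patterns appear in failure messages, and `Route`, `RULES`, and `route_failure` are hypothetical names.

```python
import re
from typing import NamedTuple, Optional


class Route(NamedTuple):
    severity: str
    action: str
    escalation: str


# Ordered most severe first so a message matching several rules escalates highest.
RULES = [
    (
        re.compile(r"boundary_wkt intersection failure|checksum mismatch", re.I),
        Route("P1", "rollback staging, invalidate cache, trigger re-clipping",
              "Architecture Review Board + Incident Commander"),
    ),
    (
        re.compile(r"null crs|mismatched projection", re.I),
        Route("P2", "halt pipeline, quarantine tile",
              "Platform Engineer + GIS Steward (Jira GEO-INGEST)"),
    ),
    (
        re.compile(r"STACValidationError: Missing optional field"),
        Route("P3", "log warning, proceed",
              "Domain Data Steward (Slack #geo-catalog)"),
    ),
]


def route_failure(message: str) -> Optional[Route]:
    """Return the first matching route, or None if no rule applies."""
    for pattern, route in RULES:
        if pattern.search(message):
            return route
    return None
```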

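Rollback removes the failed item version from the catalog index and re-saves the catalog, returning it to the previous manifest. The operation is idempotent: once the item is gone, re-running it changes nothing.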

python
import pystac

# Load the catalog and remove the failed item version
catalog = pystac.Catalog.from_file("s3://geo-mesh-prod/sentinel2/catalog.json")
failed_item_id = "sentinel2-msi-l2a-v2.1-abc12345"
item = catalog.get_item(failed_item_id, recursive=True)
if item:
    item.get_parent().remove_item(failed_item_id)
    catalog.normalize_and_save(
        root_href="s3://geo-mesh-prod/sentinel2/",
        catalog_type=pystac.CatalogType.SELF_CONTAINED,
    )
    print(f"PRODUCT_ROLLBACK_COMPLETE: {failed_item_id} removed from catalog")

All pipeline executions must be logged with structured JSON output for audit trails and drift analysis.
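
A minimal sketch of structured JSON logging using only the standard library. The field names (`ts`, `level`, `event`) and the convention of passing extra context under a `fields` key are assumptions made for this example, not an established schema.

```python
import json
import logging
import sys
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object per line."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "ts": datetime.fromtimestamp(record.created, tz=timezone.utc).isoformat(),
            "level": record.levelname,
            "event": record.getMessage(),
        }
        # Context passed via extra={"fields": {...}} is merged into the payload.
        payload.update(getattr(record, "fields", {}))
        return json.dumps(payload, sort_keys=True)


def make_logger(stream=sys.stdout) -> logging.Logger:
    """Build an 'ingest' logger that writes JSON lines to the given stream."""
    logger = logging.getLogger("ingest")
    logger.handlers.clear()
    logger.propagate = False
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    return logger
```

A call such as `make_logger().info("item_validated", extra={"fields": {"item_id": "sentinel2-msi-l2a-v2.1-abc12345", "dry_run": True}})` then emits one JSON line that audit tooling can parse without regexes.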