Input Dataset Ingestion

This guide covers how to ingest and process input datasets for the Open Climate Risk (OCR) project using the unified CLI infrastructure.

Overview

The input dataset infrastructure provides a consistent interface for ingesting both tensor (raster/Icechunk) and vector (GeoParquet) datasets.

Quick Start

Discovery

List all available datasets:

pixi run ocr ingest-data list-datasets

Processing

Process a dataset (always start with a dry run to preview the operations):

# Preview operations (recommended first step)
pixi run ocr ingest-data run-all scott-et-al-2024 --dry-run

# Execute the full pipeline
pixi run ocr ingest-data run-all scott-et-al-2024

# Use Coiled for distributed processing
pixi run ocr ingest-data run-all scott-et-al-2024 --use-coiled

Dataset-Specific Options

Different datasets support different processing options:

# Vector datasets: Overture Maps - select data type
pixi run ocr ingest-data process overture-maps --overture-data-type buildings

# Vector datasets: Census TIGER - select geography and states
pixi run ocr ingest-data process census-tiger \
  --census-geography-type tracts \
  --census-subset-states California --census-subset-states Oregon

Available Datasets

Our processed input datasets have been transferred to a public AWS bucket in us-west-2 hosted by Source Cooperative.

Tensor Datasets (Raster/Icechunk)

scott-et-al-2024

USFS Wildfire Risk to Communities (2nd Edition)

  • RDS ID: RDS-2020-0016-02
  • Version: 2024-V2
  • Source: USFS Research Data Archive
  • Resolution: 30m (EPSG:4326), native 270m (EPSG:5070)
  • Coverage: CONUS
  • Variables: BP (Burn Probability), CRPS (Conditional Risk to Potential Structures), CFL (Conditional Flame Length), Exposure, FLEP4 and FLEP8 (Flame Length Exceedance Probability at 4 ft and 8 ft), RPS (Risk to Potential Structures), WHP (Wildfire Hazard Potential)

Pipeline:

  1. Download 8 TIFF files from USFS Box (one per variable)
  2. Merge TIFFs into Icechunk store (EPSG:5070, native resolution)
  3. Reproject to EPSG:4326 at 30m resolution (see the sketch after this list)
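
As a rough illustration of steps 2 and 3, here is a minimal sketch using rioxarray and icechunk's Python API. The file names, bucket, and variable subset are assumptions for illustration, not the project's actual code:

import icechunk
import rioxarray
import xarray as xr

# Open the downloaded per-variable TIFFs and merge them into one dataset.
# File names are hypothetical; the real pipeline handles all eight variables.
variables = ["BP", "CRPS", "CFL", "WHP"]
ds = xr.Dataset(
    {
        var: rioxarray.open_rasterio(f"{var}_CONUS.tif").squeeze("band", drop=True)
        for var in variables
    }
)

# Write the merged native-resolution (EPSG:5070) dataset to an Icechunk store.
# Bucket and prefix are placeholders.
storage = icechunk.s3_storage(
    bucket="my-bucket",
    prefix="scott-et-al-2024-270m-5070.icechunk",
    region="us-west-2",
)
repo = icechunk.Repository.create(storage)
session = repo.writable_session("main")
ds.to_zarr(session.store, mode="w", consolidated=False)
session.commit("merge native-resolution variables")

# Reproject to EPSG:4326 (resolution is specified in degrees for a geographic CRS)
ds_4326 = ds.rio.reproject("EPSG:4326")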

Usage:

# Full pipeline
pixi run ocr ingest-data run-all scott-et-al-2024 --dry-run
pixi run ocr ingest-data run-all scott-et-al-2024 --use-coiled

# Individual steps
pixi run ocr ingest-data download scott-et-al-2024
pixi run ocr ingest-data process scott-et-al-2024 --use-coiled

Outputs:

  • Raw TIFFs: s3://us-west-2.opendata.source.coop/carbonplan/carbonplan-ocr/input/fire-risk/tensor/USFS/RDS-2020-0016-02/input_tif/
  • Native Icechunk: s3://us-west-2.opendata.source.coop/carbonplan/carbonplan-ocr/input/fire-risk/tensor/USFS/RDS-2020-0016-02_all_vars_merge_icechunk/
  • Reprojected: s3://us-west-2.opendata.source.coop/carbonplan/carbonplan-ocr/input/fire-risk/tensor/USFS/scott-et-al-2024-30m-4326.icechunk/

riley-et-al-2025

USFS Probabilistic Wildfire Risk - 2011 & 2047 Climate Runs

  • RDS ID: RDS-2025-0006
  • Version: 2025
  • Source: USFS Research Data Archive
  • Resolution: 30m (EPSG:4326), native 270m (EPSG:5070)
  • Coverage: CONUS
  • Variables: Multiple climate scenarios (2011 baseline, 2047 projections)

Pipeline:

  1. Download TIFF files for both time periods
  2. Process and merge into Icechunk stores
  3. Reproject to EPSG:4326 at 30m resolution

Usage:

pixi run ocr ingest-data run-all riley-et-al-2025 --use-coiled

Outputs:

  • Reprojected: s3://us-west-2.opendata.source.coop/carbonplan/carbonplan-ocr/input/fire-risk/tensor/USFS/riley-et-al-2025-30m-4326.icechunk/

dillon-et-al-2023

USFS Spatial Datasets of Probabilistic Wildfire Risk Components (270m, 3rd Edition)

  • RDS ID: RDS-2016-0034-3
  • Version: 2023
  • Source: USFS Research Data Archive
  • Resolution: 30m (EPSG:4326), native 270m (EPSG:5070)
  • Coverage: CONUS
  • Variables: BP, FLP1-6 (Flame Length Probability levels)

Pipeline:

  1. Download ZIP archive and extract TIFFs
  2. Upload TIFFs to S3 and merge into Icechunk
  3. Reproject to EPSG:4326 at 30m resolution

Usage:

pixi run ocr ingest-data run-all dillon-et-al-2023 --use-coiled

Outputs:

  • Raw TIFFs: s3://us-west-2.opendata.source.coop/carbonplan/carbonplan-ocr/input/fire-risk/tensor/USFS/dillon-et-al-2023/raw-input-tiffs/
  • Native Icechunk: s3://us-west-2.opendata.source.coop/carbonplan/carbonplan-ocr/input/fire-risk/tensor/USFS/dillon-et-al-2023/processed-270m-5070.icechunk/
  • Reprojected: s3://us-west-2.opendata.source.coop/carbonplan/carbonplan-ocr/input/fire-risk/tensor/USFS/dillon-et-al-2023/processed-30m-4326.icechunk/

Vector Datasets (GeoParquet)

overture-maps

Overture Maps Building and Address Data for CONUS

  • Release: 2025-09-24.0
  • Source: Overture Maps Foundation
  • Format: GeoParquet (WKB geometry, zstd compression)
  • Coverage: CONUS (spatially filtered from global dataset)
  • Data Types: Buildings (bbox + geometry), Addresses (full attributes), Region-Tagged Buildings (buildings + census identifiers)

Pipeline:

  1. Query the Overture S3 bucket directly (no download step)
  2. Filter by CONUS bounding box using DuckDB (see the sketch after this list)
  3. Write the subsetted data to the carbonplan-ocr S3 bucket
  4. If buildings are processed, perform a spatial join with US Census blocks to add geographic identifiers
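
A minimal sketch of steps 1 and 2, assuming Overture's published S3 release layout and an approximate CONUS bounding box; the output path is illustrative:

import duckdb

con = duckdb.connect()
con.sql("INSTALL httpfs; LOAD httpfs;")
con.sql("SET s3_region = 'us-west-2';")

# Approximate CONUS bounding box in degrees (illustrative values)
xmin, ymin, xmax, ymax = -125.0, 24.5, -66.9, 49.4

# Overture GeoParquet files carry a bbox struct column, so the filter can be
# evaluated against Parquet statistics without parsing geometries.
con.sql(f"""
    COPY (
        SELECT *
        FROM read_parquet(
            's3://overturemaps-us-west-2/release/2025-09-24.0/theme=buildings/type=building/*',
            hive_partitioning=1
        )
        WHERE bbox.xmin >= {xmin} AND bbox.xmax <= {xmax}
          AND bbox.ymin >= {ymin} AND bbox.ymax <= {ymax}
    ) TO 'conus-buildings.parquet' (FORMAT PARQUET, COMPRESSION ZSTD)
""")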

Region-Tagged Buildings Processing:

When buildings are processed, an additional dataset is automatically created that tags each building with census geographic identifiers:

  • Loads census FIPS lookup table for state/county names
  • Creates spatial indexes on buildings and census blocks
  • Performs a bbox-filtered spatial join using ST_Intersects (sketched below)
  • Adds identifiers at multiple administrative levels: state, county, tract, block group, and block
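A minimal sketch of the join with DuckDB's spatial extension; the input paths are hypothetical, the geometry columns are assumed to be stored as WKB, and the real pipeline adds separate identifiers at each administrative level rather than a single GEOID:

import duckdb

con = duckdb.connect()
con.sql("INSTALL spatial; LOAD spatial;")

# Hypothetical local GeoParquet inputs
con.sql("""
    COPY (
        SELECT
            b.*,
            blk.GEOID20  -- block GEOID; encodes state, county, tract, and block
        FROM read_parquet('conus-buildings.parquet') AS b
        JOIN read_parquet('blocks.parquet') AS blk
          ON ST_Intersects(
                 ST_GeomFromWKB(b.geometry),
                 ST_GeomFromWKB(blk.geometry)
             )
    ) TO 'region-tagged-buildings.parquet' (FORMAT PARQUET, COMPRESSION ZSTD)
""")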

Usage:

# Both buildings and addresses (default)
# Also creates region-tagged buildings automatically
pixi run ocr ingest-data run-all overture-maps

# Only buildings (also creates region-tagged buildings)
pixi run ocr ingest-data process overture-maps --overture-data-type buildings

# Only addresses (no region tagging)
pixi run ocr ingest-data process overture-maps --overture-data-type addresses

# Dry run
pixi run ocr ingest-data run-all overture-maps --dry-run

# Use Coiled for distributed processing
pixi run ocr ingest-data run-all overture-maps --use-coiled

Outputs:

  • Buildings: s3://us-west-2.opendata.source.coop/carbonplan/carbonplan-ocr/input/fire-risk/vector/overture-maps/CONUS-overture-buildings-2025-09-24.0.parquet
  • Addresses: s3://us-west-2.opendata.source.coop/carbonplan/carbonplan-ocr/input/fire-risk/vector/overture-maps/CONUS-overture-addresses-2025-09-24.0.parquet
  • Region-Tagged Buildings: s3://us-west-2.opendata.source.coop/carbonplan/carbonplan-ocr/input/fire-risk/vector/overture-maps/CONUS-overture-region-tagged-buildings-2025-09-24.0.parquet
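
Since these outputs are standard GeoParquet, they can be read directly; a minimal geopandas sketch, assuming anonymous access to the public bucket via s3fs:

import geopandas as gpd

# Anonymous read from the public bucket; loads the full file into memory
buildings = gpd.read_parquet(
    "s3://us-west-2.opendata.source.coop/carbonplan/carbonplan-ocr/input/"
    "fire-risk/vector/overture-maps/CONUS-overture-buildings-2025-09-24.0.parquet",
    storage_options={"anon": True},
)
print(buildings.head())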

census-tiger

US Census TIGER/Line Geographic Boundaries

  • Vintage: 2024 (tracts/counties), 2025 (blocks)
  • Source: US Census Bureau TIGER/Line
  • Format: GeoParquet (WKB geometry, zstd compression, schema v1.1.0)
  • Coverage: CONUS + DC (49 jurisdictions: 48 conterminous states plus DC; excludes Alaska and Hawaii)
  • Geography Types: Blocks, Tracts, Counties

Pipeline:

  1. Download TIGER/Line shapefiles from the Census Bureau (per-state for blocks/tracts)
  2. Convert to GeoParquet with spatial metadata (see the sketch after this list)
  3. Aggregate the per-state tract files using DuckDB
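
Step 2 can be sketched with geopandas, which writes GeoParquet with WKB geometry directly; the TIGER URL pattern here is an assumption for illustration:

import geopandas as gpd

# Census tracts for California (state FIPS 06), TIGER/Line 2024 (assumed URL pattern)
url = "https://www2.census.gov/geo/tiger/TIGER2024/TRACT/tl_2024_06_tract.zip"
gdf = gpd.read_file(url)

# GeoParquet with zstd compression; schema_version="1.1.0" requires a recent geopandas
gdf.to_parquet("06_tracts.parquet", compression="zstd", schema_version="1.1.0")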

Usage:

# All geography types (default)
pixi run ocr ingest-data run-all census-tiger

# Only counties
pixi run ocr ingest-data process census-tiger --census-geography-type counties

# Tracts for specific states
pixi run ocr ingest-data process census-tiger --census-geography-type tracts \
  --census-subset-states California --census-subset-states Oregon

# Dry run
pixi run ocr ingest-data run-all census-tiger --dry-run

Outputs:

  • Blocks: s3://us-west-2.opendata.source.coop/carbonplan/carbonplan-ocr/input/fire-risk/vector/aggregated_regions/blocks/blocks.parquet
  • Tracts (per-state): s3://us-west-2.opendata.source.coop/carbonplan/carbonplan-ocr/input/fire-risk/vector/aggregated_regions/tracts/FIPS/FIPS_*.parquet (FIPS is the two-digit state FIPS code)
  • Tracts (aggregated): s3://us-west-2.opendata.source.coop/carbonplan/carbonplan-ocr/input/fire-risk/vector/aggregated_regions/tracts/tracts.parquet
  • Counties: s3://us-west-2.opendata.source.coop/carbonplan/carbonplan-ocr/input/fire-risk/vector/aggregated_regions/counties/counties.parquet

CLI Reference

Commands

  • list-datasets: Show all available datasets
  • download <dataset>: Download raw source data (tensor datasets only)
  • process <dataset>: Process and upload to S3/Icechunk
  • run-all <dataset>: Complete pipeline (download + process + cleanup)

Global Options

  • --dry-run: Preview operations without executing (recommended before any real run)
  • --debug: Enable debug logging for troubleshooting

Tensor Dataset Options

  • --use-coiled: Use Coiled for distributed processing (USFS datasets)

Vector Dataset Options

Overture Maps

  • --overture-data-type <type>: Which data to process
    • buildings: Only building geometries
    • addresses: Only address points
    • both: Both datasets (default)

Census TIGER

  • --census-geography-type <type>: Which geography to process
    • blocks: Census blocks
    • tracts: Census tracts (per-state + aggregated)
    • counties: County boundaries
    • all: All three types (default)
  • --census-subset-states <state> [<state> ...]: Process only specific states
    • Repeat option for each state: --census-subset-states California --census-subset-states Oregon
    • Use full state names (case-sensitive): California, Oregon, Washington, etc.

Configuration

Environment Variables

All settings can be overridden via environment variables:

# S3 configuration
export OCR_INPUT_DATASET_S3_BUCKET=my-bucket
export OCR_INPUT_DATASET_S3_REGION=us-east-1
export OCR_INPUT_DATASET_BASE_PREFIX=custom/prefix

# Processing options
export OCR_INPUT_DATASET_CHUNK_SIZE=16384
export OCR_INPUT_DATASET_DEBUG=true

# Temporary storage
export OCR_INPUT_DATASET_TEMP_DIR=/path/to/temp

Configuration Class

The InputDatasetConfig class (a Pydantic settings model, sketched after this list) provides:

  • Type validation for all settings
  • Automatic environment variable loading (prefix: OCR_INPUT_DATASET_)
  • Default values for all options
  • Case-insensitive environment variable names
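
A minimal sketch of how such a class is typically declared with pydantic-settings; the field names and defaults below are placeholders, not the project's actual settings:

from pydantic_settings import BaseSettings, SettingsConfigDict

class InputDatasetConfig(BaseSettings):
    # Values load from environment variables with the OCR_INPUT_DATASET_ prefix,
    # matched case-insensitively (e.g. OCR_INPUT_DATASET_S3_BUCKET).
    model_config = SettingsConfigDict(
        env_prefix="OCR_INPUT_DATASET_",
        case_sensitive=False,
    )

    # Placeholder fields and defaults for illustration
    s3_bucket: str = "my-bucket"
    s3_region: str = "us-west-2"
    base_prefix: str = "input"
    chunk_size: int = 8192
    debug: bool = False
    temp_dir: str = "/tmp"

config = InputDatasetConfig()  # validated, environment-aware settings instance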

Troubleshooting

Dry Run First

Always test with --dry-run before executing:

pixi run ocr ingest-data run-all <dataset> --dry-run

This previews all operations without making changes.