Input Dataset Ingestion

This guide covers how to ingest and process input datasets for the Open Climate Risk (OCR) project using the unified CLI infrastructure.

Overview

The input dataset infrastructure provides a consistent interface for ingesting both tensor (raster/Icechunk) and vector (GeoParquet) datasets.

Quick Start

Discovery

List all available datasets:

pixi run ocr ingest-data list-datasets

Processing

Process a dataset (always start with a dry run to preview the operations):

# Preview operations (recommended first step)
pixi run ocr ingest-data run-all scott-et-al-2024 --dry-run

# Execute the full pipeline
pixi run ocr ingest-data run-all scott-et-al-2024

# Use Coiled for distributed processing
pixi run ocr ingest-data run-all scott-et-al-2024 --use-coiled

Dataset-Specific Options

Different datasets support different processing options:

# Vector datasets: Overture Maps - select data type
pixi run ocr ingest-data process overture-maps --overture-data-type buildings

# Vector datasets: Census TIGER - select geography and states
pixi run ocr ingest-data process census-tiger \
  --census-geography-type tracts \
  --census-subset-states California --census-subset-states Oregon

Available Datasets

Our processed input datasets have been transferred to a public AWS bucket in us-west-2 hosted by Source Cooperative.

Tensor Datasets (Raster/Icechunk)

scott-et-al-2024

USFS Wildfire Risk to Communities (2nd Edition)

  • RDS ID: RDS-2020-0016-02
  • Version: 2024-V2
  • Source: USFS Research Data Archive
  • Resolution: 30m (EPSG:4326), native 270m (EPSG:5070)
  • Coverage: CONUS
  • Variables: BP (Burn Probability), CRPS (Conditional Risk to Potential Structures), CFL (Conditional Flame Length), Exposure, FLEP4 and FLEP8 (Flame Length Exceedance Probability at 4 ft and 8 ft), RPS (Risk to Potential Structures), WHP (Wildfire Hazard Potential)

Pipeline:

  1. Download 8 TIFF files from USFS Box (one per variable)
  2. Merge TIFFs into Icechunk store (EPSG:5070, native resolution)
  3. Reproject to EPSG:4326 at 30m resolution (see the sketch after this list)
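
As a rough illustration of steps 2 and 3, here is a minimal sketch using rioxarray and icechunk's Python API. The file names, bucket, and variable subset are assumptions for illustration, not the project's actual code:

import icechunk
import rioxarray
import xarray as xr

# Open the downloaded per-variable TIFFs and merge them into one dataset.
# File names are hypothetical; the real pipeline handles all eight variables.
variables = ["BP", "CRPS", "CFL", "WHP"]
ds = xr.Dataset(
    {
        var: rioxarray.open_rasterio(f"{var}_CONUS.tif").squeeze("band", drop=True)
        for var in variables
    }
)

# Write the merged native-resolution (EPSG:5070) dataset to an Icechunk store.
# Bucket and prefix are placeholders.
storage = icechunk.s3_storage(
    bucket="my-bucket",
    prefix="scott-et-al-2024-270m-5070.icechunk",
    region="us-west-2",
)
repo = icechunk.Repository.create(storage)
session = repo.writable_session("main")
ds.to_zarr(session.store, mode="w", consolidated=False)
session.commit("merge native-resolution variables")

# Reproject to EPSG:4326 (resolution is specified in degrees for a geographic CRS)
ds_4326 = ds.rio.reproject("EPSG:4326")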

Usage:

# Full pipeline
pixi run ocr ingest-data run-all scott-et-al-2024 --dry-run
pixi run ocr ingest-data run-all scott-et-al-2024 --use-coiled

# Individual steps
pixi run ocr ingest-data download scott-et-al-2024
pixi run ocr ingest-data process scott-et-al-2024 --use-coiled

Outputs:

  • Raw TIFFs: s3://us-west-2.opendata.source.coop/carbonplan/carbonplan-ocr/input/fire-risk/tensor/USFS/RDS-2020-0016-02/input_tif/
  • Native Icechunk: s3://us-west-2.opendata.source.coop/carbonplan/carbonplan-ocr/input/fire-risk/tensor/USFS/RDS-2020-0016-02_all_vars_merge_icechunk/
  • Reprojected: s3://us-west-2.opendata.source.coop/carbonplan/carbonplan-ocr/input/fire-risk/tensor/USFS/scott-et-al-2024-30m-4326.icechunk/

riley-et-al-2025

USFS Probabilistic Wildfire Risk - 2011 & 2047 Climate Runs

  • RDS ID: RDS-2025-0006
  • Version: 2025
  • Source: USFS Research Data Archive
  • Resolution: 30m (EPSG:4326), native 270m (EPSG:5070)
  • Coverage: CONUS
  • Variables: Multiple climate scenarios (2011 baseline, 2047 projections)

Pipeline:

  1. Download TIFF files for both time periods
  2. Process and merge into Icechunk stores
  3. Reproject to EPSG:4326 at 30m resolution

Usage:

pixi run ocr ingest-data run-all riley-et-al-2025 --use-coiled

Outputs:

  • Reprojected: s3://us-west-2.opendata.source.coop/carbonplan/carbonplan-ocr/input/fire-risk/tensor/USFS/riley-et-al-2025-30m-4326.icechunk/

dillon-et-al-2023

USFS Spatial Datasets of Probabilistic Wildfire Risk Components (270m, 3rd Edition)

  • RDS ID: RDS-2016-0034-3
  • Version: 2023
  • Source: USFS Research Data Archive
  • Resolution: 30m (EPSG:4326), native 270m (EPSG:5070)
  • Coverage: CONUS
  • Variables: BP, FLP1-6 (Flame Length Probability levels)

Pipeline:

  1. Download ZIP archive and extract TIFFs
  2. Upload TIFFs to S3 and merge into Icechunk
  3. Reproject to EPSG:4326 at 30m resolution

Usage:

pixi run ocr ingest-data run-all dillon-et-al-2023 --use-coiled

Outputs:

  • Raw TIFFs: s3://us-west-2.opendata.source.coop/carbonplan/carbonplan-ocr/input/fire-risk/tensor/USFS/dillon-et-al-2023/raw-input-tiffs/
  • Native Icechunk: s3://us-west-2.opendata.source.coop/carbonplan/carbonplan-ocr/input/fire-risk/tensor/USFS/dillon-et-al-2023/processed-270m-5070.icechunk/
  • Reprojected: s3://us-west-2.opendata.source.coop/carbonplan/carbonplan-ocr/input/fire-risk/tensor/USFS/dillon-et-al-2023/processed-30m-4326.icechunk/

Vector Datasets (GeoParquet)

overture-maps

Overture Maps Building and Address Data for CONUS

  • Release: 2025-09-24.0
  • Source: Overture Maps Foundation
  • Format: GeoParquet (WKB geometry, zstd compression)
  • Coverage: CONUS (spatially filtered from global dataset)
  • Data Types: Buildings (bbox + geometry), Addresses (full attributes), Region-Tagged Buildings (buildings + census identifiers)

Pipeline:

  1. Query the Overture S3 bucket directly (no download step)
  2. Filter by CONUS bounding box using DuckDB (see the sketch after this list)
  3. Write the subsetted data to the carbonplan-ocr S3 bucket
  4. If buildings are processed, perform a spatial join with US Census blocks to add geographic identifiers
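
A minimal sketch of steps 1 and 2, assuming Overture's published S3 release layout and an approximate CONUS bounding box; the output path is illustrative:

import duckdb

con = duckdb.connect()
con.sql("INSTALL httpfs; LOAD httpfs;")
con.sql("SET s3_region = 'us-west-2';")

# Approximate CONUS bounding box in degrees (illustrative values)
xmin, ymin, xmax, ymax = -125.0, 24.5, -66.9, 49.4

# Overture GeoParquet files carry a bbox struct column, so the filter can be
# evaluated against Parquet statistics without parsing geometries.
con.sql(f"""
    COPY (
        SELECT *
        FROM read_parquet(
            's3://overturemaps-us-west-2/release/2025-09-24.0/theme=buildings/type=building/*',
            hive_partitioning=1
        )
        WHERE bbox.xmin >= {xmin} AND bbox.xmax <= {xmax}
          AND bbox.ymin >= {ymin} AND bbox.ymax <= {ymax}
    ) TO 'conus-buildings.parquet' (FORMAT PARQUET, COMPRESSION ZSTD)
""")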

Region-Tagged Buildings Processing:

When buildings are processed, an additional dataset is automatically created that tags each building with census geographic identifiers:

  • Loads census FIPS lookup table for state/county names
  • Creates spatial indexes on buildings and census blocks
  • Performs a bbox-filtered spatial join using ST_Intersects (sketched below)
  • Adds identifiers at multiple administrative levels: state, county, tract, block group, and block
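A minimal sketch of the join with DuckDB's spatial extension; the input paths are hypothetical, the geometry columns are assumed to be stored as WKB, and the real pipeline adds separate identifiers at each administrative level rather than a single GEOID:

import duckdb

con = duckdb.connect()
con.sql("INSTALL spatial; LOAD spatial;")

# Hypothetical local GeoParquet inputs
con.sql("""
    COPY (
        SELECT
            b.*,
            blk.GEOID20  -- block GEOID; encodes state, county, tract, and block
        FROM read_parquet('conus-buildings.parquet') AS b
        JOIN read_parquet('blocks.parquet') AS blk
          ON ST_Intersects(
                 ST_GeomFromWKB(b.geometry),
                 ST_GeomFromWKB(blk.geometry)
             )
    ) TO 'region-tagged-buildings.parquet' (FORMAT PARQUET, COMPRESSION ZSTD)
""")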

Usage:

# Both buildings and addresses (default)
# Also creates region-tagged buildings automatically
pixi run ocr ingest-data run-all overture-maps

# Only buildings (also creates region-tagged buildings)
pixi run ocr ingest-data process overture-maps --overture-data-type buildings

# Only addresses (no region tagging)
pixi run ocr ingest-data process overture-maps --overture-data-type addresses

# Dry run
pixi run ocr ingest-data run-all overture-maps --dry-run

# Use Coiled for distributed processing
pixi run ocr ingest-data run-all overture-maps --use-coiled

Outputs:

  • Buildings: s3://us-west-2.opendata.source.coop/carbonplan/carbonplan-ocr/input/fire-risk/vector/overture-maps/CONUS-overture-buildings-2025-09-24.0.parquet
  • Addresses: s3://us-west-2.opendata.source.coop/carbonplan/carbonplan-ocr/input/fire-risk/vector/overture-maps/CONUS-overture-addresses-2025-09-24.0.parquet
  • Region-Tagged Buildings: s3://us-west-2.opendata.source.coop/carbonplan/carbonplan-ocr/input/fire-risk/vector/overture-maps/CONUS-overture-region-tagged-buildings-2025-09-24.0.parquet
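
Since these outputs are standard GeoParquet, they can be read directly; a minimal geopandas sketch, assuming anonymous access to the public bucket via s3fs:

import geopandas as gpd

# Anonymous read from the public bucket; loads the full file into memory
buildings = gpd.read_parquet(
    "s3://us-west-2.opendata.source.coop/carbonplan/carbonplan-ocr/input/"
    "fire-risk/vector/overture-maps/CONUS-overture-buildings-2025-09-24.0.parquet",
    storage_options={"anon": True},
)
print(buildings.head())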

census-tiger

US Census TIGER/Line Geographic Boundaries

  • Vintage: 2024 (tracts/counties), 2025 (blocks)
  • Source: US Census Bureau TIGER/Line
  • Format: GeoParquet (WKB geometry, zstd compression, schema v1.1.0)
  • Coverage: CONUS + DC (49 jurisdictions: 48 conterminous states plus DC; excludes Alaska and Hawaii)
  • Geography Types: Blocks, Tracts, Counties

Pipeline:

  1. Download TIGER/Line shapefiles from the Census Bureau (per-state for blocks/tracts)
  2. Convert to GeoParquet with spatial metadata (see the sketch after this list)
  3. Aggregate the per-state tract files using DuckDB
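
Step 2 can be sketched with geopandas, which writes GeoParquet with WKB geometry directly; the TIGER URL pattern here is an assumption for illustration:

import geopandas as gpd

# Census tracts for California (state FIPS 06), TIGER/Line 2024 (assumed URL pattern)
url = "https://www2.census.gov/geo/tiger/TIGER2024/TRACT/tl_2024_06_tract.zip"
gdf = gpd.read_file(url)

# GeoParquet with zstd compression; schema_version="1.1.0" requires a recent geopandas
gdf.to_parquet("06_tracts.parquet", compression="zstd", schema_version="1.1.0")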

Usage:

# All geography types (default)
pixi run ocr ingest-data run-all census-tiger

# Only counties
pixi run ocr ingest-data process census-tiger --census-geography-type counties

# Tracts for specific states
pixi run ocr ingest-data process census-tiger --census-geography-type tracts \
  --census-subset-states California --census-subset-states Oregon

# Dry run
pixi run ocr ingest-data run-all census-tiger --dry-run

Outputs:

  • Blocks: s3://us-west-2.opendata.source.coop/carbonplan/carbonplan-ocr/input/fire-risk/vector/aggregated_regions/blocks/blocks.parquet
  • Tracts (per-state): s3://us-west-2.opendata.source.coop/carbonplan/carbonplan-ocr/input/fire-risk/vector/aggregated_regions/tracts/FIPS/FIPS_*.parquet (FIPS is the two-digit state FIPS code)
  • Tracts (aggregated): s3://us-west-2.opendata.source.coop/carbonplan/carbonplan-ocr/input/fire-risk/vector/aggregated_regions/tracts/tracts.parquet
  • Counties: s3://us-west-2.opendata.source.coop/carbonplan/carbonplan-ocr/input/fire-risk/vector/aggregated_regions/counties/counties.parquet

CLI Reference

Commands

  • list-datasets: Show all available datasets
  • download <dataset>: Download raw source data (tensor datasets only)
  • process <dataset>: Process and upload to S3/Icechunk
  • run-all <dataset>: Complete pipeline (download + process + cleanup)

Global Options

  • --dry-run: Preview operations without executing (recommended before any real run)
  • --debug: Enable debug logging for troubleshooting

Tensor Dataset Options

  • --use-coiled: Use Coiled for distributed processing (USFS datasets)

Vector Dataset Options

Overture Maps

  • --overture-data-type <type>: Which data to process
    • buildings: Only building geometries
    • addresses: Only address points
    • both: Both datasets (default)

Census TIGER

  • --census-geography-type <type>: Which geography to process
    • blocks: Census blocks
    • tracts: Census tracts (per-state + aggregated)
    • counties: County boundaries
    • all: All three types (default)
  • --census-subset-states <state> [<state> ...]: Process only specific states
    • Repeat option for each state: --census-subset-states California --census-subset-states Oregon
    • Use full state names (case-sensitive): California, Oregon, Washington, etc.

Configuration

Environment Variables

All settings can be overridden via environment variables:

# S3 configuration
export OCR_INPUT_DATASET_S3_BUCKET=my-bucket
export OCR_INPUT_DATASET_S3_REGION=us-east-1
export OCR_INPUT_DATASET_BASE_PREFIX=custom/prefix

# Processing options
export OCR_INPUT_DATASET_CHUNK_SIZE=16384
export OCR_INPUT_DATASET_DEBUG=true

# Temporary storage
export OCR_INPUT_DATASET_TEMP_DIR=/path/to/temp

Configuration Class

The InputDatasetConfig class (a Pydantic settings model, sketched after this list) provides:

  • Type validation for all settings
  • Automatic environment variable loading (prefix: OCR_INPUT_DATASET_)
  • Default values for all options
  • Case-insensitive environment variable names
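
A minimal sketch of how such a class is typically declared with pydantic-settings; the field names and defaults below are placeholders, not the project's actual settings:

from pydantic_settings import BaseSettings, SettingsConfigDict

class InputDatasetConfig(BaseSettings):
    # Values load from environment variables with the OCR_INPUT_DATASET_ prefix,
    # matched case-insensitively (e.g. OCR_INPUT_DATASET_S3_BUCKET).
    model_config = SettingsConfigDict(
        env_prefix="OCR_INPUT_DATASET_",
        case_sensitive=False,
    )

    # Placeholder fields and defaults for illustration
    s3_bucket: str = "my-bucket"
    s3_region: str = "us-west-2"
    base_prefix: str = "input"
    chunk_size: int = 8192
    debug: bool = False
    temp_dir: str = "/tmp"

config = InputDatasetConfig()  # validated, environment-aware settings instance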

Troubleshooting

Dry Run First

Always test with --dry-run before executing:

pixi run ocr ingest-data run-all <dataset> --dry-run

This previews all operations without making changes.