Input Dataset Ingestion¶
This guide covers how to ingest and process input datasets for the OCR (Open Climate Risk) project using the unified CLI infrastructure.
Overview¶
The input dataset infrastructure provides a consistent interface for ingesting both tensor (raster/Icechunk) and vector (GeoParquet) datasets.
Quick Start¶
Discovery¶
List all available datasets:
Processing¶
Process a dataset (always dry run first to preview):
# Preview operations (recommended first step)
pixi run ocr ingest-data run-all scott-et-al-2024 --dry-run
# Execute the full pipeline
pixi run ocr ingest-data run-all scott-et-al-2024
# Use Coiled for distributed processing
pixi run ocr ingest-data run-all scott-et-al-2024 --use-coiled
Dataset-Specific Options¶
Different datasets support different processing options:
# Vector datasets: Overture Maps - select data type
pixi run ocr ingest-data process overture-maps --overture-data-type buildings
# Vector datasets: Census TIGER - select geography and states
pixi run ocr ingest-data process census-tiger \
--census-geography-type tracts \
--census-subset-states California --census-subset-states Oregon
Available Datasets¶
Our processed input datasets have been transferred to a public AWS bucket in us-west-2 hosted by the Source Cooperative project.
Tensor Datasets (Raster/Icechunk)¶
scott-et-al-2024¶
USFS Wildfire Risk to Communities (2nd Edition)
- RDS ID: RDS-2020-0016-02
- Version: 2024-V2
- Source: USFS Research Data Archive
- Resolution: 30m (EPSG:4326), native 270m (EPSG:5070)
- Coverage: CONUS
- Variables: BP (Burn Probability), CRPS (Conditional Risk to Potential Structures), CFL (Conditional Flame Length), Exposure, FLEP4, FLEP8, RPS (Risk to Potential Structures), WHP (Wildfire Hazard Potential)
Pipeline:
- Download 8 TIFF files from USFS Box (one per variable)
- Merge TIFFs into Icechunk store (EPSG:5070, native resolution)
- Reproject to EPSG:4326 at 30m resolution
Usage:
# Full pipeline
pixi run ocr ingest-data run-all scott-et-al-2024 --dry-run
pixi run ocr ingest-data run-all scott-et-al-2024 --use-coiled
# Individual steps
pixi run ocr ingest-data download scott-et-al-2024
pixi run ocr ingest-data process scott-et-al-2024 --use-coiled
Outputs:
- Raw TIFFs: s3://us-west-2.opendata.source.coop/carbonplan/carbonplan-ocr/input/fire-risk/tensor/USFS/RDS-2020-0016-02/input_tif/
- Native Icechunk: s3://us-west-2.opendata.source.coop/carbonplan/carbonplan-ocr/input/fire-risk/tensor/USFS/RDS-2020-0016-02_all_vars_merge_icechunk/
- Reprojected: s3://us-west-2.opendata.source.coop/carbonplan/carbonplan-ocr/input/fire-risk/tensor/USFS/scott-et-al-2024-30m-4326.icechunk/
riley-et-al-2025¶
USFS Probabilistic Wildfire Risk - 2011 & 2047 Climate Runs
- RDS ID: RDS-2025-0006
- Version: 2025
- Source: USFS Research Data Archive
- Resolution: 30m (EPSG:4326), native 270m (EPSG:5070)
- Coverage: CONUS
- Variables: Multiple climate scenarios (2011 baseline, 2047 projections)
Pipeline:
- Download TIFF files for both time periods
- Process and merge into Icechunk stores
- Reproject to EPSG:4326 at 30m resolution
Usage:
# Full pipeline (preview first)
pixi run ocr ingest-data run-all riley-et-al-2025 --dry-run
pixi run ocr ingest-data run-all riley-et-al-2025 --use-coiled
Outputs:
- Reprojected:
s3://us-west-2.opendata.source.coop/carbonplan/carbonplan-ocr/input/fire-risk/tensor/USFS/riley-et-al-2025-30m-4326.icechunk/
dillon-et-al-2023¶
USFS Spatial Datasets of Probabilistic Wildfire Risk Components (270m, 3rd Edition)
- RDS ID: RDS-2016-0034-3
- Version: 2023
- Source: USFS Research Data Archive
- Resolution: 30m (EPSG:4326), native 270m (EPSG:5070)
- Coverage: CONUS
- Variables: BP, FLP1-6 (Flame Length Probability levels)
Pipeline:
- Download ZIP archive and extract TIFFs
- Upload TIFFs to S3 and merge into Icechunk
- Reproject to EPSG:4326 at 30m resolution
Usage:
# Full pipeline (preview first)
pixi run ocr ingest-data run-all dillon-et-al-2023 --dry-run
pixi run ocr ingest-data run-all dillon-et-al-2023 --use-coiled
Outputs:
- Raw TIFFs: s3://us-west-2.opendata.source.coop/carbonplan/carbonplan-ocr/input/fire-risk/tensor/USFS/dillon-et-al-2023/raw-input-tiffs/
- Native Icechunk: s3://us-west-2.opendata.source.coop/carbonplan/carbonplan-ocr/input/fire-risk/tensor/USFS/dillon-et-al-2023/processed-270m-5070.icechunk/
- Reprojected: s3://us-west-2.opendata.source.coop/carbonplan/carbonplan-ocr/input/fire-risk/tensor/USFS/dillon-et-al-2023/processed-30m-4326.icechunk/
Vector Datasets (GeoParquet)¶
overture-maps¶
Overture Maps Building and Address Data for CONUS
- Release: 2025-09-24.0
- Source: Overture Maps Foundation
- Format: GeoParquet (WKB geometry, zstd compression)
- Coverage: CONUS (spatially filtered from global dataset)
- Data Types: Buildings (bbox + geometry), Addresses (full attributes), Region-Tagged Buildings (buildings + census identifiers)
Pipeline:
- Query Overture S3 bucket directly (no download step)
- Filter by CONUS bounding box using DuckDB
- Write subsetted data to carbonplan-ocr S3 bucket
- (If buildings processed) Perform spatial join with US Census blocks to add geographic identifiers
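The CONUS bounding-box filter in step 2 can be sketched in plain Python. The actual pipeline runs this predicate inside DuckDB against the Overture S3 bucket; the bounds below are approximate and the function name is illustrative, not part of the real codebase.

```python
# Illustrative sketch of the CONUS bounding-box filter applied to Overture
# features. Bounds are approximate (xmin, ymin, xmax, ymax) in EPSG:4326.
CONUS_BBOX = (-125.0, 24.5, -66.9, 49.4)

def intersects_conus(xmin: float, ymin: float, xmax: float, ymax: float) -> bool:
    """Return True if a feature's bbox overlaps the CONUS bounding box."""
    cxmin, cymin, cxmax, cymax = CONUS_BBOX
    return not (xmax < cxmin or xmin > cxmax or ymax < cymin or ymin > cymax)

# A building near Denver is kept; one in Hawaii is filtered out.
print(intersects_conus(-105.0, 39.7, -104.9, 39.8))   # True
print(intersects_conus(-157.9, 21.3, -157.8, 21.4))   # False
```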
Region-Tagged Buildings Processing:
When buildings are processed, an additional dataset is automatically created that tags each building with census geographic identifiers:
- Loads census FIPS lookup table for state/county names
- Creates spatial indexes on buildings and census blocks
- Performs bbox-filtered spatial join using ST_Intersects
- Adds identifiers at multiple administrative levels: state, county, tract, block group, and block
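Conceptually, the bbox-filtered join works like the sketch below. The real pipeline does this in DuckDB with spatial indexes and the exact ST_Intersects predicate; here the predicate is approximated by bbox overlap, and all identifiers and coordinates are hypothetical.

```python
# Conceptual sketch of the bbox-prefiltered spatial join that tags buildings
# with census block identifiers (GEOIDs). Data below is hypothetical.

def bbox_overlaps(a, b):
    """a, b are (xmin, ymin, xmax, ymax) tuples."""
    return not (a[2] < b[0] or a[0] > b[2] or a[3] < b[1] or a[1] > b[3])

def tag_buildings(buildings, blocks):
    """Attach the GEOID of the first overlapping census block to each building."""
    tagged = []
    for b_id, b_bbox in buildings:
        geoid = next(
            (g for g, blk_bbox in blocks if bbox_overlaps(b_bbox, blk_bbox)),
            None,  # building falls outside every block bbox
        )
        tagged.append((b_id, geoid))
    return tagged

buildings = [("bldg-1", (-104.99, 39.74, -104.98, 39.75))]
blocks = [("080310041001000", (-105.00, 39.70, -104.95, 39.80))]
print(tag_buildings(buildings, blocks))
# [('bldg-1', '080310041001000')]
```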
Usage:
# Both buildings and addresses (default)
# Also creates region-tagged buildings automatically
pixi run ocr ingest-data run-all overture-maps
# Only buildings (also creates region-tagged buildings)
pixi run ocr ingest-data process overture-maps --overture-data-type buildings
# Only addresses (no region tagging)
pixi run ocr ingest-data process overture-maps --overture-data-type addresses
# Dry run
pixi run ocr ingest-data run-all overture-maps --dry-run
# Use Coiled for distributed processing
pixi run ocr ingest-data run-all overture-maps --use-coiled
Outputs:
- Buildings: s3://us-west-2.opendata.source.coop/carbonplan/carbonplan-ocr/input/fire-risk/vector/overture-maps/CONUS-overture-buildings-2025-09-24.0.parquet
- Addresses: s3://us-west-2.opendata.source.coop/carbonplan/carbonplan-ocr/input/fire-risk/vector/overture-maps/CONUS-overture-addresses-2025-09-24.0.parquet
- Region-Tagged Buildings: s3://us-west-2.opendata.source.coop/carbonplan/carbonplan-ocr/input/fire-risk/vector/overture-maps/CONUS-overture-region-tagged-buildings-2025-09-24.0.parquet
census-tiger¶
US Census TIGER/Line Geographic Boundaries
- Vintage: 2024 (tracts/counties), 2025 (blocks)
- Source: US Census Bureau TIGER/Line
- Format: GeoParquet (WKB geometry, zstd compression, schema v1.1.0)
- Coverage: CONUS + DC (49 states/territories, excludes Alaska & Hawaii)
- Geography Types: Blocks, Tracts, Counties
Pipeline:
- Download TIGER/Line shapefiles from Census Bureau (per-state for blocks/tracts)
- Convert to GeoParquet with spatial metadata
- Aggregate tract files using DuckDB
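The per-state step above keys outputs by state FIPS code. A minimal sketch of how those per-state tract prefixes might be constructed (the FIPS mapping is truncated to two states for illustration, and the helper function is hypothetical, not part of the real codebase):

```python
# Sketch of per-state tract output prefixes, keyed by state FIPS code,
# mirroring the documented output layout. Mapping truncated for illustration.
STATE_FIPS = {"California": "06", "Oregon": "41"}

BASE = ("s3://us-west-2.opendata.source.coop/carbonplan/carbonplan-ocr/"
        "input/fire-risk/vector/aggregated_regions/tracts")

def tract_prefixes(states):
    """Return the per-state GeoParquet prefixes for the requested states."""
    return [f"{BASE}/{STATE_FIPS[s]}/" for s in states]

print(tract_prefixes(["California", "Oregon"]))
```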
Usage:
# All geography types (default)
pixi run ocr ingest-data run-all census-tiger
# Only counties
pixi run ocr ingest-data process census-tiger --census-geography-type counties
# Tracts for specific states
pixi run ocr ingest-data process census-tiger --census-geography-type tracts \
--census-subset-states California --census-subset-states Oregon
# Dry run
pixi run ocr ingest-data run-all census-tiger --dry-run
Outputs:
- Blocks: s3://us-west-2.opendata.source.coop/carbonplan/carbonplan-ocr/input/fire-risk/vector/aggregated_regions/blocks/blocks.parquet
- Tracts (per-state): s3://us-west-2.opendata.source.coop/carbonplan/carbonplan-ocr/input/fire-risk/vector/aggregated_regions/tracts/FIPS/FIPS_*.parquet
- Tracts (aggregated): s3://us-west-2.opendata.source.coop/carbonplan/carbonplan-ocr/input/fire-risk/vector/aggregated_regions/tracts/tracts.parquet
- Counties: s3://us-west-2.opendata.source.coop/carbonplan/carbonplan-ocr/input/fire-risk/vector/aggregated_regions/counties/counties.parquet
CLI Reference¶
Commands¶
- list-datasets: Show all available datasets
- download <dataset>: Download raw source data (tensor datasets only)
- process <dataset>: Process and upload to S3/Icechunk
- run-all <dataset>: Complete pipeline (download + process + cleanup)
Global Options¶
- --dry-run: Preview operations without executing (recommended before any real run)
- --debug: Enable debug logging for troubleshooting
Tensor Dataset Options¶
--use-coiled: Use Coiled for distributed processing (USFS datasets)
Vector Dataset Options¶
Overture Maps¶
- --overture-data-type <type>: Which data to process
  - buildings: Only building geometries
  - addresses: Only address points
  - both: Both datasets (default)
Census TIGER¶
- --census-geography-type <type>: Which geography to process
  - blocks: Census blocks
  - tracts: Census tracts (per-state + aggregated)
  - counties: County boundaries
  - all: All three types (default)
- --census-subset-states <state>: Process only specific states
  - Repeat the option for each state: --census-subset-states California --census-subset-states Oregon
  - Use full state names (case-sensitive): California, Oregon, Washington, etc.
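The command surface above can be sketched with stdlib argparse. This only mirrors the documented subcommands and flags; the real `ocr ingest-data` CLI may be built with a different framework and accept more options.

```python
import argparse

# Minimal argparse sketch of the documented `ocr ingest-data` command surface.
def build_parser():
    p = argparse.ArgumentParser(prog="ocr ingest-data")
    sub = p.add_subparsers(dest="command", required=True)

    sub.add_parser("list-datasets")
    for name in ("download", "process", "run-all"):
        c = sub.add_parser(name)
        c.add_argument("dataset")
        c.add_argument("--dry-run", action="store_true")
        c.add_argument("--debug", action="store_true")
        c.add_argument("--use-coiled", action="store_true")
        c.add_argument("--overture-data-type",
                       choices=["buildings", "addresses", "both"])
        c.add_argument("--census-geography-type",
                       choices=["blocks", "tracts", "counties", "all"])
        # append lets the option be repeated once per state
        c.add_argument("--census-subset-states", action="append")
    return p

args = build_parser().parse_args(
    ["process", "census-tiger", "--census-geography-type", "tracts",
     "--census-subset-states", "California", "--census-subset-states", "Oregon"]
)
print(args.census_subset_states)  # ['California', 'Oregon']
```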
Configuration¶
Environment Variables¶
All settings can be overridden via environment variables:
# S3 configuration
export OCR_INPUT_DATASET_S3_BUCKET=my-bucket
export OCR_INPUT_DATASET_S3_REGION=us-east-1
export OCR_INPUT_DATASET_BASE_PREFIX=custom/prefix
# Processing options
export OCR_INPUT_DATASET_CHUNK_SIZE=16384
export OCR_INPUT_DATASET_DEBUG=true
# Temporary storage
export OCR_INPUT_DATASET_TEMP_DIR=/path/to/temp
Configuration Class¶
The InputDatasetConfig class (Pydantic model) provides:
- Type validation for all settings
- Automatic environment variable loading (prefix: OCR_INPUT_DATASET_)
- Default values for all options
- Case-insensitive environment variable names
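The behavior listed above can be sketched with the stdlib alone. The real InputDatasetConfig is a Pydantic model; the field names and defaults below are illustrative, not the actual schema.

```python
import os

# Stdlib sketch of prefixed, case-insensitive env-var loading with typed
# defaults, mimicking what InputDatasetConfig provides via Pydantic.
PREFIX = "OCR_INPUT_DATASET_"
DEFAULTS = {"s3_bucket": "my-bucket", "chunk_size": 16384, "debug": False}

def load_config(environ=os.environ):
    env = {k.upper(): v for k, v in environ.items()}  # case-insensitive lookup
    config = {}
    for field, default in DEFAULTS.items():
        raw = env.get(PREFIX + field.upper())
        if raw is None:
            config[field] = default
        elif isinstance(default, bool):
            config[field] = raw.lower() in ("1", "true", "yes")
        elif isinstance(default, int):
            config[field] = int(raw)
        else:
            config[field] = raw
    return config

print(load_config({"ocr_input_dataset_chunk_size": "8192"}))
# {'s3_bucket': 'my-bucket', 'chunk_size': 8192, 'debug': False}
```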
Troubleshooting¶
Dry Run First¶
Always test with --dry-run before executing:
pixi run ocr ingest-data run-all <dataset> --dry-run
This previews all operations without making changes.