OCR Data Pipeline¶
The OCR (Open Climate Risk) data pipeline processes climate risk data through a series of coordinated stages, from individual region processing to final tile generation for visualization.
Overview¶
The pipeline transforms raw climate data into risk assessments through four main stages:
- Region Processing - Calculate risk metrics for individual geographic regions
- Data Aggregation - Combine regional results into consolidated datasets
- Statistical Summaries - Generate county and tract-level statistics (optional)
- Tile Generation - Create PMTiles for web visualization
Getting Started¶
Prerequisites¶
- Python environment with OCR package installed (see installation guide)
- AWS credentials (for data access)
- Coiled account (for cloud execution, optional)
Tutorial: quick end-to-end (local)¶
This tutorial walks you through a short, practical run that processes one region locally and inspects the output.
- Ensure your environment is configured and the package is installed (see installation guide).
- Copy an example env and set a local storage path for quick testing:

```shell
cp ocr-local.env .env
# For local testing you can set OCR_STORAGE_ROOT to a local path, e.g. ./output/
```
- Run a single-region processing job locally:
- Inspect outputs in the storage root (geoparquet files and logs):
- If you set `OCR_DEBUG=1`, you will see detailed logs printed to stdout.
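The run-and-inspect steps above can be sketched as follows (a minimal sketch: the region ID `y10_x2` and the `./output/` storage root are illustrative, adjust to your configuration):

```shell
# Process one region locally, using the .env created above
ocr process-region y10_x2 --env-file .env --platform local

# Inspect the resulting geoparquet files and logs in the storage root
ls -R ./output/
```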
Tutorial: quick end-to-end (Coiled)¶
Use Coiled for parallel, large-scale processing.
- Ensure Coiled credentials are set by logging into your account via the Coiled CLI.
- Run an example multi-region job on Coiled:
- Monitor the job in Coiled's web UI and check outputs in your `OCR_STORAGE_ROOT` bucket.
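A sketch of those steps, assuming your Coiled account and AWS credentials are already configured (region IDs are illustrative):

```shell
# Authenticate with Coiled
coiled login

# Run a multi-region job on Coiled
ocr run --region-id y10_x2 --region-id y11_x3 --platform coiled --env-file .env
```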
Basic Usage¶
Process a single region locally:
Process multiple regions on Coiled:
Process all available regions:
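For example (region IDs are illustrative):

```shell
# Process a single region locally
ocr run --region-id y10_x2 --platform local

# Process multiple regions on Coiled
ocr run --region-id y10_x2 --region-id y11_x3 --platform coiled

# Process all available regions on Coiled
ocr run --all-region-ids --platform coiled
```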
Execution Platforms¶
Local Platform¶
Best for: Development, testing, debugging, small datasets
- Runs entirely on your local machine
- Uses local temporary directories
- No cloud costs or dependencies
- Limited by local computational resources
- Sequential processing only
Coiled Platform¶
Best for: Production workloads, large-scale processing, parallel execution
- Runs on AWS cloud infrastructure
- Automatic resource scaling and management
- Parallel job execution across multiple workers
- Optimized VM types for different workloads
- Built-in monitoring and cost tracking
Configuration¶
Environment Setup¶
Create a .env file for your configuration:
```shell
# .env file for OCR configuration
OCR_STORAGE_ROOT=s3://your-bucket/
OCR_ENVIRONMENT=QA
```
Use your configuration file:
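For example, pass the file to any command via `-e/--env-file` (shown here with `ocr run`; the region ID is illustrative):

```shell
ocr run --env-file .env --region-id y10_x2 --platform local
```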
Key Configuration Components¶
- Icechunk Store - Version-controlled data storage backend
- Vector Output - Location for processed geoparquet and PMTiles files
- Environment - Data version/environment (prod, QA, etc.)
- Chunking - Defines valid region boundaries and IDs
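Because the `.env` file uses plain `KEY=value` lines, its variables can also be exported into your current shell for use with other tools. A generic shell sketch, not an OCR command (the file name and values are placeholders):

```shell
# Write a minimal example env file (values are placeholders)
cat > example.env <<'EOF'
OCR_STORAGE_ROOT=s3://your-bucket/
OCR_ENVIRONMENT=QA
EOF

# Export every KEY=value pair into the current shell
set -a
. ./example.env
set +a

echo "$OCR_ENVIRONMENT"   # QA
```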
CLI Commands¶
ocr¶
Run OCR deployment pipeline on Coiled
Usage:
Options:
- `--install-completion` - Install completion for the current shell.
- `--show-completion` - Show completion for the current shell, to copy it or customize the installation.
- `--help` - Show this message and exit.
Subcommands
- aggregate-region-risk-summary-stats: Generate time-horizon based statistical summaries for county and tract level PMTiles creation
- create-building-pmtiles: Create PMTiles from the consolidated geoparquet file.
- create-pyramid: Create Pyramid
- create-regional-pmtiles: Create PMTiles for regional risk statistics (counties and tracts).
- ingest-data: Ingest and process input datasets
- partition-buildings: Partition buildings geoparquet by state and county FIPS codes.
- process-region: Calculate and write risk for a given region to Icechunk CONUS template.
- run: Run the OCR deployment pipeline. This will process regions, aggregate geoparquet files, and create PMTiles layers for the specified risk type.
- write-aggregated-region-analysis-files: Write aggregated statistical summaries for each region (county and tract).
ocr aggregate-region-risk-summary-stats¶
Generate time-horizon based statistical summaries for county and tract level PMTiles creation
Usage:
Options:
- `-e, --env-file PATH` - Path to the environment variables file. These will be used to set up the OCRConfiguration
- `-p, --platform [coiled|local]` - If set, schedule this command on the specified platform instead of running inline.
- `--vm-type TEXT` - Coiled VM type override (Coiled only). [default: c8g.16xlarge]
- `--help` - Show this message and exit.
ocr create-building-pmtiles¶
Create PMTiles from the consolidated geoparquet file.
Usage:
Options:
- `-e, --env-file PATH` - Path to the environment variables file. These will be used to set up the OCRConfiguration
- `-p, --platform [coiled|local]` - If set, schedule this command on the specified platform instead of running inline.
- `--vm-type TEXT` - Coiled VM type override (Coiled only). [default: c8g.8xlarge]
- `--disk-size INTEGER` - Disk size in GB (Coiled only). [default: 250]
- `--help` - Show this message and exit.
ocr create-pyramid¶
Create Pyramid
Usage:
Options:
- `-e, --env-file PATH` - Path to the environment variables file. These will be used to set up the OCRConfiguration
- `-p, --platform [coiled|local]` - If set, schedule this command on the specified platform instead of running inline.
- `--vm-type TEXT` - Coiled VM type override (Coiled only). [default: m8g.16xlarge]
- `--help` - Show this message and exit.
ocr create-regional-pmtiles¶
Create PMTiles for regional risk statistics (counties and tracts).
Usage:
Options:
- `-e, --env-file PATH` - Path to the environment variables file. These will be used to set up the OCRConfiguration
- `-p, --platform [coiled|local]` - If set, schedule this command on the specified platform instead of running inline.
- `--vm-type TEXT` - Coiled VM type override (Coiled only). [default: c8g.8xlarge]
- `--disk-size INTEGER` - Disk size in GB (Coiled only). [default: 250]
- `--help` - Show this message and exit.
ocr ingest-data¶
Ingest and process input datasets
Usage:
Options:
Subcommands
- download: Download raw source data for a dataset.
- list-datasets: List all available datasets that can be ingested.
- process: Process downloaded data and upload to S3/Icechunk.
- run-all: Run the complete pipeline: download, process, and cleanup.
ocr ingest-data download¶
Download raw source data for a dataset.
Usage:
Options:
- `DATASET` - Name of the dataset to download [required]
- `--dry-run` - Preview operations without executing
- `--debug` - Enable debug logging
- `--help` - Show this message and exit.
ocr ingest-data list-datasets¶
List all available datasets that can be ingested.
Usage:
Options:
ocr ingest-data process¶
Process downloaded data and upload to S3/Icechunk.
Usage:
Options:
- `DATASET` - Name of the dataset to process [required]
- `--dry-run` - Preview operations without executing
- `--use-coiled` - Use Coiled for distributed processing
- `--software TEXT` - Software environment to use (required if --use-coiled is set)
- `--debug` - Enable debug logging
- `--overture-data-type TEXT` - For overture-maps: which data to process (buildings, addresses, or both) [default: both]
- `--census-geography-type TEXT` - For census-tiger: which geography to process (blocks, tracts, counties, or all) [default: all]
- `--census-subset-states TEXT` - For census-tiger: subset of states to process (e.g., California Oregon)
- `--help` - Show this message and exit.
ocr ingest-data run-all¶
Run the complete pipeline: download, process, and cleanup.
Usage:
Options:
- `DATASET` - Name of the dataset to process [required]
- `--dry-run` - Preview operations without executing
- `--use-coiled` - Use Coiled for distributed processing
- `--debug` - Enable debug logging
- `--overture-data-type TEXT` - For overture-maps: which data to process (buildings, addresses, or both) [default: both]
- `--census-geography-type TEXT` - For census-tiger: which geography to process (blocks, tracts, counties, or all) [default: all]
- `--census-subset-states TEXT` - For census-tiger: subset of states to process (e.g., California Oregon)
- `--help` - Show this message and exit.
ocr partition-buildings¶
Partition buildings geoparquet by state and county FIPS codes.
Usage:
Options:
- `-e, --env-file PATH` - Path to the environment variables file. These will be used to set up the OCRConfiguration
- `-p, --platform [coiled|local]` - If set, schedule this command on the specified platform instead of running inline.
- `--vm-type TEXT` - Coiled VM type override (Coiled only). [default: c8g.12xlarge]
- `--help` - Show this message and exit.
ocr process-region¶
Calculate and write risk for a given region to Icechunk CONUS template.
Usage:
Options:
- `-e, --env-file PATH` - Path to the environment variables file. These will be used to set up the OCRConfiguration
- `REGION_ID` - Region ID to process, e.g., y10_x2 [required]
- `-t, --risk-type [fire]` - Type of risk to calculate [default: fire]
- `-p, --platform [coiled|local]` - If set, schedule this command on the specified platform instead of running inline.
- `--vm-type TEXT` - Coiled VM type override (Coiled only).
- `--init-repo` - Initialize Icechunk repository (if not already initialized).
- `--help` - Show this message and exit.
ocr run¶
Run the OCR deployment pipeline. This will process regions, aggregate geoparquet files, and create PMTiles layers for the specified risk type.
Usage:
Options:
- `-e, --env-file PATH` - Path to the environment variables file. These will be used to set up the OCRConfiguration
- `-r, --region-id TEXT` - Region IDs to process, e.g., y10_x2
- `--all-region-ids` - Process all valid region IDs
- `-t, --risk-type [fire]` - Type of risk to calculate [default: fire]
- `--write-regional-stats` - Write aggregated statistical summaries for each region (one file per region type with stats like averages, medians, percentiles, and histograms)
- `--create-pyramid` - Create ndpyramid / multiscale zarr for web-visualization
- `-p, --platform [coiled|local]` - Platform to run the pipeline on [default: local]
- `--wipe` - Wipe the icechunk and vector data storages before running the pipeline
- `--dispatch-platform [coiled|local]` - If set, schedule this run command on the specified platform instead of running inline.
- `--vm-type TEXT` - VM type override for dispatch-platform (Coiled only).
- `--process-retries INTEGER RANGE` - Number of times to retry failed process-region tasks (Coiled only). 0 disables retries. [default: 2; x>=0]
- `--help` - Show this message and exit.
ocr write-aggregated-region-analysis-files¶
Write aggregated statistical summaries for each region (county and tract).
Creates one file per region type containing aggregated statistics for ALL regions, including building counts, average/median risk values, percentiles (p90, p95, p99), and histograms. Outputs in geoparquet, geojson, and csv formats.
Usage:
Options:
- `-e, --env-file PATH` - Path to the environment variables file. These will be used to set up the OCRConfiguration
- `-p, --platform [coiled|local]` - If set, schedule this command on the specified platform instead of running inline.
- `--vm-type TEXT` - Coiled VM type override (Coiled only). [default: r8g.4xlarge]
- `--help` - Show this message and exit.
Pipeline Orchestration¶
ocr run - Full Pipeline¶
The main command that orchestrates the complete processing pipeline.
Key Options:
- `--region-id` - Process specific regions (can specify multiple)
- `--all-region-ids` - Process all available regions
- `--platform` - Choose `local` or `coiled` execution
- `--risk-type` - Calculate `fire` or `wind` risk (default: fire)
- `--write-regional-stats` - Write regional aggregated summary stats geospatial files
Examples:
```shell
# Development workflow
ocr run --region-id y10_x2 --platform local

# Production processing with statistics
ocr run --all-region-ids --platform coiled --env-file prod.env

# Multi-region wind risk analysis
ocr run --region-id y10_x2 --region-id y11_x3 --risk-type wind --platform coiled
```
Individual Stage Commands¶
ocr process-region - Single Region Processing¶
Process risk calculations for one specific region.
```shell
# Process fire risk for region y10_x2
ocr process-region y10_x2 --risk-type fire

# Process with custom environment
ocr process-region y15_x7 --env-file production.env --risk-type wind
```
ocr partition-buildings - Data Consolidation¶
Partition processed geoparquet files by state and county FIPS codes.
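A minimal invocation sketch (the env file name is illustrative):

```shell
ocr partition-buildings --env-file prod.env --platform coiled
```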
ocr aggregate-region-risk-summary-stats - Statistical Summaries¶
Generate county and tract-level risk statistics.
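For example, with a VM type override for a smaller run (env file name and VM type are illustrative):

```shell
ocr aggregate-region-risk-summary-stats --env-file prod.env --platform coiled --vm-type c8g.8xlarge
```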
ocr create-regional-pmtiles - Regional Tiles¶
Create PMTiles for county and tract-level visualizations.
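For example, with extra disk for the tiling step (env file name is illustrative):

```shell
ocr create-regional-pmtiles --env-file prod.env --platform coiled --disk-size 250
```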
ocr create-building-pmtiles - Building PMTiles¶
Generate PMTiles from the consolidated building dataset.
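A minimal invocation sketch, run locally (env file name is illustrative):

```shell
ocr create-building-pmtiles --env-file prod.env --platform local
```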
ocr write-aggregated-region-analysis-files - Write Analysis Files¶
Write aggregated region analysis files (csv, geoparquet and geojson).
You can add the `--write-regional-stats` flag to `ocr run` to include this optional step in the pipeline.
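A minimal standalone invocation sketch (env file name is illustrative):

```shell
ocr write-aggregated-region-analysis-files --env-file prod.env --platform local
```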
Troubleshooting¶
Common Issues¶
Environment configuration issues¶
Solutions:
- Verify the `.env` file exists and is properly formatted
- Check that all required AWS credentials are set
- Ensure Coiled credentials are configured (for the cloud platform)
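These checks can be scripted; a quick diagnostic sketch, assuming the AWS and Coiled CLIs are installed:

```shell
# 1. Confirm the .env file exists and list its (non-secret) keys
test -f .env && cut -d= -f1 .env || echo "missing .env"

# 2. Confirm AWS credentials resolve to a valid identity
aws sts get-caller-identity

# 3. Confirm Coiled credentials are configured (logs in if not)
coiled login
```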
Resource and access issues¶
Local Platform¶
- Disk space: Check available space in temp directory
- Memory: Reduce dataset size or increase system RAM
- Permissions: Verify file/directory access rights
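A quick way to check the first two items (paths are generic; `free` is Linux-only):

```shell
# Free disk space in the temp directory used for local runs
df -h "${TMPDIR:-/tmp}"

# Available memory (Linux)
free -h
```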
Coiled Platform¶
- Job failures: Check Coiled credentials and account quotas
- AWS access: Verify IAM permissions and credentials
- Network: Confirm AWS region and connectivity