OCR Data Pipeline

The OCR (Open Climate Risk) data pipeline processes climate risk data through a series of coordinated stages, from individual region processing to final tile generation for visualization.

Overview

The pipeline transforms raw climate data into risk assessments through four main stages:

  1. Region Processing - Calculate risk metrics for individual geographic regions
  2. Data Aggregation - Combine regional results into consolidated datasets
  3. Statistical Summaries - Generate county and tract-level statistics (optional)
  4. Tile Generation - Create PMTiles for web visualization

Getting Started

Prerequisites

  • Python environment with OCR package installed (see installation guide)
  • AWS credentials (for data access)
  • Coiled account (for cloud execution, optional)

Tutorial: quick end-to-end (local)

This tutorial walks you through a short, practical run that processes one region locally and inspects the output.

  1. Ensure your environment is configured and the package is installed (see installation guide).
  2. Copy the example env file and set a local storage path for quick testing:
cp ocr-local.env .env
# For local testing you can set OCR_STORAGE_ROOT to a local path, e.g. ./output/
  3. Run a single-region processing job locally:
ocr process-region y10_x2 --risk-type fire --platform local
  4. Inspect outputs in the storage root (geoparquet files and logs):
ls -la $OCR_STORAGE_ROOT/
  5. If you set OCR_DEBUG=1 you will see detailed logs printed to stdout.
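
For a scripted version of the inspection step above, the sketch below lists the Parquet files under a local storage root. It assumes OCR_STORAGE_ROOT points at a local directory (not an s3:// URL) and that outputs carry a .parquet suffix; the layout inside the storage root is not described in this guide, so the search is recursive. The helpers list_geoparquet_outputs and describe are hypothetical and not part of the OCR package.

```python
from pathlib import Path


def list_geoparquet_outputs(storage_root: str) -> list[Path]:
    """Return every *.parquet file under a local storage root, sorted by path."""
    root = Path(storage_root).expanduser()
    if not root.is_dir():
        raise FileNotFoundError(f"storage root is not a local directory: {root}")
    # Search recursively because the layout below the root is not known here.
    return sorted(p for p in root.rglob("*.parquet") if p.is_file())


def describe(paths: list[Path]) -> list[str]:
    """Format each path with its size in bytes, one line per file."""
    return [f"{p}  ({p.stat().st_size} bytes)" for p in paths]
```

Calling describe(list_geoparquet_outputs("./output")) after a local run gives a quick listing; an empty list suggests the job wrote elsewhere or did not finish.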

Tutorial: quick end-to-end (Coiled)

Use Coiled for parallel, large-scale processing.

  1. Ensure Coiled credentials are set by logging into your account via the Coiled CLI.
  2. Run an example multi-region job on Coiled:
ocr run --region-id y10_x2 --region-id y11_x3 --platform coiled --env-file .env
  3. Monitor the job on Coiled's web UI and check outputs in your OCR_STORAGE_ROOT bucket.

Basic Usage

Process a single region locally:

ocr run --region-id y10_x2 --platform local

Process multiple regions on Coiled:

ocr run --region-id y10_x2 --region-id y11_x3 --platform coiled

Process all available regions:

ocr run --all-region-ids --platform coiled
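
When these invocations are generated from a script, it can help to build the argument list programmatically. The sketch below is a hypothetical helper (build_run_command is not part of the OCR package) that uses only flags shown in this guide. It assumes exactly one of --region-id or --all-region-ids is wanted per call; how the CLI itself treats a combination of the two is not documented here.

```python
import shlex

PLATFORMS = ("local", "coiled")  # the two choices listed for --platform


def build_run_command(region_ids=(), *, all_regions=False, platform="local", env_file=None):
    """Build the argv list for `ocr run` from the options used in this guide."""
    region_ids = list(region_ids)
    if platform not in PLATFORMS:
        raise ValueError(f"platform must be one of {PLATFORMS}, got {platform!r}")
    # Assumption of this sketch: exactly one region selector is given.
    if all_regions == bool(region_ids):
        raise ValueError("pass either region_ids or all_regions=True, not both or neither")
    cmd = ["ocr", "run"]
    if all_regions:
        cmd.append("--all-region-ids")
    for region_id in region_ids:
        cmd += ["--region-id", region_id]  # the flag is repeated once per region
    cmd += ["--platform", platform]
    if env_file is not None:
        cmd += ["--env-file", str(env_file)]
    return cmd


# Prints: ocr run --region-id y10_x2 --region-id y11_x3 --platform coiled
print(shlex.join(build_run_command(["y10_x2", "y11_x3"], platform="coiled")))
```

The returned list can be passed to subprocess.run as-is, which avoids shell quoting issues.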

Execution Platforms

Local Platform

Best for: Development, testing, debugging, small datasets

  • Runs entirely on your local machine
  • Uses local temporary directories
  • No cloud costs or dependencies
  • Limited by local computational resources
  • Sequential processing only

Coiled Platform

Best for: Production workloads, large-scale processing, parallel execution

  • Runs on AWS cloud infrastructure
  • Automatic resource scaling and management
  • Parallel job execution across multiple workers
  • Optimized VM types for different workloads
  • Built-in monitoring and cost tracking

Configuration

Environment Setup

Create a .env file for your configuration:

# .env file for OCR configuration
# OCR Configuration
OCR_STORAGE_ROOT=s3://your-bucket/
OCR_ENVIRONMENT=QA

Use your configuration file:

ocr run --env-file .env --region-id y10_x2
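
Before launching a long job it can be worth confirming that the env file parses the way you expect. The sketch below is a minimal KEY=VALUE reader, assuming the file contains only simple assignments and # comments as in the example above. It is not how the OCR package loads its configuration (that mechanism is not described here), and parse_env_text is a hypothetical name.

```python
def parse_env_text(text: str) -> dict[str, str]:
    """Parse simple KEY=VALUE lines, skipping blank lines and # comments."""
    values: dict[str, str] = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#"):
            continue
        key, sep, value = line.partition("=")
        if not sep:
            continue  # not a KEY=VALUE line; ignored in this sketch
        value = value.strip()
        # Drop one pair of matching surrounding quotes, if present.
        if len(value) >= 2 and value[0] == value[-1] and value[0] in "'\"":
            value = value[1:-1]
        values[key.strip()] = value
    return values


example = """\
# OCR Configuration
OCR_STORAGE_ROOT=s3://your-bucket/
OCR_ENVIRONMENT=QA
"""
config = parse_env_text(example)
print(config)
```

To check a real file, pass the result of pathlib.Path(".env").read_text() instead of the inline example.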

Key Configuration Components

  • Icechunk Store - Version-controlled data storage backend
  • Vector Output - Location for processed geoparquet and PMTiles files
  • Environment - Data version/environment (prod, QA, etc.)
  • Chunking - Defines valid region boundaries and IDs
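
The region IDs used throughout this guide (y10_x2, y11_x3, y15_x7) all follow a y<row>_x<column> shape. The sketch below checks that shape and extracts the two indices. The pattern is inferred from the examples only; which index ranges are actually valid is determined by the chunking configuration, which this check does not consult. parse_region_id is a hypothetical helper.

```python
import re

# Shape inferred from the example IDs in this guide; an assumption, not a spec.
_REGION_ID = re.compile(r"^y(\d+)_x(\d+)$")


def parse_region_id(region_id: str) -> tuple[int, int]:
    """Return (y_index, x_index) for an ID shaped like 'y10_x2'."""
    match = _REGION_ID.match(region_id)
    if match is None:
        raise ValueError(f"region ID does not look like y<int>_x<int>: {region_id!r}")
    return int(match.group(1)), int(match.group(2))


print(parse_region_id("y10_x2"))  # (10, 2)
```

A shape check like this catches typos such as x2_y10 before a job is submitted, but it cannot tell whether a well-formed ID exists in the dataset.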

CLI Commands

ocr

Run OCR deployment pipeline on Coiled

Usage:

ocr [OPTIONS] COMMAND [ARGS]...

Options:

  --install-completion  Install completion for the current shell.
  --show-completion     Show completion for the current shell, to copy it or
                        customize the installation.
  --help                Show this message and exit.

Subcommands

ocr aggregate-region-risk-summary-stats

Generate time-horizon based statistical summaries for county and tract level PMTiles creation

Usage:

ocr aggregate-region-risk-summary-stats [OPTIONS]

Options:

  -e, --env-file PATH            Path to the environment variables file. These
                                 will be used to set up the OCRConfiguration
  -p, --platform [coiled|local]  If set, schedule this command on the
                                 specified platform instead of running inline.
  --vm-type TEXT                 Coiled VM type override (Coiled only).
                                 [default: c8g.16xlarge]
  --help                         Show this message and exit.

ocr create-building-pmtiles

Create PMTiles from the consolidated geoparquet file.

Usage:

ocr create-building-pmtiles [OPTIONS]

Options:

  -e, --env-file PATH            Path to the environment variables file. These
                                 will be used to set up the OCRConfiguration
  -p, --platform [coiled|local]  If set, schedule this command on the
                                 specified platform instead of running inline.
  --vm-type TEXT                 Coiled VM type override (Coiled only).
                                 [default: c8g.8xlarge]
  --disk-size INTEGER            Disk size in GB (Coiled only).  [default:
                                 250]
  --help                         Show this message and exit.

ocr create-pyramid

Create Pyramid

Usage:

ocr create-pyramid [OPTIONS]

Options:

  -e, --env-file PATH            Path to the environment variables file. These
                                 will be used to set up the OCRConfiguration
  -p, --platform [coiled|local]  If set, schedule this command on the
                                 specified platform instead of running inline.
  --vm-type TEXT                 Coiled VM type override (Coiled only).
                                 [default: m8g.16xlarge]
  --help                         Show this message and exit.

ocr create-regional-pmtiles

Create PMTiles for regional risk statistics (counties and tracts).

Usage:

ocr create-regional-pmtiles [OPTIONS]

Options:

  -e, --env-file PATH            Path to the environment variables file. These
                                 will be used to set up the OCRConfiguration
  -p, --platform [coiled|local]  If set, schedule this command on the
                                 specified platform instead of running inline.
  --vm-type TEXT                 Coiled VM type override (Coiled only).
                                 [default: c8g.8xlarge]
  --disk-size INTEGER            Disk size in GB (Coiled only).  [default:
                                 250]
  --help                         Show this message and exit.

ocr ingest-data

Ingest and process input datasets

Usage:

ocr ingest-data [OPTIONS] COMMAND [ARGS]...

Options:

  --help  Show this message and exit.

Subcommands

  • download: Download raw source data for a dataset.
  • list-datasets: List all available datasets that can be ingested.
  • process: Process downloaded data and upload to S3/Icechunk.
  • run-all: Run the complete pipeline: download, process, and cleanup.

ocr ingest-data download

Download raw source data for a dataset.

Usage:

ocr ingest-data download [OPTIONS] DATASET

Options:

  DATASET    Name of the dataset to download  [required]
  --dry-run  Preview operations without executing
  --debug    Enable debug logging
  --help     Show this message and exit.

ocr ingest-data list-datasets

List all available datasets that can be ingested.

Usage:

ocr ingest-data list-datasets [OPTIONS]

Options:

  --help  Show this message and exit.

ocr ingest-data process

Process downloaded data and upload to S3/Icechunk.

Usage:

ocr ingest-data process [OPTIONS] DATASET

Options:

  DATASET                       Name of the dataset to process  [required]
  --dry-run                     Preview operations without executing
  --use-coiled                  Use Coiled for distributed processing
  --software TEXT               Software environment to use (required if
                                --use-coiled is set)
  --debug                       Enable debug logging
  --overture-data-type TEXT     For overture-maps: which data to process
                                (buildings, addresses, or both)  [default:
                                both]
  --census-geography-type TEXT  For census-tiger: which geography to process
                                (blocks, tracts, counties, or all)  [default:
                                all]
  --census-subset-states TEXT   For census-tiger: subset of states to process
                                (e.g., California Oregon)
  --help                        Show this message and exit.

ocr ingest-data run-all

Run the complete pipeline: download, process, and cleanup.

Usage:

ocr ingest-data run-all [OPTIONS] DATASET

Options:

  DATASET                       Name of the dataset to process  [required]
  --dry-run                     Preview operations without executing
  --use-coiled                  Use Coiled for distributed processing
  --debug                       Enable debug logging
  --overture-data-type TEXT     For overture-maps: which data to process
                                (buildings, addresses, or both)  [default:
                                both]
  --census-geography-type TEXT  For census-tiger: which geography to process
                                (blocks, tracts, counties, or all)  [default:
                                all]
  --census-subset-states TEXT   For census-tiger: subset of states to process
                                (e.g., California Oregon)
  --help                        Show this message and exit.

ocr partition-buildings

Partition buildings geoparquet by state and county FIPS codes.

Usage:

ocr partition-buildings [OPTIONS]

Options:

  -e, --env-file PATH            Path to the environment variables file. These
                                 will be used to set up the OCRConfiguration
  -p, --platform [coiled|local]  If set, schedule this command on the
                                 specified platform instead of running inline.
  --vm-type TEXT                 Coiled VM type override (Coiled only).
                                 [default: c8g.12xlarge]
  --help                         Show this message and exit.

ocr process-region

Calculate and write risk for a given region to Icechunk CONUS template.

Usage:

ocr process-region [OPTIONS] REGION_ID

Options:

  -e, --env-file PATH            Path to the environment variables file. These
                                 will be used to set up the OCRConfiguration
  REGION_ID                      Region ID to process, e.g., y10_x2
                                 [required]
  -t, --risk-type [fire]         Type of risk to calculate  [default: fire]
  -p, --platform [coiled|local]  If set, schedule this command on the
                                 specified platform instead of running inline.
  --vm-type TEXT                 Coiled VM type override (Coiled only).
  --init-repo                    Initialize Icechunk repository (if not
                                 already initialized).
  --help                         Show this message and exit.

ocr run

Run the OCR deployment pipeline. This will process regions, aggregate geoparquet files, and create PMTiles layers for the specified risk type.

Usage:

ocr run [OPTIONS]

Options:

  -e, --env-file PATH             Path to the environment variables file.
                                  These will be used to set up the
                                  OCRConfiguration
  -r, --region-id TEXT            Region IDs to process, e.g., y10_x2
  --all-region-ids                Process all valid region IDs
  -t, --risk-type [fire]          Type of risk to calculate  [default: fire]
  --write-regional-stats          Write aggregated statistical summaries for
                                  each region (one file per region type with
                                  stats like averages, medians, percentiles,
                                  and histograms)
  --create-pyramid                Create ndpyramid / multiscale zarr for web-
                                  visualization
  -p, --platform [coiled|local]   Platform to run the pipeline on  [default:
                                  local]
  --wipe                          Wipe the icechunk and vector data storages
                                  before running the pipeline
  --dispatch-platform [coiled|local]
                                  If set, schedule this run command on the
                                  specified platform instead of running
                                  inline.
  --vm-type TEXT                  VM type override for dispatch-platform
                                  (Coiled only).
  --process-retries INTEGER RANGE
                                  Number of times to retry failed process-
                                  region tasks (Coiled only). 0 disables
                                  retries.  [default: 2; x>=0]
  --help                          Show this message and exit.
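
As an illustration of the --process-retries semantics, the sketch below shows one plausible reading: the value counts retries after the first attempt, so the default of 2 allows up to three attempts per task and 0 means a single attempt. This is an assumption drawn from the help text; how retries are actually scheduled on Coiled is not described here, and run_with_retries is a hypothetical helper.

```python
def run_with_retries(task, retries: int = 2):
    """Call task() until it succeeds, allowing `retries` extra attempts."""
    if retries < 0:
        raise ValueError("retries must be >= 0")
    attempts = 1 + retries  # first try plus the allowed retries
    for attempt in range(1, attempts + 1):
        try:
            return task()
        except Exception:
            if attempt == attempts:
                raise  # out of attempts: surface the last error


calls = []


def flaky_task():
    """Fail twice, then succeed, to exercise the retry path."""
    calls.append(1)
    if len(calls) < 3:
        raise RuntimeError("transient failure")
    return "done"


print(run_with_retries(flaky_task, retries=2), "after", len(calls), "attempts")
```

Under this reading, a region that fails three times in a row with the default setting would be reported as failed.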

ocr write-aggregated-region-analysis-files

Write aggregated statistical summaries for each region (county and tract).

Creates one file per region type containing aggregated statistics for ALL regions, including building counts, average/median risk values, percentiles (p90, p95, p99), and histograms. Outputs in geoparquet, geojson, and csv formats.

Usage:

ocr write-aggregated-region-analysis-files [OPTIONS]

Options:

  -e, --env-file PATH            Path to the environment variables file. These
                                 will be used to set up the OCRConfiguration
  -p, --platform [coiled|local]  If set, schedule this command on the
                                 specified platform instead of running inline.
  --vm-type TEXT                 Coiled VM type override (Coiled only).
                                 [default: r8g.4xlarge]
  --help                         Show this message and exit.
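
To make the listed statistics concrete, the sketch below computes a building count, mean, median, p90/p95/p99, and a histogram for one region's risk values using only the standard library. It is an illustration of the quantities named above, not the package's implementation: the dictionary keys, the percentile interpolation method, and the bin edges are all assumptions.

```python
import bisect
import statistics


def summarize_risk(values: list[float]) -> dict[str, float]:
    """Summary statistics for one region's per-building risk values."""
    if len(values) < 2:
        raise ValueError("need at least two values to compute percentiles")
    # 99 cut points; cuts[k - 1] is the k-th percentile.
    cuts = statistics.quantiles(values, n=100, method="inclusive")
    return {
        "building_count": len(values),
        "mean": statistics.fmean(values),
        "median": statistics.median(values),
        "p90": cuts[89],
        "p95": cuts[94],
        "p99": cuts[98],
    }


def histogram(values: list[float], edges: list[float]) -> list[int]:
    """Count values per bin; bins are [lo, hi) except the last, which is closed."""
    counts = [0] * (len(edges) - 1)
    for v in values:
        if v < edges[0] or v > edges[-1]:
            continue  # values outside the edges are dropped in this sketch
        i = min(bisect.bisect_right(edges, v) - 1, len(counts) - 1)
        counts[i] += 1
    return counts


sample = [float(v) for v in range(1, 101)]
print(summarize_risk(sample))
print(histogram([0.0, 0.1, 0.5, 1.0], [0.0, 0.5, 1.0]))  # [2, 2]
```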

Pipeline Orchestration

ocr run - Full Pipeline

ocr run is the main command; it orchestrates the complete processing pipeline.

Key Options:

  • --region-id - Process specific regions (can specify multiple)
  • --all-region-ids - Process all available regions
  • --platform - Choose local or coiled execution
  • --risk-type - Risk type to calculate (default: fire); fire is the only choice listed in the CLI reference
  • --write-regional-stats - Write aggregated statistical summaries for each region type (county and tract)

Examples:

# Development workflow
ocr run --region-id y10_x2 --platform local

# Production processing with statistics
ocr run --all-region-ids --platform coiled --env-file prod.env --write-regional-stats

# Multi-region fire risk analysis
ocr run --region-id y10_x2 --region-id y11_x3 --risk-type fire --platform coiled

Individual Stage Commands

ocr process-region - Single Region Processing

Process risk calculations for one specific region.

# Process fire risk for region y10_x2
ocr process-region y10_x2 --risk-type fire

# Process with custom environment
ocr process-region y15_x7 --env-file production.env --risk-type fire

ocr partition-buildings - Data Consolidation

Partition processed geoparquet files by state and county FIPS codes.

ocr partition-buildings --env-file .env

ocr aggregate-region-risk-summary-stats - Statistical Summaries

Generate county and tract-level risk statistics.

ocr aggregate-region-risk-summary-stats --env-file .env

ocr create-regional-pmtiles - Regional Tiles

Create PMTiles for county and tract-level visualizations.

ocr create-regional-pmtiles --env-file .env

ocr create-building-pmtiles - Building PMTiles

Generate PMTiles from the consolidated building dataset.

ocr create-building-pmtiles --env-file .env

ocr write-aggregated-region-analysis-files - Write Analysis Files

Write aggregated region analysis files (csv, geoparquet, and geojson). To include this optional step in the full pipeline, pass the --write-regional-stats flag to ocr run.

ocr write-aggregated-region-analysis-files --env-file .env
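
When running the stages by hand instead of through ocr run, the commands above can be chained in the order given in the Overview. The sketch below only builds the command lists, so it can be inspected before anything executes. The exact ordering of the two PMTiles commands, and the choice to tie create-regional-pmtiles to the optional statistics stage, are assumptions; ocr run may sequence or parallelize the work differently. plan_stage_commands is a hypothetical helper.

```python
def plan_stage_commands(region_ids, env_file=".env", include_stats=False):
    """Return the stage commands as argv lists, in an assumed execution order."""
    common = ["--env-file", env_file]
    # Stage 1: one process-region call per region.
    plan = [["ocr", "process-region", rid, *common] for rid in region_ids]
    # Stage 2: consolidate the regional outputs.
    plan.append(["ocr", "partition-buildings", *common])
    if include_stats:
        # Stage 3 (optional): county and tract statistics, then their tiles.
        plan.append(["ocr", "aggregate-region-risk-summary-stats", *common])
        plan.append(["ocr", "create-regional-pmtiles", *common])
    # Stage 4: building-level tiles.
    plan.append(["ocr", "create-building-pmtiles", *common])
    return plan


for argv in plan_stage_commands(["y10_x2", "y11_x3"], include_stats=True):
    print(" ".join(argv))
```

Each list can be handed to subprocess.run(argv, check=True) so that a failing stage stops the chain.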

Troubleshooting

Common Issues

Environment configuration issues

Error: Missing required environment variables

Solutions:

  • Verify .env file exists and is properly formatted
  • Check all required AWS credentials are set
  • Ensure Coiled credentials are configured (for cloud platform)
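
A quick preflight check can narrow down which variable is missing. The sketch below reports names that are unset or empty in the current environment. The two names checked are the ones from the example .env file in this guide; whether the pipeline requires both, or others as well, is an assumption. AWS credentials are left out because they may come from a profile instead of environment variables. missing_variables is a hypothetical helper.

```python
import os

# Names taken from the example .env in this guide; the full required set is an assumption.
EXPECTED = ("OCR_STORAGE_ROOT", "OCR_ENVIRONMENT")


def missing_variables(required=EXPECTED, environ=None) -> list[str]:
    """Return the names in `required` that are unset or empty."""
    env = os.environ if environ is None else environ
    return [name for name in required if not env.get(name)]


missing = missing_variables()
if missing:
    print("Missing:", ", ".join(missing))
else:
    print("All expected variables are set.")
```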

Resource and access issues

Local Platform

  • Disk space: Check available space in temp directory
  • Memory: Reduce dataset size or increase system RAM
  • Permissions: Verify file/directory access rights
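
The disk space check can be scripted with the standard library. The sketch below assumes the pipeline writes to the system default temporary directory, which this guide does not confirm, and the 10 GB threshold in the usage line is purely illustrative. Both helper names are hypothetical.

```python
import shutil
import tempfile


def temp_dir_free_gb(path=None) -> float:
    """Free space, in GB (10**9 bytes), on the filesystem holding the temp directory."""
    target = path or tempfile.gettempdir()
    return shutil.disk_usage(target).free / 1e9


def has_enough_space(required_gb: float, path=None) -> bool:
    """True when at least `required_gb` GB are free at `path` (default: temp dir)."""
    return temp_dir_free_gb(path) >= required_gb


print(f"{temp_dir_free_gb():.1f} GB free in {tempfile.gettempdir()}")
print("enough for a 10 GB run:", has_enough_space(10))
```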

Coiled Platform

  • Job failures: Check Coiled credentials and account quotas
  • AWS access: Verify IAM permissions and credentials
  • Network: Confirm AWS region and connectivity