OCR Data Pipeline¶
The OCR (Open Climate Risk) data pipeline processes climate risk data through a series of coordinated stages, from individual region processing to final tile generation for visualization.
Overview¶
The pipeline transforms raw climate data into risk assessments through four main stages:
- Region Processing - Calculate risk metrics for individual geographic regions
- Data Aggregation - Combine regional results into consolidated datasets
- Statistical Summaries - Generate county and tract-level statistics (optional)
- Tile Generation - Create PMTiles for web visualization
Getting Started¶
Prerequisites¶
- Python environment with OCR package installed (see installation guide)
- AWS credentials (for data access)
- Coiled account (for cloud execution, optional)
Tutorial: quick end-to-end (local)¶
This tutorial walks you through a short, practical run that processes one region locally and inspects the output.
- Ensure your environment is configured and the package is installed (see installation guide).
- Copy an example env and set a local storage path for quick testing:

```shell
cp ocr-local.env .env
# For local testing you can set OCR_STORAGE_ROOT to a local path, e.g. ./output/
```
- Run a single-region processing job locally:
- Inspect outputs in the storage root (geoparquet files and logs):
- If you set `OCR_DEBUG=1`, you will see detailed logs printed to stdout.
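The run-and-inspect steps above can be sketched as follows (a minimal sketch: the region ID `y10_x2` and the `./output/` storage root are illustrative, adjust to your configuration):

```shell
# Process one region locally, using the .env created above
ocr process-region y10_x2 --env-file .env --platform local

# Inspect the resulting geoparquet files and logs in the storage root
ls -R ./output/
```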
Tutorial: quick end-to-end (Coiled)¶
Use Coiled for parallel, large-scale processing.
- Ensure Coiled credentials are set by logging into your account via the Coiled CLI.
- Run an example multi-region job on Coiled:
- Monitor the job in Coiled's web UI and check outputs in your `OCR_STORAGE_ROOT` bucket.
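A sketch of those steps, assuming your Coiled account and AWS credentials are already configured (region IDs are illustrative):

```shell
# Authenticate with Coiled
coiled login

# Run a multi-region job on Coiled
ocr run --region-id y10_x2 --region-id y11_x3 --platform coiled --env-file .env
```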
Basic Usage¶
Process a single region locally:
Process multiple regions on Coiled:
Process all available regions:
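For example (region IDs are illustrative):

```shell
# Process a single region locally
ocr run --region-id y10_x2 --platform local

# Process multiple regions on Coiled
ocr run --region-id y10_x2 --region-id y11_x3 --platform coiled

# Process all available regions on Coiled
ocr run --all-region-ids --platform coiled
```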
Execution Platforms¶
Local Platform¶
Best for: Development, testing, debugging, small datasets
- Runs entirely on your local machine
- Uses local temporary directories
- No cloud costs or dependencies
- Limited by local computational resources
- Sequential processing only
Coiled Platform¶
Best for: Production workloads, large-scale processing, parallel execution
- Runs on AWS cloud infrastructure
- Automatic resource scaling and management
- Parallel job execution across multiple workers
- Optimized VM types for different workloads
- Built-in monitoring and cost tracking
Configuration¶
Environment Setup¶
Create a .env file for your configuration:
```shell
# .env file for OCR configuration
OCR_STORAGE_ROOT=s3://your-bucket/
OCR_ENVIRONMENT=QA
```
Use your configuration file:
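For example, pass the file to any command via `-e/--env-file` (shown here with `ocr run`; the region ID is illustrative):

```shell
ocr run --env-file .env --region-id y10_x2 --platform local
```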
Key Configuration Components¶
- Icechunk Store - Version-controlled data storage backend
- Vector Output - Location for processed geoparquet and PMTiles files
- Environment - Data version/environment (prod, QA, etc.)
- Chunking - Defines valid region boundaries and IDs
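Because the `.env` file uses plain `KEY=value` lines, its variables can also be exported into your current shell for use with other tools. A generic shell sketch, not an OCR command (the file name and values are placeholders):

```shell
# Write a minimal example env file (values are placeholders)
cat > example.env <<'EOF'
OCR_STORAGE_ROOT=s3://your-bucket/
OCR_ENVIRONMENT=QA
EOF

# Export every KEY=value pair into the current shell
set -a
. ./example.env
set +a

echo "$OCR_ENVIRONMENT"   # QA
```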
CLI Commands¶
ocr¶
Run OCR deployment pipeline on Coiled
Usage:
Options:
- `--install-completion` - Install completion for the current shell.
- `--show-completion` - Show completion for the current shell, to copy it or customize the installation.
- `--help` - Show this message and exit.
Subcommands
- aggregate-region-risk-summary-stats: Generate time-horizon based statistical summaries for county and tract level PMTiles creation
- create-building-pmtiles: Create PMTiles from the consolidated geoparquet file.
- create-pyramid: Create Pyramid
- create-regional-pmtiles: Create PMTiles for regional risk statistics (counties and tracts).
- ingest-data: Ingest and process input datasets
- partition-buildings: Partition buildings geoparquet by state and county FIPS codes.
- process-region: Calculate and write risk for a given region to Icechunk CONUS template.
- run: Run the OCR deployment pipeline. This will process regions, aggregate geoparquet files, and create PMTiles layers for the specified risk type.
- write-aggregated-region-analysis-files: Write aggregated statistical summaries for each region (county and tract).
ocr aggregate-region-risk-summary-stats¶
Generate time-horizon based statistical summaries for county and tract level PMTiles creation
Usage:
Options:
- `-e, --env-file PATH` - Path to the environment variables file. These will be used to set up the OCRConfiguration
- `-p, --platform [coiled|local]` - If set, schedule this command on the specified platform instead of running inline.
- `--vm-type TEXT` - Coiled VM type override (Coiled only). [default: c8g.16xlarge]
- `--help` - Show this message and exit.
ocr create-building-pmtiles¶
Create PMTiles from the consolidated geoparquet file.
Usage:
Options:
- `-e, --env-file PATH` - Path to the environment variables file. These will be used to set up the OCRConfiguration
- `-p, --platform [coiled|local]` - If set, schedule this command on the specified platform instead of running inline.
- `--vm-type TEXT` - Coiled VM type override (Coiled only). [default: c8g.8xlarge]
- `--disk-size INTEGER` - Disk size in GB (Coiled only). [default: 250]
- `--help` - Show this message and exit.
ocr create-pyramid¶
Create Pyramid
Usage:
Options:
- `-e, --env-file PATH` - Path to the environment variables file. These will be used to set up the OCRConfiguration
- `-p, --platform [coiled|local]` - If set, schedule this command on the specified platform instead of running inline.
- `--vm-type TEXT` - Coiled VM type override (Coiled only). [default: m8g.16xlarge]
- `--help` - Show this message and exit.
ocr create-regional-pmtiles¶
Create PMTiles for regional risk statistics (counties and tracts).
Usage:
Options:
- `-e, --env-file PATH` - Path to the environment variables file. These will be used to set up the OCRConfiguration
- `-p, --platform [coiled|local]` - If set, schedule this command on the specified platform instead of running inline.
- `--vm-type TEXT` - Coiled VM type override (Coiled only). [default: c8g.8xlarge]
- `--disk-size INTEGER` - Disk size in GB (Coiled only). [default: 250]
- `--help` - Show this message and exit.
ocr ingest-data¶
Ingest and process input datasets
Usage:
Options:
Subcommands
- download: Download raw source data for a dataset.
- list-datasets: List all available datasets that can be ingested.
- process: Process downloaded data and upload to S3/Icechunk.
- run-all: Run the complete pipeline: download, process, and cleanup.
ocr ingest-data download¶
Download raw source data for a dataset.
Usage:
Options:
- `DATASET` - Name of the dataset to download [required]
- `--dry-run` - Preview operations without executing
- `--debug` - Enable debug logging
- `--help` - Show this message and exit.
ocr ingest-data list-datasets¶
List all available datasets that can be ingested.
Usage:
Options:
ocr ingest-data process¶
Process downloaded data and upload to S3/Icechunk.
Usage:
Options:
- `DATASET` - Name of the dataset to process [required]
- `--dry-run` - Preview operations without executing
- `--use-coiled` - Use Coiled for distributed processing
- `--software TEXT` - Software environment to use (required if --use-coiled is set)
- `--debug` - Enable debug logging
- `--overture-data-type TEXT` - For overture-maps: which data to process (buildings, addresses, or both) [default: both]
- `--census-geography-type TEXT` - For census-tiger: which geography to process (blocks, tracts, counties, or all) [default: all]
- `--census-subset-states TEXT` - For census-tiger: subset of states to process (e.g., California Oregon)
- `--help` - Show this message and exit.
ocr ingest-data run-all¶
Run the complete pipeline: download, process, and cleanup.
Usage:
Options:
- `DATASET` - Name of the dataset to process [required]
- `--dry-run` - Preview operations without executing
- `--use-coiled` - Use Coiled for distributed processing
- `--debug` - Enable debug logging
- `--overture-data-type TEXT` - For overture-maps: which data to process (buildings, addresses, or both) [default: both]
- `--census-geography-type TEXT` - For census-tiger: which geography to process (blocks, tracts, counties, or all) [default: all]
- `--census-subset-states TEXT` - For census-tiger: subset of states to process (e.g., California Oregon)
- `--help` - Show this message and exit.
ocr partition-buildings¶
Partition buildings geoparquet by state and county FIPS codes.
Usage:
Options:
- `-e, --env-file PATH` - Path to the environment variables file. These will be used to set up the OCRConfiguration
- `-p, --platform [coiled|local]` - If set, schedule this command on the specified platform instead of running inline.
- `--vm-type TEXT` - Coiled VM type override (Coiled only). [default: c8g.12xlarge]
- `--help` - Show this message and exit.
ocr process-region¶
Calculate and write risk for a given region to Icechunk CONUS template.
Usage:
Options:
- `-e, --env-file PATH` - Path to the environment variables file. These will be used to set up the OCRConfiguration
- `REGION_ID` - Region ID to process, e.g., y10_x2 [required]
- `-t, --risk-type [fire]` - Type of risk to calculate [default: fire]
- `-p, --platform [coiled|local]` - If set, schedule this command on the specified platform instead of running inline.
- `--vm-type TEXT` - Coiled VM type override (Coiled only).
- `--init-repo` - Initialize Icechunk repository (if not already initialized).
- `--help` - Show this message and exit.
ocr run¶
Run the OCR deployment pipeline. This will process regions, aggregate geoparquet files, and create PMTiles layers for the specified risk type.
Usage:
Options:
- `-e, --env-file PATH` - Path to the environment variables file. These will be used to set up the OCRConfiguration
- `-r, --region-id TEXT` - Region IDs to process, e.g., y10_x2
- `--all-region-ids` - Process all valid region IDs
- `-t, --risk-type [fire]` - Type of risk to calculate [default: fire]
- `--write-regional-stats` - Write aggregated statistical summaries for each region (one file per region type with stats like averages, medians, percentiles, and histograms)
- `--create-pyramid` - Create ndpyramid / multiscale zarr for web-visualization
- `-p, --platform [coiled|local]` - Platform to run the pipeline on [default: local]
- `--wipe` - Wipe the icechunk and vector data storages before running the pipeline
- `--dispatch-platform [coiled|local]` - If set, schedule this run command on the specified platform instead of running inline.
- `--vm-type TEXT` - VM type override for dispatch-platform (Coiled only).
- `--process-retries INTEGER RANGE` - Number of times to retry failed process-region tasks (Coiled only). 0 disables retries. [default: 2; x>=0]
- `--help` - Show this message and exit.
ocr write-aggregated-region-analysis-files¶
Write aggregated statistical summaries for each region (county and tract).
Creates one file per region type containing aggregated statistics for ALL regions, including building counts, average/median risk values, percentiles (p90, p95, p99), and histograms. Outputs in geoparquet, geojson, and csv formats.
Usage:
Options:
- `-e, --env-file PATH` - Path to the environment variables file. These will be used to set up the OCRConfiguration
- `-p, --platform [coiled|local]` - If set, schedule this command on the specified platform instead of running inline.
- `--vm-type TEXT` - Coiled VM type override (Coiled only). [default: r8g.4xlarge]
- `--help` - Show this message and exit.
Pipeline Orchestration¶
ocr run - Full Pipeline¶
The main command that orchestrates the complete processing pipeline.
Key Options:
- `--region-id` - Process specific regions (can specify multiple)
- `--all-region-ids` - Process all available regions
- `--platform` - Choose `local` or `coiled` execution
- `--risk-type` - Calculate `fire` or `wind` risk (default: fire)
- `--write-regional-stats` - Write regional aggregated summary stats geospatial files
Examples:
```shell
# Development workflow
ocr run --region-id y10_x2 --platform local

# Production processing with statistics
ocr run --all-region-ids --platform coiled --env-file prod.env

# Multi-region wind risk analysis
ocr run --region-id y10_x2 --region-id y11_x3 --risk-type wind --platform coiled
```
Individual Stage Commands¶
ocr process-region - Single Region Processing¶
Process risk calculations for one specific region.
```shell
# Process fire risk for region y10_x2
ocr process-region y10_x2 --risk-type fire

# Process with custom environment
ocr process-region y15_x7 --env-file production.env --risk-type wind
```
ocr partition-buildings - Data Consolidation¶
Partition processed geoparquet files by state and county FIPS codes.
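A minimal invocation sketch (the env file name is illustrative):

```shell
ocr partition-buildings --env-file prod.env --platform coiled
```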
ocr aggregate-region-risk-summary-stats - Statistical Summaries¶
Generate county and tract-level risk statistics.
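For example, with a VM type override for a smaller run (env file name and VM type are illustrative):

```shell
ocr aggregate-region-risk-summary-stats --env-file prod.env --platform coiled --vm-type c8g.8xlarge
```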
ocr create-regional-pmtiles - Regional Tiles¶
Create PMTiles for county and tract-level visualizations.
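For example, with extra disk for the tiling step (env file name is illustrative):

```shell
ocr create-regional-pmtiles --env-file prod.env --platform coiled --disk-size 250
```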
ocr create-building-pmtiles - Building PMTiles¶
Generate PMTiles from the consolidated building dataset.
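A minimal invocation sketch, run locally (env file name is illustrative):

```shell
ocr create-building-pmtiles --env-file prod.env --platform local
```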
ocr write-aggregated-region-analysis-files - Write Analysis Files¶
Write aggregated region analysis files (csv, geoparquet and geojson).
You can add the `--write-regional-stats` flag to `ocr run` to include this optional step in the pipeline.
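A minimal standalone invocation sketch (env file name is illustrative):

```shell
ocr write-aggregated-region-analysis-files --env-file prod.env --platform local
```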
Troubleshooting¶
Common Issues¶
Environment configuration issues¶
Solutions:
- Verify the `.env` file exists and is properly formatted
- Check that all required AWS credentials are set
- Ensure Coiled credentials are configured (for the cloud platform)
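These checks can be scripted; a quick diagnostic sketch, assuming the AWS and Coiled CLIs are installed:

```shell
# 1. Confirm the .env file exists and list its (non-secret) keys
test -f .env && cut -d= -f1 .env || echo "missing .env"

# 2. Confirm AWS credentials resolve to a valid identity
aws sts get-caller-identity

# 3. Confirm Coiled credentials are configured (logs in if not)
coiled login
```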
Resource and access issues¶
Local Platform¶
- Disk space: Check available space in temp directory
- Memory: Reduce dataset size or increase system RAM
- Permissions: Verify file/directory access rights
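A quick way to check the first two items (paths are generic; `free` is Linux-only):

```shell
# Free disk space in the temp directory used for local runs
df -h "${TMPDIR:-/tmp}"

# Available memory (Linux)
free -h
```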
Coiled Platform¶
- Job failures: Check Coiled credentials and account quotas
- AWS access: Verify IAM permissions and credentials
- Network: Confirm AWS region and connectivity