
API reference

This page provides a structured, auto-generated reference for the OCR Python package using mkdocstrings. Each section links to the corresponding module(s) and surfaces docstrings, type hints, and signatures.


Package overview

High-level package entry points and public exports.

ocr


Core modules

Configuration

Configuration models for storage, chunking, Coiled, and processing settings.

ocr.config

Classes

ChunkingConfig

Bases: BaseSettings

Attributes
chunk_info cached property
chunk_info: dict

Get information about the dataset's chunks

extent_as_tuple_5070 cached property
extent_as_tuple_5070

Get extent in EPSG:5070 projection as tuple (xmin, xmax, ymin, ymax)

valid_region_ids cached property
valid_region_ids: list

Generate valid region IDs by checking which regions contain non-null data.

Returns:

  • list

    List of valid region IDs (e.g., 'y1_x3', 'y2_x4', etc.)

Functions
bbox_from_wgs84
bbox_from_wgs84(xmin: float, ymin: float, xmax: float, ymax: float)
chunk_id_to_slice
chunk_id_to_slice(chunk_id: tuple) -> tuple

Convert a chunk ID (iy, ix) to corresponding array slices

Parameters:

  • chunk_id (tuple) –

    The chunk identifier as a tuple (iy, ix), where iy is the index along the y-dimension and ix is the index along the x-dimension.

Returns:

  • chunk_slices ( tuple[slice] ) –

    A tuple of slices (y_slice, x_slice) to extract data for this chunk

chunks_to_slices
chunks_to_slices(chunks: dict) -> dict

Create a dict of chunk_ids and slices from input chunk dict

Parameters:

  • chunks (dict) –

    Dictionary with chunk sizes for 'longitude' and 'latitude'

Returns:

  • dict

    Dictionary with chunk IDs as keys and corresponding slices as values

get_chunk_mapping
get_chunk_mapping() -> dict[str, tuple[int, int]]

Returns a dict of region_ids and their corresponding chunk_indexes.

Returns:

  • chunk_mapping ( dict ) –

    Dictionary with region IDs as keys and corresponding chunk indexes (iy, ix) as values

get_chunks_for_bbox
get_chunks_for_bbox(bbox: Polygon | tuple) -> list[tuple[int, int]]

Find all chunks that intersect with the given bounding box

Parameters:

  • bbox (BoundingBox or tuple) –

    Bounding box to check for intersection. If tuple, format is (minx, miny, maxx, maxy)

Returns:

  • list of tuples

    List of (iy, ix) tuples identifying the intersecting chunks
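
For orientation, a minimal sketch combining the chunk helpers above; it assumes these are methods on a ChunkingConfig instance and that the default settings are sufficient:

from ocr.config import ChunkingConfig

chunking = ChunkingConfig()  # assumes env/default settings are sufficient

# Find chunks intersecting a bounding box (minx, miny, maxx, maxy)
bbox = (-122.5, 37.0, -121.5, 38.0)  # hypothetical extent
chunk_ids = chunking.get_chunks_for_bbox(bbox)

# Convert each chunk ID (iy, ix) into (y_slice, x_slice) array slices
slices = [chunking.chunk_id_to_slice(chunk_id) for chunk_id in chunk_ids]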

index_to_coords
index_to_coords(x_idx: int, y_idx: int) -> tuple[float, float]

Convert array indices to EPSG:4326 coordinates

Parameters:

  • x_idx (int) –

    Index along the x-dimension (longitude)

  • y_idx (int) –

    Index along the y-dimension (latitude)

Returns:

  • x, y : tuple[float, float]

    Corresponding EPSG:4326 coordinates (longitude, latitude)

plot_all_chunks
plot_all_chunks(color_by_size: bool = False) -> None

Plot all data chunks across the entire CONUS with their indices as labels

Parameters:

  • color_by_size (bool, default: False ) –

    If True, color chunks based on their size (useful to identify irregularities)

region_id_chunk_lookup
region_id_chunk_lookup(region_id: str) -> tuple

Given a region_id, e.g. 'y5_x14', return the corresponding chunk (5, 14).

Parameters:

  • region_id (str) –

    The region_id for chunk_id lookup.

Returns:

  • index ( tuple[int, int] ) –

    The corresponding chunk (iy, ix) for the given region_id.

region_id_slice_lookup
region_id_slice_lookup(region_id: str) -> tuple

Given a region_id, e.g. 'y5_x14', return the corresponding slices, e.g. (slice(np.int64(30000), np.int64(36000), None), slice(np.int64(85500), np.int64(90000), None)).

Parameters:

  • region_id (str) –

    The region_id for chunk_id lookup.

Returns:

  • indexer ( tuple[slice] ) –

    The corresponding slices (y_slice, x_slice) for the given region_id.

region_id_to_latlon_slices
region_id_to_latlon_slices(region_id: str) -> tuple

Get latitude and longitude slices from region_id

Returns (lat_slice, lon_slice) where lat_slice.start < lat_slice.stop and lon_slice.start < lon_slice.stop (lower-left origin, lat ascending).
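
The region-ID helpers are closely related; a hypothetical round-trip, again assuming a default ChunkingConfig:

from ocr.config import ChunkingConfig

chunking = ChunkingConfig()

chunk = chunking.region_id_chunk_lookup('y5_x14')              # -> (5, 14)
y_slice, x_slice = chunking.region_id_slice_lookup('y5_x14')   # array slices
lat_slice, lon_slice = chunking.region_id_to_latlon_slices('y5_x14')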

visualize_chunks_on_conus
visualize_chunks_on_conus(
    chunks: list[tuple[int, int]] | None = None,
    color_by_size: bool = False,
    highlight_chunks: list[tuple[int, int]] | None = None,
    include_all_chunks: bool = False,
) -> None

Visualize specified chunks on CONUS map

Parameters:

  • chunks (list of tuples, default: None ) –

    List of (iy, ix) tuples specifying chunks to visualize. If None, all chunks are shown.

  • color_by_size (bool, default: False ) –

    If True, color chunks based on their size

  • highlight_chunks (list of tuples, default: None ) –

    List of (iy, ix) tuples specifying chunks to highlight

  • include_all_chunks (bool, default: False ) –

    If True, show all chunks in background with low opacity

IcechunkConfig

Bases: BaseSettings

Configuration for icechunk processing.

Attributes
uri cached property
uri: UPath

Return the URI for the icechunk repository.

Functions
commit_messages_ancestry
commit_messages_ancestry(branch: str = 'main') -> list[str]

Get the commit messages ancestry for the icechunk repository.

create_template
create_template()

Create a template dataset for icechunk store

delete
delete()

Delete the icechunk repository.

init_repo
init_repo()

Create the icechunk repo, or open it if it already exists.

insert_region_uncooperative
insert_region_uncooperative(
    subset_ds: Dataset, *, region_id: str, branch: str = 'main'
)

Insert region into Icechunk store

Parameters:

  • subset_ds (Dataset) –

    The subset dataset to insert into the Icechunk store.

  • region_id (str) –

    The region ID corresponding to the subset dataset.

  • branch (str, default: 'main' ) –

    The branch to use in the Icechunk repository, by default 'main'.

model_post_init
model_post_init(__context)

Post-initialization to set up prefixes and URIs based on environment.

pretty_paths
pretty_paths() -> None

Pretty print key IcechunkConfig paths and URIs.

This version touches cached properties (e.g., uri, storage) to surface real configuration and types.

processed_regions
processed_regions(*, branch: str = 'main') -> list[str]

Get a list of region IDs that have already been processed.

repo_and_session
repo_and_session(readonly: bool = False, branch: str = 'main') -> dict

Open an icechunk repository and return the session.

wipe
wipe()

Wipe the icechunk repository.
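
A sketch of a typical write flow with IcechunkConfig, based only on the method names and signatures above; the constructor arguments and the subset dataset are placeholders:

import xarray as xr

from ocr.config import IcechunkConfig

icechunk_config = IcechunkConfig()  # assumes env-driven defaults
icechunk_config.init_repo()         # create the repo, or open it if it exists
icechunk_config.create_template()   # write the template dataset

subset_ds = xr.Dataset()  # placeholder; in practice the computed risk subset for the region
if 'y5_x14' not in icechunk_config.processed_regions(branch='main'):
    icechunk_config.insert_region_uncooperative(subset_ds, region_id='y5_x14', branch='main')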

OCRConfig

Bases: BaseSettings

Configuration settings for OCR processing.

Functions
pretty_paths
pretty_paths() -> None

Pretty print key OCRConfig paths and URIs.

This method intentionally touches cached properties that create directories (e.g., via mkdir) so you can verify real locations.

resolve_region_ids
resolve_region_ids(
    provided_region_ids: set[str], *, allow_all_processed: bool = False
) -> RegionIDStatus

Validate provided region IDs against valid + processed sets.

Parameters:

  • provided_region_ids (set[str]) –

    The set of region IDs to validate.

  • allow_all_processed (bool, default: False ) –

    If True, don't raise an error when all regions are already processed. This is useful for production reruns where you want to regenerate vector outputs even if icechunk regions are complete. Default is False.

Returns:

  • RegionIDStatus

    Status object with validation results.

Raises:

  • ValueError

    If no valid unprocessed region IDs remain and allow_all_processed is False.

select_region_ids
select_region_ids(
    region_ids: list[str] | None,
    *,
    all_region_ids: bool = False,
    allow_all_processed: bool = False,
) -> RegionIDStatus

Helper to pick the effective set of region IDs (all or user-provided) and return the validated status object.

Parameters:

  • region_ids (list[str] | None) –

    User-provided region IDs to process.

  • all_region_ids (bool, default: False ) –

    If True, use all valid region IDs instead of user-provided ones. Default is False.

  • allow_all_processed (bool, default: False ) –

    If True, don't raise an error when all regions are already processed. Passed through to resolve_region_ids. Default is False.

Returns:

  • RegionIDStatus

    Status object with validation results.
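
A hedged illustration of how the two helpers above might be used together with a loaded configuration:

from ocr.config import load_config

config = load_config(None)  # OCRConfig built from the current environment

# Validate an explicit set of region IDs
status = config.resolve_region_ids({'y5_x14', 'y10_x2'})

# Or let the helper pick every valid region ID
status_all = config.select_region_ids(None, all_region_ids=True, allow_all_processed=True)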

PyramidConfig

Bases: BaseSettings

Configuration for visualization pyramid / multiscales

Functions
model_post_init
model_post_init(__context)

Post-initialization to set up prefixes and URIs based on environment.

wipe
wipe()

Wipe the pyramid data storage.

VectorConfig

Bases: BaseSettings

Configuration for vector data processing.

Attributes
block_summary_stats_uri cached property
block_summary_stats_uri: UPath

URI for the block summary statistics file.

counties_summary_stats_uri cached property
counties_summary_stats_uri: UPath

URI for the counties summary statistics file.

tracts_summary_stats_uri cached property
tracts_summary_stats_uri: UPath

URI for the tracts summary statistics file.

Functions
model_post_init
model_post_init(__context)

Post-initialization to set up prefixes and URIs based on environment.

pretty_paths
pretty_paths() -> None

Pretty print key VectorConfig paths and URIs.

This method intentionally touches cached properties that create directories (e.g., via mkdir) so you can verify real locations.

upath_delete
upath_delete(path: UPath) -> None

Use UPath to handle deletion in a cloud-agnostic way

wipe
wipe()

Wipe the vector data storage.

Functions

load_config

load_config(file_path: Path | None) -> OCRConfig

Load OCR configuration from an env file (dotenv) or current environment.
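
A brief usage sketch; the .env path is a placeholder:

from pathlib import Path

from ocr.config import load_config

config = load_config(Path('.env'))  # from a dotenv file
config = load_config(None)          # or from the current environment
config.pretty_paths()               # print resolved paths and URIs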

Type definitions

Strongly typed enums for environment, platform, and risk types.

ocr.types

Classes

RiskType

Bases: str, Enum

Available risk types for calculation.


Data access

Datasets

Dataset and Catalog abstractions for Zarr and GeoParquet on S3/local storage.

ocr.datasets

Classes

Catalog

Bases: BaseModel

Base class for datasets catalog.

Functions
__repr__
__repr__() -> str

Return a string representation of the catalog.

__str__
__str__() -> str

Return a string representation of the catalog.

get_dataset
get_dataset(
    name: str,
    version: str | None = None,
    *,
    case_sensitive: bool = True,
    latest: bool = False,
) -> Dataset

Get a dataset by name and optionally version.

Parameters:

  • name (str) –

    Name of the dataset to retrieve

  • version (str, default: None ) –

    Specific version of the dataset. If not provided, returns the dataset if only one version exists, or raises an error if multiple versions exist, unless latest=True.

  • case_sensitive (bool, default: True ) –

    Whether to match dataset names case-sensitively

  • latest (bool, default: False ) –

    If True and version=None, returns the latest version instead of raising an error when multiple versions exist

Returns:

Raises:

  • ValueError

    If multiple versions exist and version is not specified (and latest=False)

  • KeyError

    If no matching dataset is found

Examples:

>>> # Get a dataset with a specific version
>>> catalog.get_dataset('conus-overture-buildings', 'v2025-03-19.1')
>>>
>>> # Get latest version of a dataset
>>> catalog.get_dataset('conus-overture-buildings', latest=True)

Dataset

Bases: BaseModel

Base class for datasets.

Functions
query_geoparquet
query_geoparquet(
    query: str | None = None, *, install_extensions: bool = True
) -> DuckDBPyRelation

Query a geoparquet file using DuckDB.

Parameters:

  • query (str, default: None ) –

    SQL query to execute. If not provided, returns all data.

  • install_extensions (bool, default: True ) –

    Whether to install and load the spatial and httpfs extensions.

Returns:

  • DuckDBPyRelation

    Result of the DuckDB query.

Raises:

  • ValueError

    If dataset is not in 'geoparquet' format.

Example

Example of querying buildings with a converted geometry column:

buildings = catalog.get_dataset('conus-overture-buildings', 'v2025-03-19.1')
result = buildings.query_geoparquet("""
    SELECT
        id,
        roof_material,
        geometry
    FROM read_parquet('{s3_path}')
    WHERE roof_material = 'concrete'
""")

Then convert to GeoDataFrame

gdf = buildings.to_geopandas("""
    SELECT
        id,
        roof_material,
        geometry
    FROM read_parquet('{s3_path}')
    WHERE roof_material = 'concrete'
""")

to_geopandas
to_geopandas(
    query: str | None = None,
    geometry_column='geometry',
    crs: str = 'EPSG:4326',
    target_crs: str | None = None,
    **kwargs,
) -> GeoDataFrame

Convert query results to a GeoPandas GeoDataFrame.

Parameters:

  • query (str, default: None ) –

    SQL query to execute. If not provided, returns all data.

  • geometry_column (str, default: 'geometry' ) –

    The name of the geometry column in the query result.

  • crs (str, default: 'EPSG:4326' ) –

    The coordinate reference system to use for the geometries.

  • target_crs (str, default: None ) –

    The target coordinate reference system to convert the geometries to.

  • **kwargs (dict, default: {} ) –

    Additional keyword arguments passed to query_geoparquet.

Returns:

  • GeoDataFrame

    A GeoPandas GeoDataFrame containing the queried data with geometries.

Raises:

  • ValueError

    If dataset is not in 'geoparquet' format or if the geometry column is not found.

Example

Example of converting buildings to GeoPandas GeoDataFrame - no need for ST_AsText():

buildings = catalog.get_dataset('conus-overture-buildings', 'v2025-03-19.1')
gdf = buildings.to_geopandas("""
    SELECT
        id,
        roof_material,
        geometry
    FROM read_parquet('{s3_path}')
    WHERE roof_material = 'concrete'
""")
gdf.head()

to_xarray
to_xarray(
    *,
    is_icechunk: bool | None = None,
    xarray_open_kwargs: dict | None = None,
    xarray_storage_options: dict | None = None,
) -> Dataset

Convert the dataset to an xarray.Dataset.

Parameters:

  • is_icechunk (bool | None, default: None ) –

    Whether to use icechunk to access the data. If True, only try icechunk; if None, try icechunk first and fall back to direct S3 access if it fails; if False, only use direct S3 access.

  • xarray_open_kwargs (dict, default: None ) –

    Additional keyword arguments to pass to xarray.open_dataset.

  • xarray_storage_options (dict, default: None ) –

    Storage options for S3 access when not using icechunk.

Returns:

Raises:
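
An illustrative access pattern for the methods above; catalog is assumed to be an existing ocr.datasets.Catalog instance, and 'some-zarr-dataset' is a hypothetical dataset name:

dataset = catalog.get_dataset('some-zarr-dataset', latest=True)

# Try icechunk first, fall back to direct S3 access if it fails
ds = dataset.to_xarray(is_icechunk=None)

# Or force direct S3 access with custom open arguments
ds = dataset.to_xarray(is_icechunk=False, xarray_open_kwargs={'chunks': {}})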

CONUS404 helpers

Load CONUS404 variables, compute relative humidity, wind rotation and diagnostics. Geographic selection utilities (point/bbox) with CRS-aware transforms.

ocr.conus404

Functions

compute_relative_humidity

compute_relative_humidity(ds: Dataset) -> DataArray

Compute relative humidity from specific humidity, temperature, and pressure.

Parameters:

  • ds (Dataset) –

    Input dataset containing 'Q2' (specific humidity), 'T2' (temperature in K), and 'PSFC' (pressure in Pa).

Returns:

  • hurs ( DataArray ) –

    Relative humidity as a percentage.

compute_wind_speed_and_direction

compute_wind_speed_and_direction(u10: DataArray, v10: DataArray) -> Dataset

Derive hourly wind speed (m/s) and direction (degrees from) using xclim.

Parameters:

  • u10 (DataArray) –

    U component of wind at 10 m (m/s).

  • v10 (DataArray) –

    V component of wind at 10 m (m/s).

Returns:

  • wind_ds ( Dataset ) –

    Dataset containing wind speed ('sfcWind') and wind direction ('sfcWindfromdir').

load_conus404

load_conus404(add_spatial_constants: bool = True) -> Dataset

Load the CONUS 404 dataset.

Parameters:

  • add_spatial_constants (bool, default: True ) –

    If True, adds spatial constant variables (SINALPHA, COSALPHA) to the dataset.

Returns:

  • ds ( Dataset ) –

    The CONUS 404 dataset.

rotate_winds_to_earth

rotate_winds_to_earth(ds: Dataset) -> tuple[DataArray, DataArray]

Rotate grid-relative 10 m winds (U10,V10) to earth-relative components. Uses SINALPHA / COSALPHA convention from WRF.

Parameters:

  • ds (Dataset) –

    Input dataset containing 'U10', 'V10', 'SINALPHA', and 'COSALPHA'.

Returns:

  • earth_u ( DataArray ) –

    Earth-relative U component of wind at 10 m.

  • earth_v ( DataArray ) –

    Earth-relative V component of wind at 10 m.
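
These helpers compose into a simple diagnostics workflow; a sketch using only the functions documented above:

from ocr.conus404 import (
    compute_relative_humidity,
    compute_wind_speed_and_direction,
    load_conus404,
    rotate_winds_to_earth,
)

ds = load_conus404(add_spatial_constants=True)

earth_u, earth_v = rotate_winds_to_earth(ds)                   # grid-relative -> earth-relative
wind_ds = compute_wind_speed_and_direction(earth_u, earth_v)   # 'sfcWind', 'sfcWindfromdir'
hurs = compute_relative_humidity(ds)                           # relative humidity (%)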


Utilities

General utilities

Helpers for DuckDB (extension loading, S3 secrets), vector sampling, and file transfer.

ocr.utils

Functions

apply_s3_creds

apply_s3_creds(region: str = 'us-west-2', *, con: Any | None = None) -> None

Register AWS credentials as a DuckDB SECRET on the given connection.

Parameters:

  • region (str, default: 'us-west-2' ) –

    AWS region used for S3 access.

  • con (DuckDBPyConnection | None, default: None ) –

    Connection to apply credentials to. If None, uses duckdb's default connection (duckdb.sql), preserving prior behavior.

bbox_tuple_from_xarray_extent

bbox_tuple_from_xarray_extent(
    ds: Dataset, x_name: str = 'x', y_name: str = 'y'
) -> tuple[float, float, float, float]

Creates a bounding box from an Xarray Dataset extent.

Parameters:

  • ds (Dataset) –

    Input Xarray Dataset

  • x_name (str, default: 'x' ) –

    Name of x coordinate, by default 'x'

  • y_name (str, default: 'y' ) –

    Name of y coordinate, by default 'y'

Returns:

  • tuple

    Bounding box tuple in the form: (x_min, y_min, x_max, y_max)

copy_or_upload

copy_or_upload(
    src: UPath,
    dest: UPath,
    overwrite: bool = True,
    chunk_size: int = 16 * 1024 * 1024,
) -> None

Copy a single file from src to dest using UPath/fsspec.

- Uses server-side copy if available on the same filesystem (e.g., s3->s3).
- Falls back to streaming copy otherwise.
- Creates destination parent directories when supported.

Parameters:

  • src (UPath) –

    Source UPath

  • dest (UPath) –

    Destination UPath (file path; if pointing to a directory-like path, src.name is appended)

  • overwrite (bool, default: True ) –

    If False, raises if dest exists

  • chunk_size (int, default: 16 * 1024 * 1024 ) –

    Buffer size for streaming copies

Returns:

  • None
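
A brief usage sketch; both paths are placeholders:

from upath import UPath

from ocr.utils import copy_or_upload

src = UPath('outputs/stats.parquet')                  # placeholder local path
dest = UPath('s3://my-bucket/outputs/stats.parquet')  # placeholder S3 path
copy_or_upload(src, dest, overwrite=True)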

extract_points

extract_points(gdf: GeoDataFrame, da: DataArray) -> DataArray

Extract/sample points from a GeoDataFrame to an Xarray DataArray.

Parameters:

  • gdf (GeoDataFrame) –

    Input geopandas GeoDataFrame. Geometry should be points

  • da (DataArray) –

    Input Xarray DataArray

Returns:

  • DataArray

    DataArray with geometry sampled

Notes

This operation may emit: UserWarning: Geometry is in a geographic CRS. Results from 'centroid' are likely incorrect. Use 'GeoSeries.to_crs()' to re-project geometries to a projected CRS before this operation.

The relatively small size of a building footprint means the centroid shifts only negligibly when computed in EPSG:4326 versus EPSG:5070.

TODO: Should/can this be a DataArray for typing?

geo_sel

geo_sel(
    ds: Dataset,
    *,
    lon: float | None = None,
    lat: float | None = None,
    bbox: tuple[float, float, float, float] | None = None,
    method: str = 'nearest',
    tolerance: float | None = None,
    crs_wkt: str | None = None,
)

Geographic selection helper.

Exactly one of the following must be provided:

- lon and lat
- lons and lats
- bbox=(west, south, east, north)

Parameters:

  • ds (Dataset) –

    Input dataset with x, y coordinates and a valid 'crs' variable with WKT

  • lon (float, default: None ) –

    Longitude of point to select, by default None

  • lat (float, default: None ) –

    Latitude of point to select, by default None

  • bbox (tuple, default: None ) –

    Bounding box to select (west, south, east, north), by default None

  • method (str, default: 'nearest' ) –

    Method to use for point selection, by default 'nearest'

  • tolerance (float, default: None ) –

    Tolerance (in units of the dataset's CRS) for point selection, by default None

  • crs_wkt (str, default: None ) –

    WKT string for the dataset's CRS. If None, attempts to read from ds.crs.attrs['crs_wkt'].

Returns:

  • Dataset

    Single point: time dimension only. Multiple points: adds a 'point' dimension. BBox: retains the y, x subset.
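
A hedged sketch of point and bounding-box selection; ds is assumed to be a dataset loaded elsewhere with x/y coordinates and a 'crs' variable carrying WKT:

from ocr.utils import geo_sel

# Single point, nearest-neighbour lookup
point = geo_sel(ds, lon=-120.5, lat=38.2, method='nearest')

# Bounding-box subset (west, south, east, north)
subset = geo_sel(ds, bbox=(-121.0, 37.5, -120.0, 38.5))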

get_temp_dir

get_temp_dir() -> Path | None

Get optimal temporary directory path for the current environment.

Returns the current working directory if running in /scratch (e.g., on Coiled clusters), otherwise returns None to use the system default temp directory.

On Coiled clusters, /scratch is bind-mounted directly to the NVMe disk, avoiding Docker overlay filesystem overhead and providing better I/O performance and more available space compared to /tmp which sits on the Docker overlay.

Returns:

  • Path | None

    Current working directory if in /scratch, None otherwise (uses system default).

Examples:

>>> import tempfile
>>> from ocr.utils import get_temp_dir
>>> with tempfile.TemporaryDirectory(dir=get_temp_dir()) as tmpdir:
...     # tmpdir will be in /scratch on Coiled, system temp otherwise
...     pass

install_load_extensions

install_load_extensions(
    aws: bool = True,
    spatial: bool = True,
    httpfs: bool = True,
    con: Any | None = None,
) -> None

Installs and applies duckdb extensions.

Parameters:

  • aws (bool, default: True ) –

    Install and load AWS extension, by default True

  • spatial (bool, default: True ) –

    Install and load SPATIAL extension, by default True

  • httpfs (bool, default: True ) –

    Install and load HTTPFS extension, by default True

  • con (DuckDBPyConnection | None, default: None ) –

    Connection to apply extensions to. If None, uses duckdb's default connection.
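
A sketch combining the two DuckDB helpers above with an explicit connection:

import duckdb

from ocr.utils import apply_s3_creds, install_load_extensions

con = duckdb.connect()
install_load_extensions(aws=True, spatial=True, httpfs=True, con=con)
apply_s3_creds(region='us-west-2', con=con)

# con can now query spatial data on S3, e.g. via read_parquet(...)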

Testing utilities

Snapshot testing extensions for xarray and GeoPandas.

ocr.testing

Classes

GeoDataFrameSnapshotExtension

Bases: SingleFileSnapshotExtension

Snapshot extension for GeoPandas GeoDataFrames stored as parquet.

Supports both local and remote (S3) storage via environment variable configuration:

- SNAPSHOT_STORAGE_PATH: Base path for snapshots (local or s3://bucket/path). Default: s3://carbonplan-scratch/snapshots (configured in tests/conftest.py)

Examples:

# Use default S3 storage (no env var needed)
pytest tests/test_snapshot.py --snapshot-update

# Override with local storage
SNAPSHOT_STORAGE_PATH=tests/__snapshots__ pytest tests/

# Override with different S3 bucket
SNAPSHOT_STORAGE_PATH=s3://my-bucket/snapshots pytest tests/
Functions
diff_lines
diff_lines(serialized_data: Any, snapshot_data: Any) -> Iterator[str]

Generate diff lines for test output.

dirname classmethod
dirname(*, test_location: PyTestLocation) -> str

Return the directory for storing snapshots.

get_location classmethod
get_location(*, test_location: PyTestLocation, index: SnapshotIndex = 0) -> str

Get the full snapshot location path.

Override to properly handle S3 paths using upath instead of os.path.join.

get_snapshot_name classmethod
get_snapshot_name(
    *, test_location: PyTestLocation, index: SnapshotIndex = 0
) -> str

Generate snapshot name based on test name.

Sanitizes the test name to replace problematic characters (e.g., brackets from parametrized tests) with underscores for valid file paths.

matches
matches(*, serialized_data: Any, snapshot_data: Any) -> bool

Check if serialized data matches snapshot using GeoDataFrame comparison.

read_snapshot_data_from_location
read_snapshot_data_from_location(
    *, snapshot_location: str, snapshot_name: str, session_id: str
) -> GeoDataFrame | None

Read parquet snapshot from disk.

serialize
serialize(data: SerializableData, **kwargs: Any) -> Any

Validate that data is a GeoDataFrame. Returns the data unchanged.

write_snapshot_collection classmethod
write_snapshot_collection(*, snapshot_collection: SnapshotCollection) -> None

Write snapshot collection to parquet format (local or remote).

XarraySnapshotExtension

Bases: SingleFileSnapshotExtension

Snapshot extension for xarray DataArrays and Datasets stored as zarr.

Supports both local and remote (S3) storage via environment variable configuration:

- SNAPSHOT_STORAGE_PATH: Base path for snapshots (local or s3://bucket/path). Default: s3://carbonplan-scratch/snapshots (configured in tests/conftest.py)

Examples:

# Use default S3 storage (no env var needed)
pytest tests/test_snapshot.py --snapshot-update

# Override with local storage
SNAPSHOT_STORAGE_PATH=tests/__snapshots__ pytest tests/

# Override with different S3 bucket
SNAPSHOT_STORAGE_PATH=s3://my-bucket/snapshots pytest tests/
Functions
diff_lines
diff_lines(serialized_data: Any, snapshot_data: Any) -> Iterator[str]

Generate diff lines for test output.

dirname classmethod
dirname(*, test_location: PyTestLocation) -> str

Return the directory for storing snapshots.

get_location classmethod
get_location(*, test_location: PyTestLocation, index: SnapshotIndex = 0) -> str

Get the full snapshot location path.

Override to properly handle S3 paths using upath instead of os.path.join.

get_snapshot_name classmethod
get_snapshot_name(
    *, test_location: PyTestLocation, index: SnapshotIndex = 0
) -> str

Generate snapshot name based on test name.

Sanitizes the test name to replace problematic characters (e.g., brackets from parametrized tests) with underscores for valid file paths.

matches
matches(*, serialized_data: Any, snapshot_data: Any) -> bool

Check if serialized data matches snapshot using approximate comparison.

Uses assert_allclose instead of assert_equal to handle platform-specific numerical differences from OpenCV and scipy operations between macOS and Linux.

read_snapshot_data_from_location
read_snapshot_data_from_location(
    *, snapshot_location: str, snapshot_name: str, session_id: str
) -> Dataset | None

Read zarr snapshot from disk.

serialize
serialize(data: SerializableData, **kwargs: Any) -> Any

Convert DataArray to Dataset for consistent zarr storage. Returns the data unchanged.

write_snapshot_collection classmethod
write_snapshot_collection(*, snapshot_collection: SnapshotCollection) -> None

Write snapshot collection to zarr format (local or remote).
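
Both extensions plug into syrupy; a hedged pytest sketch using syrupy's use_extension hook (fixture names and the test body are illustrative):

import pytest

from ocr.testing import GeoDataFrameSnapshotExtension, XarraySnapshotExtension


@pytest.fixture
def xr_snapshot(snapshot):
    # snapshot is syrupy's built-in fixture
    return snapshot.use_extension(XarraySnapshotExtension)


@pytest.fixture
def gdf_snapshot(snapshot):
    return snapshot.use_extension(GeoDataFrameSnapshotExtension)


def test_pipeline_output(xr_snapshot):
    ds = ...  # placeholder: an xarray.Dataset produced by the code under test
    assert ds == xr_snapshot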


Risk analysis

Fire risk

Core fire/wind risk utilities used by the pipeline (kernels, wind classification, risk composition).

ocr.risks.fire

Functions

apply_wind_directional_convolution

apply_wind_directional_convolution(
    da: DataArray,
    iterations: int = 3,
    kernel_size: float = 81.0,
    circle_diameter: float = 35.0,
) -> Dataset

Apply a directional convolution to a DataArray.

Parameters:

  • da (DataArray) –

    The DataArray to apply the convolution to.

  • iterations (int, default: 3 ) –

    The number of iterations to apply the convolution, by default 3

  • kernel_size (float, default: 81.0 ) –

    The size of the kernel, by default 81.0

  • circle_diameter (float, default: 35.0 ) –

    The diameter of the circle, by default 35.0

Returns:

  • ds ( Dataset ) –

    The Dataset with the directional convolution applied

calculate_wind_adjusted_risk

calculate_wind_adjusted_risk(
    *, x_slice: slice, y_slice: slice, buffer: float = 0.15
) -> Dataset

Calculate wind-adjusted fire risk using climate run and wildfire risk datasets.

Parameters:

  • x_slice (slice) –

    Slice object for selecting longitude range.

  • y_slice (slice) –

    Slice object for selecting latitude range.

  • buffer (float, default: 0.15 ) –

    Buffer size in degrees to add around the region for edge effect handling (default 0.15). For 30m EPSG:4326 data, 0.15 degrees ≈ 16.7 km ≈ 540 pixels. This buffer ensures neighborhood operations (convolution, Gaussian smoothing) have adequate context at boundaries.

Returns:

  • fire_risk ( Dataset ) –

    Dataset containing wind-adjusted fire risk variables.
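
A sketch connecting region IDs to the risk calculation, assuming region_id_slice_lookup (documented under ocr.config) returns (y_slice, x_slice) as described there:

from ocr.config import ChunkingConfig
from ocr.risks.fire import calculate_wind_adjusted_risk

chunking = ChunkingConfig()
y_slice, x_slice = chunking.region_id_slice_lookup('y5_x14')

fire_risk = calculate_wind_adjusted_risk(x_slice=x_slice, y_slice=y_slice, buffer=0.15)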

classify_wind_directions

classify_wind_directions(wind_direction_ds: DataArray) -> DataArray

Classify wind directions into 8 cardinal directions (0-7). The classification is:

0: North (337.5-22.5)
1: Northeast (22.5-67.5)
2: East (67.5-112.5)
3: Southeast (112.5-157.5)
4: South (157.5-202.5)
5: Southwest (202.5-247.5)
6: West (247.5-292.5)
7: Northwest (292.5-337.5)

Parameters:

  • wind_direction_ds (DataArray) –

    DataArray containing wind direction in degrees (0-360)

Returns:

  • result ( DataArray ) –

    DataArray with wind directions classified as integers 0-7
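
A small illustration on synthetic directions; the expected classes follow the mapping above:

import numpy as np
import xarray as xr

from ocr.risks.fire import classify_wind_directions

direction = xr.DataArray(np.array([10.0, 45.0, 180.0, 300.0]), dims='time')
classes = classify_wind_directions(direction)  # expected: 0 (N), 1 (NE), 4 (S), 7 (NW)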

compute_modal_wind_direction

compute_modal_wind_direction(distribution: DataArray) -> Dataset

Compute the modal wind direction from the wind direction distribution.

Parameters:

  • distribution (DataArray) –

    Wind direction distribution.

Returns:

  • mode ( Dataset ) –

    Modal wind direction.

compute_wind_direction_distribution

compute_wind_direction_distribution(
    direction: DataArray, fire_weather_mask: DataArray
) -> Dataset

Compute the wind direction distribution during fire weather conditions.

Parameters:

  • direction (DataArray) –

    Wind direction in degrees (0-360).

  • fire_weather_mask (DataArray) –

    Boolean mask indicating fire weather conditions.

Returns:

  • wind_direction_hist ( Dataset ) –

    Wind direction histogram during fire weather conditions.

create_weighted_composite_bp_map

create_weighted_composite_bp_map(
    bp: Dataset,
    wind_direction_distribution: DataArray,
    *,
    distribution_direction_dim: str = 'wind_direction',
    weight_sum_tolerance: float = 1e-05,
) -> DataArray

Create a weighted composite burn probability map using wind direction distribution.

Parameters:

  • bp (Dataset) –

    Dataset containing 9 directional burn probability layers with variables named ['N','NE','E','SE','S','SW','W','NW','circular'] produced by apply_wind_directional_convolution.

  • wind_direction_distribution (DataArray) –

    Probability distribution over 8 cardinal directions with dimension 'wind_direction' and length 8, matching direction labels: ['N','NE','E','SE','S','SW','W','NW'] (order must align). Values should sum to 1 where fire-weather hours exist; may be all 0 where none exist.

  • distribution_direction_dim (str, default: 'wind_direction' ) –

    Name of the dimension in wind_direction_distribution that holds the direction labels, by default 'wind_direction'.

  • weight_sum_tolerance (float, default: 1e-05 ) –

    Tolerance for deviation from 1.0 in the sum of weights, by default 1e-05.

Returns:

  • weighted ( DataArray ) –

    Weighted composite burn probability with same spatial dims as inputs. Name: 'wind_weighted_bp'. Missing (all-zero) distributions yield NaN.

create_wind_informed_burn_probability

create_wind_informed_burn_probability(
    wind_direction_distribution_30m_4326: DataArray, riley_270m_5070: Dataset
) -> DataArray

Create wind-informed burn probability dataset by applying directional convolution and creating a weighted composite burn probability map.

Parameters:

  • wind_direction_distribution_30m_4326 (DataArray) –

    Wind direction distribution data at 30m resolution in EPSG:4326 projection.

  • riley_270m_5070 (DataArray) –

    Riley et al. (2011) burn probability data at 270m resolution in EPSG:5070 projection.

Returns:

  • smoothed_final_bp ( DataArray ) –

    Smoothed wind-informed burn probability data at 30m resolution in EPSG:4326 projection.

direction_histogram

direction_histogram(data_array: DataArray) -> DataArray

Compute direction histogram on xarray DataArray with dask chunks.

Parameters:

  • data_array (DataArray) –

    Input data array containing direction indices (expected to be integers 0-7)

Returns:

  • DataArray

    Normalized histogram counts as a probability distribution

fosberg_fire_weather_index

fosberg_fire_weather_index(
    hurs: DataArray, T2: DataArray, sfcWind: DataArray
) -> DataArray

Calculate the Fosberg Fire Weather Index (FFWI) from relative humidity, temperature, and wind speed. Taken from wikifire.wsl.ch/tiki-indexb1d5.html?page=Fosberg+fire+weather+index&structure=Fire. hurs, T2, and sfcWind are arrays.

Parameters:

  • hurs (DataArray) –

    Relative humidity in percentage (0-100).

  • T2 (DataArray) –

    Temperature

  • sfcWind (DataArray) –

    Wind speed in meters per second.

Returns:

  • DataArray

    Fosberg Fire Weather Index (FFWI).

generate_weights

generate_weights(
    method: Literal['skewed', 'circular_focal_mean'] = 'skewed',
    kernel_size: float = 81.0,
    circle_diameter: float = 35.0,
) -> ndarray

Generate a 2D array of weights for a circular kernel.

Parameters:

  • method (str, default: 'skewed' ) –

    The method to use for generating weights. Options are 'skewed' or 'circular_focal_mean'. 'skewed' generates an elliptical kernel to simulate wind directionality. 'circular_focal_mean' generates a circular kernel, by default 'skewed'

  • kernel_size (float, default: 81.0 ) –

    The size of the kernel, by default 81.0

  • circle_diameter (float, default: 35.0 ) –

    The diameter of the circle, by default 35.0

Returns:

  • weights ( ndarray ) –

    A 2D array of weights for the circular kernel.

generate_wind_directional_kernels

generate_wind_directional_kernels(
    kernel_size: float = 81.0, circle_diameter: float = 35.0
) -> dict[str, ndarray]

Generate a dictionary of 2D arrays of weights for circular kernels oriented in different directions.

Parameters:

  • kernel_size (float, default: 81.0 ) –

    The size of the kernel, by default 81.0

  • circle_diameter (float, default: 35.0 ) –

    The diameter of the circle, by default 35.0

Returns:

  • kernels ( dict[str, ndarray] ) –

    A dictionary of 2D arrays of weights for circular kernels oriented in different directions.
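
A quick sketch inspecting the kernels produced by the two generators above:

from ocr.risks.fire import generate_weights, generate_wind_directional_kernels

weights = generate_weights(method='circular_focal_mean', kernel_size=81.0, circle_diameter=35.0)
kernels = generate_wind_directional_kernels(kernel_size=81.0, circle_diameter=35.0)

print(weights.shape)    # 2D numpy array of kernel weights
print(sorted(kernels))  # direction labels keyed in the dict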


Internal pipeline modules

Internal API

These modules are used internally by the pipeline and are not intended for direct public consumption. They are documented here for completeness and advanced use cases.

Batch managers

Orchestration backends for local and Coiled execution.

ocr.deploy.managers

Classes

AbstractBatchManager

Bases: BaseModel

Abstract base class for batch managers.

Functions
submit_job
submit_job(command: str, name: str, kwargs: dict[str, Any])

Submit a batch job.

wait_for_completion
wait_for_completion(exit_on_failure: bool = False)

Wait for all submitted batch jobs to complete.

CoiledBatchManager

Bases: AbstractBatchManager

Coiled batch manager for managing batch jobs.

Functions
submit_job
submit_job(command: str, name: str, kwargs: dict[str, Any]) -> str

Submit a job to Coiled batch.

Parameters:

  • command (str) –

    The command to run.

  • name (str) –

    The name of the job.

  • kwargs (dict) –

    Additional keyword arguments to pass to coiled.batch.run.

Returns:

  • job_id ( str ) –

    The ID of the submitted job.

wait_for_completion
wait_for_completion(exit_on_failure: bool = False)

Wait for all tracked jobs to complete.

Parameters:

  • exit_on_failure (bool, default: False ) –

    If True, raise an Exception immediately when a job failure is detected.

Returns:

  • completed, failed : tuple[set[str], set[str]]

    A tuple of (completed_job_ids, failed_job_ids). If exit_on_failure is True and a failure is encountered the method will raise before returning.

LocalBatchManager

Bases: AbstractBatchManager

Local batch manager for running jobs locally using subprocess.

Functions
__del__
__del__()

Clean up the executor when the manager is destroyed.

model_post_init
model_post_init(__context)

Initialize the thread pool executor after model creation.

submit_job
submit_job(command: str, name: str, kwargs: dict[str, Any]) -> str

Submit a job to run locally.

Parameters:

  • command (str) –

    The command to run.

  • name (str) –

    The name of the job.

  • kwargs (dict) –

    Additional keyword arguments to pass to subprocess.run.

Returns:

  • job_id ( str ) –

    The ID of the submitted job.

wait_for_completion
wait_for_completion(exit_on_failure: bool = False)

Wait for all tracked jobs to complete.

Parameters:

  • exit_on_failure (bool, default: False ) –

    If True, raise an Exception immediately when a job failure is detected.

Returns:

  • completed, failed : tuple[set[str], set[str]]

    A tuple of (completed_job_ids, failed_job_ids). If exit_on_failure is True and a failure is encountered the method will raise before returning.
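
An illustrative orchestration loop; the constructor fields for the managers are not documented here, so the instantiation below is an assumption:

from ocr.deploy.managers import CoiledBatchManager, LocalBatchManager

manager = LocalBatchManager()  # or CoiledBatchManager(); constructor fields assumed to default

for region_id in ['y5_x14', 'y10_x2']:
    manager.submit_job(
        command=f'ocr process-region {region_id}',
        name=f'process-{region_id}',
        kwargs={},
    )

completed, failed = manager.wait_for_completion(exit_on_failure=False)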

CLI application

Command-line interface exposed as the ocr command. For detailed usage and options, see the tutorials section.

ocr

Run OCR deployment pipeline on Coiled

Usage:

ocr [OPTIONS] COMMAND [ARGS]...

Options:

  --install-completion  Install completion for the current shell.
  --show-completion     Show completion for the current shell, to copy it or
                        customize the installation.
  --help                Show this message and exit.

Subcommands

ocr aggregate-region-risk-summary-stats

Generate time-horizon based statistical summaries for county and tract level PMTiles creation

Usage:

ocr aggregate-region-risk-summary-stats [OPTIONS]

Options:

  -e, --env-file PATH            Path to the environment variables file. These
                                 will be used to set up the OCRConfiguration
  -p, --platform [coiled|local]  If set, schedule this command on the
                                 specified platform instead of running inline.
  --vm-type TEXT                 Coiled VM type override (Coiled only).
                                 [default: c8g.16xlarge]
  --help                         Show this message and exit.

ocr create-building-pmtiles

Create PMTiles from the consolidated geoparquet file.

Usage:

ocr create-building-pmtiles [OPTIONS]

Options:

  -e, --env-file PATH            Path to the environment variables file. These
                                 will be used to set up the OCRConfiguration
  -p, --platform [coiled|local]  If set, schedule this command on the
                                 specified platform instead of running inline.
  --vm-type TEXT                 Coiled VM type override (Coiled only).
                                 [default: c8g.8xlarge]
  --disk-size INTEGER            Disk size in GB (Coiled only).  [default:
                                 250]
  --help                         Show this message and exit.

ocr create-pyramid

Create Pyramid

Usage:

ocr create-pyramid [OPTIONS]

Options:

  -e, --env-file PATH            Path to the environment variables file. These
                                 will be used to set up the OCRConfiguration
  -p, --platform [coiled|local]  If set, schedule this command on the
                                 specified platform instead of running inline.
  --vm-type TEXT                 Coiled VM type override (Coiled only).
                                 [default: m8g.16xlarge]
  --help                         Show this message and exit.

ocr create-regional-pmtiles

Create PMTiles for regional risk statistics (counties and tracts).

Usage:

ocr create-regional-pmtiles [OPTIONS]

Options:

  -e, --env-file PATH            Path to the environment variables file. These
                                 will be used to set up the OCRConfiguration
  -p, --platform [coiled|local]  If set, schedule this command on the
                                 specified platform instead of running inline.
  --vm-type TEXT                 Coiled VM type override (Coiled only).
                                 [default: c8g.8xlarge]
  --disk-size INTEGER            Disk size in GB (Coiled only).  [default:
                                 250]
  --help                         Show this message and exit.

ocr ingest-data

Ingest and process input datasets

Usage:

ocr ingest-data [OPTIONS] COMMAND [ARGS]...

Options:

  --help  Show this message and exit.

Subcommands

  • download: Download raw source data for a dataset.
  • list-datasets: List all available datasets that can be ingested.
  • process: Process downloaded data and upload to S3/Icechunk.
  • run-all: Run the complete pipeline: download, process, and cleanup.

ocr ingest-data download

Download raw source data for a dataset.

Usage:

ocr ingest-data download [OPTIONS] DATASET

Options:

  DATASET    Name of the dataset to download  [required]
  --dry-run  Preview operations without executing
  --debug    Enable debug logging
  --help     Show this message and exit.

ocr ingest-data list-datasets

List all available datasets that can be ingested.

Usage:

ocr ingest-data list-datasets [OPTIONS]

Options:

  --help  Show this message and exit.

ocr ingest-data process

Process downloaded data and upload to S3/Icechunk.

Usage:

ocr ingest-data process [OPTIONS] DATASET

Options:

  DATASET                       Name of the dataset to process  [required]
  --dry-run                     Preview operations without executing
  --use-coiled                  Use Coiled for distributed processing
  --software TEXT               Software environment to use (required if
                                --use-coiled is set)
  --debug                       Enable debug logging
  --overture-data-type TEXT     For overture-maps: which data to process
                                (buildings, addresses, or both)  [default:
                                both]
  --census-geography-type TEXT  For census-tiger: which geography to process
                                (blocks, tracts, counties, or all)  [default:
                                all]
  --census-subset-states TEXT   For census-tiger: subset of states to process
                                (e.g., California Oregon)
  --help                        Show this message and exit.

ocr ingest-data run-all

Run the complete pipeline: download, process, and cleanup.

Usage:

ocr ingest-data run-all [OPTIONS] DATASET

Options:

  DATASET                       Name of the dataset to process  [required]
  --dry-run                     Preview operations without executing
  --use-coiled                  Use Coiled for distributed processing
  --debug                       Enable debug logging
  --overture-data-type TEXT     For overture-maps: which data to process
                                (buildings, addresses, or both)  [default:
                                both]
  --census-geography-type TEXT  For census-tiger: which geography to process
                                (blocks, tracts, counties, or all)  [default:
                                all]
  --census-subset-states TEXT   For census-tiger: subset of states to process
                                (e.g., California Oregon)
  --help                        Show this message and exit.

ocr partition-buildings

Partition buildings geoparquet by state and county FIPS codes.

Usage:

ocr partition-buildings [OPTIONS]

Options:

  -e, --env-file PATH            Path to the environment variables file. These
                                 will be used to set up the OCRConfiguration
  -p, --platform [coiled|local]  If set, schedule this command on the
                                 specified platform instead of running inline.
  --vm-type TEXT                 Coiled VM type override (Coiled only).
                                 [default: c8g.12xlarge]
  --help                         Show this message and exit.

ocr process-region

Calculate and write risk for a given region to Icechunk CONUS template.

Usage:

ocr process-region [OPTIONS] REGION_ID

Options:

  -e, --env-file PATH            Path to the environment variables file. These
                                 will be used to set up the OCRConfiguration
  REGION_ID                      Region ID to process, e.g., y10_x2
                                 [required]
  -t, --risk-type [fire]         Type of risk to calculate  [default: fire]
  -p, --platform [coiled|local]  If set, schedule this command on the
                                 specified platform instead of running inline.
  --vm-type TEXT                 Coiled VM type override (Coiled only).
  --init-repo                    Initialize Icechunk repository (if not
                                 already initialized).
  --help                         Show this message and exit.

ocr run

Run the OCR deployment pipeline. This will process regions, aggregate geoparquet files, and create PMTiles layers for the specified risk type.

Usage:

ocr run [OPTIONS]

Options:

  -e, --env-file PATH             Path to the environment variables file.
                                  These will be used to set up the
                                  OCRConfiguration
  -r, --region-id TEXT            Region IDs to process, e.g., y10_x2
  --all-region-ids                Process all valid region IDs
  -t, --risk-type [fire]          Type of risk to calculate  [default: fire]
  --write-regional-stats          Write aggregated statistical summaries for
                                  each region (one file per region type with
                                  stats like averages, medians, percentiles,
                                  and histograms)
  --create-pyramid                Create ndpyramid / multiscale zarr for web-
                                  visualization
  -p, --platform [coiled|local]   Platform to run the pipeline on  [default:
                                  local]
  --wipe                          Wipe the icechunk and vector data storages
                                  before running the pipeline
  --dispatch-platform [coiled|local]
                                  If set, schedule this run command on the
                                  specified platform instead of running
                                  inline.
  --vm-type TEXT                  VM type override for dispatch-platform
                                  (Coiled only).
  --process-retries INTEGER RANGE
                                  Number of times to retry failed process-
                                  region tasks (Coiled only). 0 disables
                                   retries.  [default: 2; x>=0]
  --help                          Show this message and exit.

ocr write-aggregated-region-analysis-files

Write aggregated statistical summaries for each region (county and tract).

Creates one file per region type containing aggregated statistics for ALL regions, including building counts, average/median risk values, percentiles (p90, p95, p99), and histograms. Outputs in geoparquet, geojson, and csv formats.

Usage:

ocr write-aggregated-region-analysis-files [OPTIONS]

Options:

  -e, --env-file PATH            Path to the environment variables file. These
                                 will be used to set up the OCRConfiguration
  -p, --platform [coiled|local]  If set, schedule this command on the
                                 specified platform instead of running inline.
  --vm-type TEXT                 Coiled VM type override (Coiled only).
                                 [default: r8g.4xlarge]
  --help                         Show this message and exit.