
API reference

This page provides a structured, auto-generated reference for the OCR Python package using mkdocstrings. Each section links to the corresponding module(s) and surfaces docstrings, type hints, and signatures.


Package overview

High-level package entry points and public exports.

ocr


Core modules

Configuration

Configuration models for storage, chunking, Coiled, and processing settings.

ocr.config

Classes

ChunkingConfig

Bases: BaseSettings

Attributes
chunk_info cached property
chunk_info: dict

Get information about the dataset's chunks

extent_as_tuple_5070 cached property
extent_as_tuple_5070

Get extent in EPSG:5070 projection as tuple (xmin, xmax, ymin, ymax)

valid_region_ids cached property
valid_region_ids: list

Generate valid region IDs by checking which regions contain non-null data.

Returns:

  • list

    List of valid region IDs (e.g., 'y1_x3', 'y2_x4', etc.)

Functions
bbox_from_wgs84
bbox_from_wgs84(xmin: float, ymin: float, xmax: float, ymax: float)
chunk_id_to_slice
chunk_id_to_slice(chunk_id: tuple) -> tuple

Convert a chunk ID (iy, ix) to corresponding array slices

Parameters:

  • chunk_id (tuple) –

    The chunk identifier as a tuple (iy, ix), where iy is the index along the y-dimension and ix is the index along the x-dimension.

Returns:

  • chunk_slices ( tuple[slice] ) –

    A tuple of slices (y_slice, x_slice) to extract data for this chunk

chunks_to_slices
chunks_to_slices(chunks: dict) -> dict

Create a dict of chunk_ids and slices from input chunk dict

Parameters:

  • chunks (dict) –

    Dictionary with chunk sizes for 'longitude' and 'latitude'

Returns:

  • dict

    Dictionary with chunk IDs as keys and corresponding slices as values

get_chunk_mapping
get_chunk_mapping() -> dict[str, tuple[int, int]]

Returns a dict of region_ids and their corresponding chunk_indexes.

Returns:

  • chunk_mapping ( dict ) –

    Dictionary with region IDs as keys and corresponding chunk indexes (iy, ix) as values

get_chunks_for_bbox
get_chunks_for_bbox(bbox: Polygon | tuple) -> list[tuple[int, int]]

Find all chunks that intersect with the given bounding box

Parameters:

  • bbox (BoundingBox or tuple) –

    Bounding box to check for intersection. If tuple, format is (minx, miny, maxx, maxy)

Returns:

  • list of tuples

    List of (iy, ix) tuples identifying the intersecting chunks
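
For orientation, a minimal sketch combining the chunk helpers above; it assumes these are methods on a ChunkingConfig instance and that the default settings are sufficient:

from ocr.config import ChunkingConfig

chunking = ChunkingConfig()  # assumes env/default settings are sufficient

# Find chunks intersecting a bounding box (minx, miny, maxx, maxy)
bbox = (-122.5, 37.0, -121.5, 38.0)  # hypothetical extent
chunk_ids = chunking.get_chunks_for_bbox(bbox)

# Convert each chunk ID (iy, ix) into (y_slice, x_slice) array slices
slices = [chunking.chunk_id_to_slice(chunk_id) for chunk_id in chunk_ids]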

index_to_coords
index_to_coords(x_idx: int, y_idx: int) -> tuple[float, float]

Convert array indices to EPSG:4326 coordinates

Parameters:

  • x_idx (int) –

    Index along the x-dimension (longitude)

  • y_idx (int) –

    Index along the y-dimension (latitude)

Returns:

  • x, y : tuple[float, float]

    Corresponding EPSG:4326 coordinates (longitude, latitude)

plot_all_chunks
plot_all_chunks(color_by_size: bool = False) -> None

Plot all data chunks across the entire CONUS with their indices as labels

Parameters:

  • color_by_size (bool, default: False ) –

    If True, color chunks based on their size (useful to identify irregularities)

region_id_chunk_lookup
region_id_chunk_lookup(region_id: str) -> tuple

Given a region_id, e.g. 'y5_x14', return the corresponding chunk (5, 14).

Parameters:

  • region_id (str) –

    The region_id for chunk_id lookup.

Returns:

  • index ( tuple[int, int] ) –

    The corresponding chunk (iy, ix) for the given region_id.

region_id_slice_lookup
region_id_slice_lookup(region_id: str) -> tuple

Given a region_id, e.g. 'y5_x14', return the corresponding slices, e.g. (slice(np.int64(30000), np.int64(36000), None), slice(np.int64(85500), np.int64(90000), None)).

Parameters:

  • region_id (str) –

    The region_id for chunk_id lookup.

Returns:

  • indexer ( tuple[slice] ) –

    The corresponding slices (y_slice, x_slice) for the given region_id.

region_id_to_latlon_slices
region_id_to_latlon_slices(region_id: str) -> tuple

Get latitude and longitude slices from region_id

Returns (lat_slice, lon_slice) where lat_slice.start < lat_slice.stop and lon_slice.start < lon_slice.stop (lower-left origin, lat ascending).
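
The region-ID helpers are closely related; a hypothetical round-trip, again assuming a default ChunkingConfig:

from ocr.config import ChunkingConfig

chunking = ChunkingConfig()

chunk = chunking.region_id_chunk_lookup('y5_x14')              # -> (5, 14)
y_slice, x_slice = chunking.region_id_slice_lookup('y5_x14')   # array slices
lat_slice, lon_slice = chunking.region_id_to_latlon_slices('y5_x14')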

visualize_chunks_on_conus
visualize_chunks_on_conus(
    chunks: list[tuple[int, int]] | None = None,
    color_by_size: bool = False,
    highlight_chunks: list[tuple[int, int]] | None = None,
    include_all_chunks: bool = False,
) -> None

Visualize specified chunks on CONUS map

Parameters:

  • chunks (list of tuples, default: None ) –

    List of (iy, ix) tuples specifying chunks to visualize. If None, all chunks are shown.

  • color_by_size (bool, default: False ) –

    If True, color chunks based on their size

  • highlight_chunks (list of tuples, default: None ) –

    List of (iy, ix) tuples specifying chunks to highlight

  • include_all_chunks (bool, default: False ) –

    If True, show all chunks in background with low opacity

IcechunkConfig

Bases: BaseSettings

Configuration for icechunk processing.

Attributes
uri cached property
uri: UPath

Return the URI for the icechunk repository.

Functions
commit_messages_ancestry
commit_messages_ancestry(branch: str = 'main') -> list[str]

Get the commit messages ancestry for the icechunk repository.

create_template
create_template()

Create a template dataset for icechunk store

delete
delete()

Delete the icechunk repository.

init_repo
init_repo()

Create the icechunk repo, or open it if it already exists.

insert_region_uncooperative
insert_region_uncooperative(
    subset_ds: Dataset, *, region_id: str, branch: str = 'main'
)

Insert region into Icechunk store

Parameters:

  • subset_ds (Dataset) –

    The subset dataset to insert into the Icechunk store.

  • region_id (str) –

    The region ID corresponding to the subset dataset.

  • branch (str, default: 'main' ) –

    The branch to use in the Icechunk repository, by default 'main'.

model_post_init
model_post_init(__context)

Post-initialization to set up prefixes and URIs based on environment.

pretty_paths
pretty_paths() -> None

Pretty print key IcechunkConfig paths and URIs.

This version touches cached properties (e.g., uri, storage) to surface real configuration and types.

processed_regions
processed_regions(*, branch: str = 'main') -> list[str]

Get a list of region IDs that have already been processed.

repo_and_session
repo_and_session(readonly: bool = False, branch: str = 'main') -> dict

Open an icechunk repository and return the session.

wipe
wipe()

Wipe the icechunk repository.
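
A sketch of a typical write flow with IcechunkConfig, based only on the method names and signatures above; the constructor arguments and the subset dataset are placeholders:

import xarray as xr

from ocr.config import IcechunkConfig

icechunk_config = IcechunkConfig()  # assumes env-driven defaults
icechunk_config.init_repo()         # create the repo, or open it if it exists
icechunk_config.create_template()   # write the template dataset

subset_ds = xr.Dataset()  # placeholder; in practice the computed risk subset for the region
if 'y5_x14' not in icechunk_config.processed_regions(branch='main'):
    icechunk_config.insert_region_uncooperative(subset_ds, region_id='y5_x14', branch='main')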

OCRConfig

Bases: BaseSettings

Configuration settings for OCR processing.

Functions
pretty_paths
pretty_paths() -> None

Pretty print key OCRConfig paths and URIs.

This method intentionally touches cached properties that create directories (e.g., via mkdir) so you can verify real locations.

resolve_region_ids
resolve_region_ids(
    provided_region_ids: set[str], *, allow_all_processed: bool = False
) -> RegionIDStatus

Validate provided region IDs against valid + processed sets.

Parameters:

  • provided_region_ids (set[str]) –

    The set of region IDs to validate.

  • allow_all_processed (bool, default: False ) –

    If True, don't raise an error when all regions are already processed. This is useful for production reruns where you want to regenerate vector outputs even if icechunk regions are complete. Default is False.

Returns:

  • RegionIDStatus

    Status object with validation results.

Raises:

  • ValueError

    If no valid unprocessed region IDs remain and allow_all_processed is False.

select_region_ids
select_region_ids(
    region_ids: list[str] | None,
    *,
    all_region_ids: bool = False,
    allow_all_processed: bool = False,
) -> RegionIDStatus

Helper to pick the effective set of region IDs (all or user-provided) and return the validated status object.

Parameters:

  • region_ids (list[str] | None) –

    User-provided region IDs to process.

  • all_region_ids (bool, default: False ) –

    If True, use all valid region IDs instead of user-provided ones. Default is False.

  • allow_all_processed (bool, default: False ) –

    If True, don't raise an error when all regions are already processed. Passed through to resolve_region_ids. Default is False.

Returns:

  • RegionIDStatus

    Status object with validation results.
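
A hedged illustration of how the two helpers above might be used together with a loaded configuration:

from ocr.config import load_config

config = load_config(None)  # OCRConfig built from the current environment

# Validate an explicit set of region IDs
status = config.resolve_region_ids({'y5_x14', 'y10_x2'})

# Or let the helper pick every valid region ID
status_all = config.select_region_ids(None, all_region_ids=True, allow_all_processed=True)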

PyramidConfig

Bases: BaseSettings

Configuration for visualization pyramid / multiscales

Functions
model_post_init
model_post_init(__context)

Post-initialization to set up prefixes and URIs based on environment.

wipe
wipe()

Wipe the pyramid data storage.

VectorConfig

Bases: BaseSettings

Configuration for vector data processing.

Attributes
block_summary_stats_uri cached property
block_summary_stats_uri: UPath

URI for the block summary statistics file.

counties_summary_stats_uri cached property
counties_summary_stats_uri: UPath

URI for the counties summary statistics file.

tracts_summary_stats_uri cached property
tracts_summary_stats_uri: UPath

URI for the tracts summary statistics file.

Functions
model_post_init
model_post_init(__context)

Post-initialization to set up prefixes and URIs based on environment.

pretty_paths
pretty_paths() -> None

Pretty print key VectorConfig paths and URIs.

This method intentionally touches cached properties that create directories (e.g., via mkdir) so you can verify real locations.

upath_delete
upath_delete(path: UPath) -> None

Use UPath to handle deletion in a cloud-agnostic way

wipe
wipe()

Wipe the vector data storage.

Functions

load_config

load_config(file_path: Path | None) -> OCRConfig

Load OCR configuration from an env file (dotenv) or current environment.
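
A brief usage sketch; the .env path is a placeholder:

from pathlib import Path

from ocr.config import load_config

config = load_config(Path('.env'))  # from a dotenv file
config = load_config(None)          # or from the current environment
config.pretty_paths()               # print resolved paths and URIs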

Type definitions

Strongly typed enums for environment, platform, and risk types.

ocr.types

Classes

RiskType

Bases: str, Enum

Available risk types for calculation.


Data access

Datasets

Dataset and Catalog abstractions for Zarr and GeoParquet on S3/local storage.

ocr.datasets

Classes

Catalog

Bases: BaseModel

Base class for datasets catalog.

Functions
__repr__
__repr__() -> str

Return a string representation of the catalog.

__str__
__str__() -> str

Return a string representation of the catalog.

get_dataset
get_dataset(
    name: str,
    version: str | None = None,
    *,
    case_sensitive: bool = True,
    latest: bool = False,
) -> Dataset

Get a dataset by name and optionally version.

Parameters:

  • name (str) –

    Name of the dataset to retrieve

  • version (str, default: None ) –

    Specific version of the dataset. If not provided, returns the dataset if only one version exists, or raises an error if multiple versions exist, unless latest=True.

  • case_sensitive (bool, default: True ) –

    Whether to match dataset names case-sensitively

  • latest (bool, default: False ) –

    If True and version=None, returns the latest version instead of raising an error when multiple versions exist

Returns:

Raises:

  • ValueError

    If multiple versions exist and version is not specified (and latest=False)

  • KeyError

    If no matching dataset is found

Examples:

>>> # Get a dataset with a specific version
>>> catalog.get_dataset('conus-overture-buildings', 'v2025-03-19.1')
>>>
>>> # Get latest version of a dataset
>>> catalog.get_dataset('conus-overture-buildings', latest=True)

Dataset

Bases: BaseModel

Base class for datasets.

Functions
query_geoparquet
query_geoparquet(
    query: str | None = None, *, install_extensions: bool = True
) -> DuckDBPyRelation

Query a geoparquet file using DuckDB.

Parameters:

  • query (str, default: None ) –

    SQL query to execute. If not provided, returns all data.

  • install_extensions (bool, default: True ) –

    Whether to install and load the spatial and httpfs extensions.

Returns:

  • DuckDBPyRelation

    Result of the DuckDB query.

Raises:

  • ValueError

    If dataset is not in 'geoparquet' format.

Example

Example of querying buildings with a converted geometry column:

buildings = catalog.get_dataset('conus-overture-buildings', 'v2025-03-19.1')
result = buildings.query_geoparquet("""
    SELECT
        id,
        roof_material,
        geometry
    FROM read_parquet('{s3_path}')
    WHERE roof_material = 'concrete'
""")

Then convert to GeoDataFrame

gdf = buildings.to_geopandas("""
    SELECT
        id,
        roof_material,
        geometry
    FROM read_parquet('{s3_path}')
    WHERE roof_material = 'concrete'
""")

to_geopandas
to_geopandas(
    query: str | None = None,
    geometry_column='geometry',
    crs: str = 'EPSG:4326',
    target_crs: str | None = None,
    **kwargs,
) -> GeoDataFrame

Convert query results to a GeoPandas GeoDataFrame.

Parameters:

  • query (str, default: None ) –

    SQL query to execute. If not provided, returns all data.

  • geometry_column (str, default: 'geometry' ) –

    The name of the geometry column in the query result.

  • crs (str, default: 'EPSG:4326' ) –

    The coordinate reference system to use for the geometries.

  • target_crs (str, default: None ) –

    The target coordinate reference system to convert the geometries to.

  • **kwargs (dict, default: {} ) –

    Additional keyword arguments passed to query_geoparquet.

Returns:

  • GeoDataFrame

    A GeoPandas GeoDataFrame containing the queried data with geometries.

Raises:

  • ValueError

    If dataset is not in 'geoparquet' format or if the geometry column is not found.

Example

Example of converting buildings to GeoPandas GeoDataFrame - no need for ST_AsText():

buildings = catalog.get_dataset('conus-overture-buildings', 'v2025-03-19.1')
gdf = buildings.to_geopandas("""
    SELECT
        id,
        roof_material,
        geometry
    FROM read_parquet('{s3_path}')
    WHERE roof_material = 'concrete'
""")
gdf.head()

to_xarray
to_xarray(
    *,
    is_icechunk: bool | None = None,
    xarray_open_kwargs: dict | None = None,
    xarray_storage_options: dict | None = None,
) -> Dataset

Convert the dataset to an xarray.Dataset.

Parameters:

  • is_icechunk (bool | None, default: None ) –

    Whether to use icechunk to access the data. If True, only try icechunk; if None, try icechunk first and fall back to direct S3 access if it fails; if False, only use direct S3 access.

  • xarray_open_kwargs (dict, default: None ) –

    Additional keyword arguments to pass to xarray.open_dataset.

  • xarray_storage_options (dict, default: None ) –

    Storage options for S3 access when not using icechunk.

Returns:

Raises:
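
An illustrative access pattern for the methods above; catalog is assumed to be an existing ocr.datasets.Catalog instance, and 'some-zarr-dataset' is a hypothetical dataset name:

dataset = catalog.get_dataset('some-zarr-dataset', latest=True)

# Try icechunk first, fall back to direct S3 access if it fails
ds = dataset.to_xarray(is_icechunk=None)

# Or force direct S3 access with custom open arguments
ds = dataset.to_xarray(is_icechunk=False, xarray_open_kwargs={'chunks': {}})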

CONUS404 helpers

Load CONUS404 variables, compute relative humidity, wind rotation and diagnostics. Geographic selection utilities (point/bbox) with CRS-aware transforms.

ocr.conus404

Functions

compute_relative_humidity

compute_relative_humidity(ds: Dataset) -> DataArray

Compute relative humidity from specific humidity, temperature, and pressure.

Parameters:

  • ds (Dataset) –

    Input dataset containing 'Q2' (specific humidity), 'T2' (temperature in K), and 'PSFC' (pressure in Pa).

Returns:

  • hurs ( DataArray ) –

    Relative humidity as a percentage.

compute_wind_speed_and_direction

compute_wind_speed_and_direction(u10: DataArray, v10: DataArray) -> Dataset

Derive hourly wind speed (m/s) and direction (degrees from) using xclim.

Parameters:

  • u10 (DataArray) –

    U component of wind at 10 m (m/s).

  • v10 (DataArray) –

    V component of wind at 10 m (m/s).

Returns:

  • wind_ds ( Dataset ) –

    Dataset containing wind speed ('sfcWind') and wind direction ('sfcWindfromdir').

load_conus404

load_conus404(add_spatial_constants: bool = True) -> Dataset

Load the CONUS 404 dataset.

Parameters:

  • add_spatial_constants (bool, default: True ) –

    If True, adds spatial constant variables (SINALPHA, COSALPHA) to the dataset.

Returns:

  • ds ( Dataset ) –

    The CONUS 404 dataset.

rotate_winds_to_earth

rotate_winds_to_earth(ds: Dataset) -> tuple[DataArray, DataArray]

Rotate grid-relative 10 m winds (U10,V10) to earth-relative components. Uses SINALPHA / COSALPHA convention from WRF.

Parameters:

  • ds (Dataset) –

    Input dataset containing 'U10', 'V10', 'SINALPHA', and 'COSALPHA'.

Returns:

  • earth_u ( DataArray ) –

    Earth-relative U component of wind at 10 m.

  • earth_v ( DataArray ) –

    Earth-relative V component of wind at 10 m.
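
These helpers compose into a simple diagnostics workflow; a sketch using only the functions documented above:

from ocr.conus404 import (
    compute_relative_humidity,
    compute_wind_speed_and_direction,
    load_conus404,
    rotate_winds_to_earth,
)

ds = load_conus404(add_spatial_constants=True)

earth_u, earth_v = rotate_winds_to_earth(ds)                   # grid-relative -> earth-relative
wind_ds = compute_wind_speed_and_direction(earth_u, earth_v)   # 'sfcWind', 'sfcWindfromdir'
hurs = compute_relative_humidity(ds)                           # relative humidity (%)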


Utilities

General utilities

Helpers for DuckDB (extension loading, S3 secrets), vector sampling, and file transfer.

ocr.utils

Functions

apply_s3_creds

apply_s3_creds(region: str = 'us-west-2', *, con: Any | None = None) -> None

Register AWS credentials as a DuckDB SECRET on the given connection.

Parameters:

  • region (str, default: 'us-west-2' ) –

    AWS region used for S3 access.

  • con (DuckDBPyConnection | None, default: None ) –

    Connection to apply credentials to. If None, uses duckdb's default connection (duckdb.sql), preserving prior behavior.

bbox_tuple_from_xarray_extent

bbox_tuple_from_xarray_extent(
    ds: Dataset, x_name: str = 'x', y_name: str = 'y'
) -> tuple[float, float, float, float]

Creates a bounding box from an Xarray Dataset extent.

Parameters:

  • ds (Dataset) –

    Input Xarray Dataset

  • x_name (str, default: 'x' ) –

    Name of x coordinate, by default 'x'

  • y_name (str, default: 'y' ) –

    Name of y coordinate, by default 'y'

Returns:

  • tuple

    Bounding box tuple in the form: (x_min, y_min, x_max, y_max)

copy_or_upload

copy_or_upload(
    src: UPath,
    dest: UPath,
    overwrite: bool = True,
    chunk_size: int = 16 * 1024 * 1024,
) -> None

Copy a single file from src to dest using UPath/fsspec.

- Uses server-side copy if available on the same filesystem (e.g., s3->s3).
- Falls back to streaming copy otherwise.
- Creates destination parent directories when supported.

Parameters:

  • src (UPath) –

    Source UPath

  • dest (UPath) –

    Destination UPath (file path; if pointing to a directory-like path, src.name is appended)

  • overwrite (bool, default: True ) –

    If False, raises if dest exists

  • chunk_size (int, default: 16 * 1024 * 1024 ) –

    Buffer size for streaming copies

Returns:

  • None
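
A brief usage sketch; both paths are placeholders:

from upath import UPath

from ocr.utils import copy_or_upload

src = UPath('outputs/stats.parquet')                  # placeholder local path
dest = UPath('s3://my-bucket/outputs/stats.parquet')  # placeholder S3 path
copy_or_upload(src, dest, overwrite=True)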

extract_points

extract_points(gdf: GeoDataFrame, da: DataArray) -> DataArray

Extract/sample points from a GeoDataFrame to an Xarray DataArray.

Parameters:

  • gdf (GeoDataFrame) –

    Input geopandas GeoDataFrame. Geometry should be points

  • da (DataArray) –

    Input Xarray DataArray

Returns:

  • DataArray

    DataArray with geometry sampled

Notes

This operation may emit: UserWarning: Geometry is in a geographic CRS. Results from 'centroid' are likely incorrect. Use 'GeoSeries.to_crs()' to re-project geometries to a projected CRS before this operation.

The relatively small size of a building footprint means the centroid shifts only negligibly when computed in EPSG:4326 versus EPSG:5070.

TODO: Should/can this be a DataArray for typing?

geo_sel

geo_sel(
    ds: Dataset,
    *,
    lon: float | None = None,
    lat: float | None = None,
    bbox: tuple[float, float, float, float] | None = None,
    method: str = 'nearest',
    tolerance: float | None = None,
    crs_wkt: str | None = None,
)

Geographic selection helper.

Exactly one of the following must be provided:

- lon and lat
- lons and lats
- bbox=(west, south, east, north)

Parameters:

  • ds (Dataset) –

    Input dataset with x, y coordinates and a valid 'crs' variable with WKT

  • lon (float, default: None ) –

    Longitude of point to select, by default None

  • lat (float, default: None ) –

    Latitude of point to select, by default None

  • bbox (tuple, default: None ) –

    Bounding box to select (west, south, east, north), by default None

  • method (str, default: 'nearest' ) –

    Method to use for point selection, by default 'nearest'

  • tolerance (float, default: None ) –

    Tolerance (in units of the dataset's CRS) for point selection, by default None

  • crs_wkt (str, default: None ) –

    WKT string for the dataset's CRS. If None, attempts to read from ds.crs.attrs['crs_wkt'].

Returns:

  • Dataset

    Single point: time dimension only. Multiple points: adds a 'point' dimension. BBox: retains the y, x subset.
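
A hedged sketch of point and bounding-box selection; ds is assumed to be a dataset loaded elsewhere with x/y coordinates and a 'crs' variable carrying WKT:

from ocr.utils import geo_sel

# Single point, nearest-neighbour lookup
point = geo_sel(ds, lon=-120.5, lat=38.2, method='nearest')

# Bounding-box subset (west, south, east, north)
subset = geo_sel(ds, bbox=(-121.0, 37.5, -120.0, 38.5))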

get_temp_dir

get_temp_dir() -> Path | None

Get optimal temporary directory path for the current environment.

Returns the current working directory if running in /scratch (e.g., on Coiled clusters), otherwise returns None to use the system default temp directory.

On Coiled clusters, /scratch is bind-mounted directly to the NVMe disk, avoiding Docker overlay filesystem overhead and providing better I/O performance and more available space compared to /tmp which sits on the Docker overlay.

Returns:

  • Path | None

    Current working directory if in /scratch, None otherwise (uses system default).

Examples:

>>> import tempfile
>>> from ocr.utils import get_temp_dir
>>> with tempfile.TemporaryDirectory(dir=get_temp_dir()) as tmpdir:
...     # tmpdir will be in /scratch on Coiled, system temp otherwise
...     pass

install_load_extensions

install_load_extensions(
    aws: bool = True,
    spatial: bool = True,
    httpfs: bool = True,
    con: Any | None = None,
) -> None

Installs and applies duckdb extensions.

Parameters:

  • aws (bool, default: True ) –

    Install and load AWS extension, by default True

  • spatial (bool, default: True ) –

    Install and load SPATIAL extension, by default True

  • httpfs (bool, default: True ) –

    Install and load HTTPFS extension, by default True

  • con (DuckDBPyConnection | None, default: None ) –

    Connection to apply extensions to. If None, uses duckdb's default connection.
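
A sketch combining the two DuckDB helpers above with an explicit connection:

import duckdb

from ocr.utils import apply_s3_creds, install_load_extensions

con = duckdb.connect()
install_load_extensions(aws=True, spatial=True, httpfs=True, con=con)
apply_s3_creds(region='us-west-2', con=con)

# con can now query spatial data on S3, e.g. via read_parquet(...)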

Testing utilities

Snapshot testing extensions for xarray and GeoPandas.

ocr.testing

Classes

GeoDataFrameSnapshotExtension

Bases: SingleFileSnapshotExtension

Snapshot extension for GeoPandas GeoDataFrames stored as parquet.

Supports both local and remote (S3) storage via environment variable configuration:

- SNAPSHOT_STORAGE_PATH: Base path for snapshots (local or s3://bucket/path). Default: s3://carbonplan-scratch/snapshots (configured in tests/conftest.py)

Examples:

# Use default S3 storage (no env var needed)
pytest tests/test_snapshot.py --snapshot-update

# Override with local storage
SNAPSHOT_STORAGE_PATH=tests/__snapshots__ pytest tests/

# Override with different S3 bucket
SNAPSHOT_STORAGE_PATH=s3://my-bucket/snapshots pytest tests/
Functions
diff_lines
diff_lines(serialized_data: Any, snapshot_data: Any) -> Iterator[str]

Generate diff lines for test output.

dirname classmethod
dirname(*, test_location: PyTestLocation) -> str

Return the directory for storing snapshots.

get_location classmethod
get_location(*, test_location: PyTestLocation, index: SnapshotIndex = 0) -> str

Get the full snapshot location path.

Override to properly handle S3 paths using upath instead of os.path.join.

get_snapshot_name classmethod
get_snapshot_name(
    *, test_location: PyTestLocation, index: SnapshotIndex = 0
) -> str

Generate snapshot name based on test name.

Sanitizes the test name to replace problematic characters (e.g., brackets from parametrized tests) with underscores for valid file paths.

matches
matches(*, serialized_data: Any, snapshot_data: Any) -> bool

Check if serialized data matches snapshot using GeoDataFrame comparison.

read_snapshot_data_from_location
read_snapshot_data_from_location(
    *, snapshot_location: str, snapshot_name: str, session_id: str
) -> GeoDataFrame | None

Read parquet snapshot from disk.

serialize
serialize(data: SerializableData, **kwargs: Any) -> Any

Validate that data is a GeoDataFrame. Returns the data unchanged.

write_snapshot_collection classmethod
write_snapshot_collection(*, snapshot_collection: SnapshotCollection) -> None

Write snapshot collection to parquet format (local or remote).

XarraySnapshotExtension

Bases: SingleFileSnapshotExtension

Snapshot extension for xarray DataArrays and Datasets stored as zarr.

Supports both local and remote (S3) storage via environment variable configuration:

- SNAPSHOT_STORAGE_PATH: Base path for snapshots (local or s3://bucket/path). Default: s3://carbonplan-scratch/snapshots (configured in tests/conftest.py)

Examples:

# Use default S3 storage (no env var needed)
pytest tests/test_snapshot.py --snapshot-update

# Override with local storage
SNAPSHOT_STORAGE_PATH=tests/__snapshots__ pytest tests/

# Override with different S3 bucket
SNAPSHOT_STORAGE_PATH=s3://my-bucket/snapshots pytest tests/
Functions
diff_lines
diff_lines(serialized_data: Any, snapshot_data: Any) -> Iterator[str]

Generate diff lines for test output.

dirname classmethod
dirname(*, test_location: PyTestLocation) -> str

Return the directory for storing snapshots.

get_location classmethod
get_location(*, test_location: PyTestLocation, index: SnapshotIndex = 0) -> str

Get the full snapshot location path.

Override to properly handle S3 paths using upath instead of os.path.join.

get_snapshot_name classmethod
get_snapshot_name(
    *, test_location: PyTestLocation, index: SnapshotIndex = 0
) -> str

Generate snapshot name based on test name.

Sanitizes the test name to replace problematic characters (e.g., brackets from parametrized tests) with underscores for valid file paths.

matches
matches(*, serialized_data: Any, snapshot_data: Any) -> bool

Check if serialized data matches snapshot using approximate comparison.

Uses assert_allclose instead of assert_equal to handle platform-specific numerical differences from OpenCV and scipy operations between macOS and Linux.

read_snapshot_data_from_location
read_snapshot_data_from_location(
    *, snapshot_location: str, snapshot_name: str, session_id: str
) -> Dataset | None

Read zarr snapshot from disk.

serialize
serialize(data: SerializableData, **kwargs: Any) -> Any

Convert DataArray to Dataset for consistent zarr storage. Returns the data unchanged.

write_snapshot_collection classmethod
write_snapshot_collection(*, snapshot_collection: SnapshotCollection) -> None

Write snapshot collection to zarr format (local or remote).
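
Both extensions plug into syrupy; a hedged pytest sketch using syrupy's use_extension hook (fixture names and the test body are illustrative):

import pytest

from ocr.testing import GeoDataFrameSnapshotExtension, XarraySnapshotExtension


@pytest.fixture
def xr_snapshot(snapshot):
    # snapshot is syrupy's built-in fixture
    return snapshot.use_extension(XarraySnapshotExtension)


@pytest.fixture
def gdf_snapshot(snapshot):
    return snapshot.use_extension(GeoDataFrameSnapshotExtension)


def test_pipeline_output(xr_snapshot):
    ds = ...  # placeholder: an xarray.Dataset produced by the code under test
    assert ds == xr_snapshot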


Risk analysis

Fire risk

Core fire/wind risk utilities used by the pipeline (kernels, wind classification, risk composition).

ocr.risks.fire

Functions

apply_wind_directional_convolution

apply_wind_directional_convolution(
    da: DataArray,
    iterations: int = 3,
    kernel_size: float = 81.0,
    circle_diameter: float = 35.0,
) -> Dataset

Apply a directional convolution to a DataArray.

Parameters:

  • da (DataArray) –

    The DataArray to apply the convolution to.

  • iterations (int, default: 3 ) –

    The number of iterations to apply the convolution, by default 3

  • kernel_size (float, default: 81.0 ) –

    The size of the kernel, by default 81.0

  • circle_diameter (float, default: 35.0 ) –

    The diameter of the circle, by default 35.0

Returns:

  • ds ( Dataset ) –

    The Dataset with the directional convolution applied

calculate_wind_adjusted_risk

calculate_wind_adjusted_risk(
    *, x_slice: slice, y_slice: slice, buffer: float = 0.15
) -> Dataset

Calculate wind-adjusted fire risk using climate run and wildfire risk datasets.

Parameters:

  • x_slice (slice) –

    Slice object for selecting longitude range.

  • y_slice (slice) –

    Slice object for selecting latitude range.

  • buffer (float, default: 0.15 ) –

    Buffer size in degrees to add around the region for edge effect handling (default 0.15). For 30m EPSG:4326 data, 0.15 degrees ≈ 16.7 km ≈ 540 pixels. This buffer ensures neighborhood operations (convolution, Gaussian smoothing) have adequate context at boundaries.

Returns:

  • fire_risk ( Dataset ) –

    Dataset containing wind-adjusted fire risk variables.
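
A sketch connecting region IDs to the risk calculation, assuming region_id_slice_lookup (documented under ocr.config) returns (y_slice, x_slice) as described there:

from ocr.config import ChunkingConfig
from ocr.risks.fire import calculate_wind_adjusted_risk

chunking = ChunkingConfig()
y_slice, x_slice = chunking.region_id_slice_lookup('y5_x14')

fire_risk = calculate_wind_adjusted_risk(x_slice=x_slice, y_slice=y_slice, buffer=0.15)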

classify_wind_directions

classify_wind_directions(wind_direction_ds: DataArray) -> DataArray

Classify wind directions into 8 cardinal directions (0-7). The classification is:

0: North (337.5-22.5)
1: Northeast (22.5-67.5)
2: East (67.5-112.5)
3: Southeast (112.5-157.5)
4: South (157.5-202.5)
5: Southwest (202.5-247.5)
6: West (247.5-292.5)
7: Northwest (292.5-337.5)

Parameters:

  • wind_direction_ds (DataArray) –

    DataArray containing wind direction in degrees (0-360)

Returns:

  • result ( DataArray ) –

    DataArray with wind directions classified as integers 0-7
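
A small illustration on synthetic directions; the expected classes follow the mapping above:

import numpy as np
import xarray as xr

from ocr.risks.fire import classify_wind_directions

direction = xr.DataArray(np.array([10.0, 45.0, 180.0, 300.0]), dims='time')
classes = classify_wind_directions(direction)  # expected: 0 (N), 1 (NE), 4 (S), 7 (NW)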

compute_modal_wind_direction

compute_modal_wind_direction(distribution: DataArray) -> Dataset

Compute the modal wind direction from the wind direction distribution.

Parameters:

  • distribution (DataArray) –

    Wind direction distribution.

Returns:

  • mode ( Dataset ) –

    Modal wind direction.

compute_wind_direction_distribution

compute_wind_direction_distribution(
    direction: DataArray, fire_weather_mask: DataArray
) -> Dataset

Compute the wind direction distribution during fire weather conditions.

Parameters:

  • direction (DataArray) –

    Wind direction in degrees (0-360).

  • fire_weather_mask (DataArray) –

    Boolean mask indicating fire weather conditions.

Returns:

  • wind_direction_hist ( Dataset ) –

    Wind direction histogram during fire weather conditions.

create_weighted_composite_bp_map

create_weighted_composite_bp_map(
    bp: Dataset,
    wind_direction_distribution: DataArray,
    *,
    distribution_direction_dim: str = 'wind_direction',
    weight_sum_tolerance: float = 1e-05,
) -> DataArray

Create a weighted composite burn probability map using wind direction distribution.

Parameters:

  • bp (Dataset) –

    Dataset containing 9 directional burn probability layers with variables named ['N','NE','E','SE','S','SW','W','NW','circular'] produced by apply_wind_directional_convolution.

  • wind_direction_distribution (DataArray) –

    Probability distribution over 8 cardinal directions with dimension 'wind_direction' and length 8, matching direction labels: ['N','NE','E','SE','S','SW','W','NW'] (order must align). Values should sum to 1 where fire-weather hours exist; may be all 0 where none exist.

  • distribution_direction_dim (str, default: 'wind_direction' ) –

    Name of the dimension in wind_direction_distribution that holds the direction labels, by default 'wind_direction'.

  • weight_sum_tolerance (float, default: 1e-05 ) –

    Tolerance for deviation from 1.0 in the sum of weights, by default 1e-05.

Returns:

  • weighted ( DataArray ) –

    Weighted composite burn probability with same spatial dims as inputs. Name: 'wind_weighted_bp'. Missing (all-zero) distributions yield NaN.

create_wind_informed_burn_probability

create_wind_informed_burn_probability(
    wind_direction_distribution_30m_4326: DataArray, riley_270m_5070: Dataset
) -> DataArray

Create wind-informed burn probability dataset by applying directional convolution and creating a weighted composite burn probability map.

Parameters:

  • wind_direction_distribution_30m_4326 (DataArray) –

    Wind direction distribution data at 30m resolution in EPSG:4326 projection.

  • riley_270m_5070 (DataArray) –

    Riley et al. (2011) burn probability data at 270m resolution in EPSG:5070 projection.

Returns:

  • smoothed_final_bp ( DataArray ) –

    Smoothed wind-informed burn probability data at 30m resolution in EPSG:4326 projection.

direction_histogram

direction_histogram(data_array: DataArray) -> DataArray

Compute direction histogram on xarray DataArray with dask chunks.

Parameters:

  • data_array (DataArray) –

    Input data array containing direction indices (expected to be integers 0-7)

Returns:

  • DataArray

    Normalized histogram counts as a probability distribution

fosberg_fire_weather_index

fosberg_fire_weather_index(
    hurs: DataArray, T2: DataArray, sfcWind: DataArray
) -> DataArray

Calculate the Fosberg Fire Weather Index (FFWI) from relative humidity, temperature, and wind speed. Taken from wikifire.wsl.ch/tiki-indexb1d5.html?page=Fosberg+fire+weather+index&structure=Fire. hurs, T2, and sfcWind are arrays.

Parameters:

  • hurs (DataArray) –

    Relative humidity in percentage (0-100).

  • T2 (DataArray) –

    Temperature

  • sfcWind (DataArray) –

    Wind speed in meters per second.

Returns:

  • DataArray

    Fosberg Fire Weather Index (FFWI).

generate_weights

generate_weights(
    method: Literal['skewed', 'circular_focal_mean'] = 'skewed',
    kernel_size: float = 81.0,
    circle_diameter: float = 35.0,
) -> ndarray

Generate a 2D array of weights for a circular kernel.

Parameters:

  • method (str, default: 'skewed' ) –

    The method to use for generating weights. Options are 'skewed' or 'circular_focal_mean'. 'skewed' generates an elliptical kernel to simulate wind directionality. 'circular_focal_mean' generates a circular kernel, by default 'skewed'

  • kernel_size (float, default: 81.0 ) –

    The size of the kernel, by default 81.0

  • circle_diameter (float, default: 35.0 ) –

    The diameter of the circle, by default 35.0

Returns:

  • weights ( ndarray ) –

    A 2D array of weights for the circular kernel.

generate_wind_directional_kernels

generate_wind_directional_kernels(
    kernel_size: float = 81.0, circle_diameter: float = 35.0
) -> dict[str, ndarray]

Generate a dictionary of 2D arrays of weights for circular kernels oriented in different directions.

Parameters:

  • kernel_size (float, default: 81.0 ) –

    The size of the kernel, by default 81.0

  • circle_diameter (float, default: 35.0 ) –

    The diameter of the circle, by default 35.0

Returns:

  • kernels ( dict[str, ndarray] ) –

    A dictionary of 2D arrays of weights for circular kernels oriented in different directions.
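
A quick sketch inspecting the kernels produced by the two generators above:

from ocr.risks.fire import generate_weights, generate_wind_directional_kernels

weights = generate_weights(method='circular_focal_mean', kernel_size=81.0, circle_diameter=35.0)
kernels = generate_wind_directional_kernels(kernel_size=81.0, circle_diameter=35.0)

print(weights.shape)    # 2D numpy array of kernel weights
print(sorted(kernels))  # direction labels keyed in the dict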


Internal pipeline modules

Internal API

These modules are used internally by the pipeline and are not intended for direct public consumption. They are documented here for completeness and advanced use cases.

Batch managers

Orchestration backends for local and Coiled execution.

ocr.deploy.managers

Classes

AbstractBatchManager

Bases: BaseModel

Abstract base class for batch managers.

Functions
submit_job
submit_job(command: str, name: str, kwargs: dict[str, Any])

Submit a batch job.

wait_for_completion
wait_for_completion(exit_on_failure: bool = False)

Wait for all submitted batch jobs to complete.

CoiledBatchManager

Bases: AbstractBatchManager

Coiled batch manager for managing batch jobs.

Functions
submit_job
submit_job(command: str, name: str, kwargs: dict[str, Any]) -> str

Submit a job to Coiled batch.

Parameters:

  • command (str) –

    The command to run.

  • name (str) –

    The name of the job.

  • kwargs (dict) –

    Additional keyword arguments to pass to coiled.batch.run.

Returns:

  • job_id ( str ) –

    The ID of the submitted job.

wait_for_completion
wait_for_completion(exit_on_failure: bool = False)

Wait for all tracked jobs to complete.

Parameters:

  • exit_on_failure (bool, default: False ) –

    If True, raise an Exception immediately when a job failure is detected.

Returns:

  • completed, failed : tuple[set[str], set[str]]

    A tuple of (completed_job_ids, failed_job_ids). If exit_on_failure is True and a failure is encountered the method will raise before returning.

LocalBatchManager

Bases: AbstractBatchManager

Local batch manager for running jobs locally using subprocess.

Functions
__del__
__del__()

Clean up the executor when the manager is destroyed.

model_post_init
model_post_init(__context)

Initialize the thread pool executor after model creation.

submit_job
submit_job(command: str, name: str, kwargs: dict[str, Any]) -> str

Submit a job to run locally.

Parameters:

  • command (str) –

    The command to run.

  • name (str) –

    The name of the job.

  • kwargs (dict) –

    Additional keyword arguments to pass to subprocess.run.

Returns:

  • job_id ( str ) –

    The ID of the submitted job.

wait_for_completion
wait_for_completion(exit_on_failure: bool = False)

Wait for all tracked jobs to complete.

Parameters:

  • exit_on_failure (bool, default: False ) –

    If True, raise an Exception immediately when a job failure is detected.

Returns:

  • completed, failed : tuple[set[str], set[str]]

    A tuple of (completed_job_ids, failed_job_ids). If exit_on_failure is True and a failure is encountered the method will raise before returning.
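
An illustrative orchestration loop; the constructor fields for the managers are not documented here, so the instantiation below is an assumption:

from ocr.deploy.managers import CoiledBatchManager, LocalBatchManager

manager = LocalBatchManager()  # or CoiledBatchManager(); constructor fields assumed to default

for region_id in ['y5_x14', 'y10_x2']:
    manager.submit_job(
        command=f'ocr process-region {region_id}',
        name=f'process-{region_id}',
        kwargs={},
    )

completed, failed = manager.wait_for_completion(exit_on_failure=False)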

CLI application

Command-line interface exposed as the ocr command. For detailed usage and options, see the tutorials section.

ocr

Run OCR deployment pipeline on Coiled

Usage:

ocr [OPTIONS] COMMAND [ARGS]...

Options:

  --install-completion  Install completion for the current shell.
  --show-completion     Show completion for the current shell, to copy it or
                        customize the installation.
  --help                Show this message and exit.

Subcommands

ocr aggregate-region-risk-summary-stats

Generate time-horizon based statistical summaries for county and tract level PMTiles creation

Usage:

ocr aggregate-region-risk-summary-stats [OPTIONS]

Options:

  -e, --env-file PATH            Path to the environment variables file. These
                                 will be used to set up the OCRConfiguration
  -p, --platform [coiled|local]  If set, schedule this command on the
                                 specified platform instead of running inline.
  --vm-type TEXT                 Coiled VM type override (Coiled only).
                                 [default: c8g.16xlarge]
  --help                         Show this message and exit.

ocr create-building-pmtiles

Create PMTiles from the consolidated geoparquet file.

Usage:

ocr create-building-pmtiles [OPTIONS]

Options:

  -e, --env-file PATH            Path to the environment variables file. These
                                 will be used to set up the OCRConfiguration
  -p, --platform [coiled|local]  If set, schedule this command on the
                                 specified platform instead of running inline.
  --vm-type TEXT                 Coiled VM type override (Coiled only).
                                 [default: c8g.8xlarge]
  --disk-size INTEGER            Disk size in GB (Coiled only).  [default:
                                 250]
  --help                         Show this message and exit.

ocr create-pyramid

Create Pyramid

Usage:

ocr create-pyramid [OPTIONS]

Options:

  -e, --env-file PATH            Path to the environment variables file. These
                                 will be used to set up the OCRConfiguration
  -p, --platform [coiled|local]  If set, schedule this command on the
                                 specified platform instead of running inline.
  --vm-type TEXT                 Coiled VM type override (Coiled only).
                                 [default: m8g.16xlarge]
  --help                         Show this message and exit.

ocr create-regional-pmtiles

Create PMTiles for regional risk statistics (counties and tracts).

Usage:

ocr create-regional-pmtiles [OPTIONS]

Options:

  -e, --env-file PATH            Path to the environment variables file. These
                                 will be used to set up the OCRConfiguration
  -p, --platform [coiled|local]  If set, schedule this command on the
                                 specified platform instead of running inline.
  --vm-type TEXT                 Coiled VM type override (Coiled only).
                                 [default: c8g.8xlarge]
  --disk-size INTEGER            Disk size in GB (Coiled only).  [default:
                                 250]
  --help                         Show this message and exit.

ocr ingest-data

Ingest and process input datasets

Usage:

ocr ingest-data [OPTIONS] COMMAND [ARGS]...

Options:

  --help  Show this message and exit.

Subcommands

  • download: Download raw source data for a dataset.
  • list-datasets: List all available datasets that can be ingested.
  • process: Process downloaded data and upload to S3/Icechunk.
  • run-all: Run the complete pipeline: download, process, and cleanup.

ocr ingest-data download

Download raw source data for a dataset.

Usage:

ocr ingest-data download [OPTIONS] DATASET

Options:

  DATASET    Name of the dataset to download  [required]
  --dry-run  Preview operations without executing
  --debug    Enable debug logging
  --help     Show this message and exit.

ocr ingest-data list-datasets

List all available datasets that can be ingested.

Usage:

ocr ingest-data list-datasets [OPTIONS]

Options:

  --help  Show this message and exit.

ocr ingest-data process

Process downloaded data and upload to S3/Icechunk.

Usage:

ocr ingest-data process [OPTIONS] DATASET

Options:

  DATASET                       Name of the dataset to process  [required]
  --dry-run                     Preview operations without executing
  --use-coiled                  Use Coiled for distributed processing
  --software TEXT               Software environment to use (required if
                                --use-coiled is set)
  --debug                       Enable debug logging
  --overture-data-type TEXT     For overture-maps: which data to process
                                (buildings, addresses, or both)  [default:
                                both]
  --census-geography-type TEXT  For census-tiger: which geography to process
                                (blocks, tracts, counties, or all)  [default:
                                all]
  --census-subset-states TEXT   For census-tiger: subset of states to process
                                (e.g., California Oregon)
  --help                        Show this message and exit.

ocr ingest-data run-all

Run the complete pipeline: download, process, and cleanup.

Usage:

ocr ingest-data run-all [OPTIONS] DATASET

Options:

  DATASET                       Name of the dataset to process  [required]
  --dry-run                     Preview operations without executing
  --use-coiled                  Use Coiled for distributed processing
  --debug                       Enable debug logging
  --overture-data-type TEXT     For overture-maps: which data to process
                                (buildings, addresses, or both)  [default:
                                both]
  --census-geography-type TEXT  For census-tiger: which geography to process
                                (blocks, tracts, counties, or all)  [default:
                                all]
  --census-subset-states TEXT   For census-tiger: subset of states to process
                                (e.g., California Oregon)
  --help                        Show this message and exit.

ocr partition-buildings

Partition buildings geoparquet by state and county FIPS codes.

Usage:

ocr partition-buildings [OPTIONS]

Options:

  -e, --env-file PATH            Path to the environment variables file. These
                                 will be used to set up the OCRConfiguration
  -p, --platform [coiled|local]  If set, schedule this command on the
                                 specified platform instead of running inline.
  --vm-type TEXT                 Coiled VM type override (Coiled only).
                                 [default: c8g.12xlarge]
  --help                         Show this message and exit.

ocr process-region

Calculate and write risk for a given region to Icechunk CONUS template.

Usage:

ocr process-region [OPTIONS] REGION_ID

Options:

  -e, --env-file PATH            Path to the environment variables file. These
                                 will be used to set up the OCRConfiguration
  REGION_ID                      Region ID to process, e.g., y10_x2
                                 [required]
  -t, --risk-type [fire]         Type of risk to calculate  [default: fire]
  -p, --platform [coiled|local]  If set, schedule this command on the
                                 specified platform instead of running inline.
  --vm-type TEXT                 Coiled VM type override (Coiled only).
  --init-repo                    Initialize Icechunk repository (if not
                                 already initialized).
  --help                         Show this message and exit.

ocr run

Run the OCR deployment pipeline. This will process regions, aggregate geoparquet files, and create PMTiles layers for the specified risk type.

Usage:

ocr run [OPTIONS]

Options:

  -e, --env-file PATH             Path to the environment variables file.
                                  These will be used to set up the
                                  OCRConfiguration
  -r, --region-id TEXT            Region IDs to process, e.g., y10_x2
  --all-region-ids                Process all valid region IDs
  -t, --risk-type [fire]          Type of risk to calculate  [default: fire]
  --write-regional-stats          Write aggregated statistical summaries for
                                  each region (one file per region type with
                                  stats like averages, medians, percentiles,
                                  and histograms)
  --create-pyramid                Create ndpyramid / multiscale zarr for web-
                                  visualization
  -p, --platform [coiled|local]   Platform to run the pipeline on  [default:
                                  local]
  --wipe                          Wipe the icechunk and vector data storages
                                  before running the pipeline
  --dispatch-platform [coiled|local]
                                  If set, schedule this run command on the
                                  specified platform instead of running
                                  inline.
  --vm-type TEXT                  VM type override for dispatch-platform
                                  (Coiled only).
  --process-retries INTEGER RANGE
                                  Number of times to retry failed process-
                                  region tasks (Coiled only). 0 disables
                                   retries.  [default: 2; x>=0]
  --help                          Show this message and exit.

ocr write-aggregated-region-analysis-files

Write aggregated statistical summaries for each region (county and tract).

Creates one file per region type containing aggregated statistics for ALL regions, including building counts, average/median risk values, percentiles (p90, p95, p99), and histograms. Outputs in geoparquet, geojson, and csv formats.

Usage:

ocr write-aggregated-region-analysis-files [OPTIONS]

Options:

  -e, --env-file PATH            Path to the environment variables file. These
                                 will be used to set up the OCRConfiguration
  -p, --platform [coiled|local]  If set, schedule this command on the
                                 specified platform instead of running inline.
  --vm-type TEXT                 Coiled VM type override (Coiled only).
                                 [default: r8g.4xlarge]
  --help                         Show this message and exit.