Skip to content

Input datasets (technical reference + how-to)

This page documents the main input datasets used by OCR and shows how to access them programmatically via the ocr dataset catalog. Treat this as a technical reference for dataset names, example usage, and ingestion notes.

Accessing the catalog

from ocr import catalog

# List datasets
print(catalog)

# Load a dataset as an xarray / geopandas object
# rps_30 = catalog.get_dataset('USFS-wildfire-risk-communities').to_xarray()

Tensor data (raster / Zarr)

These are n-dimensional raster datasets stored in Zarr/Icechunk stores.

USFS Wildfire Risk to Communities

  • Source: USFS wildfire risk products (see USFS data catalog).
  • Ingested to: input-data/tensor/USFS_fire_risk/
  • Typical usage:
from ocr import catalog
rps_30 = catalog.get_dataset('USFS-wildfire-risk-communities').to_xarray()

USFS climate runs (2011 / 2047)

  • Source: probabilistic wildfire risk components for historical and future climates.
  • These are stored as zipped archives that the ingestion scripts expand into Icechunk stores.
climate_run_2011 = catalog.get_dataset('2011-climate-run-30m-4326').to_xarray()
climate_run_2047 = catalog.get_dataset('2047-climate-run-30m-4326').to_xarray()

Wind datasets

  • Wind datasets and versions may change; if you add or switch wind sources, update the ingestion script under input-data/ and register the new dataset with the ocr catalog.

Vector data

Vector data are building footprints, administrative boundaries, and other GIS vector layers used for exposure and aggregation.

Overture buildings

  • Source: Overture building datasets (see Overture docs)
  • Ingested subset for CONUS in input-data/vector/overture_vector/
conus_buildings = catalog.get_dataset('conus-overture-buildings')

Ingestion notes

  • All ingestion scripts live in input-data/. When adding a new dataset:
    1. Add a script under input-data/ that downloads, preprocesses, and writes data to an Icechunk store or geoparquet.
    2. Add a registration entry in the ocr catalog so catalog.get_dataset(name) returns a usable object.
    3. Add a short How-to in docs/how-to/ describing provenance and any license constraints, and an explanatory note in docs/explanations/ if needed.

Contact the maintainers if you need access to private data buckets or credentials to download certain datasets.