Input datasets (technical reference + how-to)
This page documents the main input datasets used by OCR and shows how to access them programmatically via the ocr dataset catalog. Treat this as a technical reference for dataset names, example usage, and ingestion notes.
Accessing the catalog
from ocr import catalog
# List datasets
print(catalog)
# Load a dataset as an xarray / geopandas object
# rps_30 = catalog.get_dataset('USFS-wildfire-risk-communities').to_xarray()
Tensor data (raster / Zarr)
These are n-dimensional raster datasets stored in Zarr/Icechunk stores.
USFS Wildfire Risk to Communities
- Source: USFS wildfire risk products (see USFS data catalog).
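- Ingested to: input-data/tensor/USFS_fire_risk/
- Typical usage: load through the catalog with catalog.get_dataset('USFS-wildfire-risk-communities').to_xarray(), as in the commented example under Accessing the catalog.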
USFS climate runs (2011 / 2047)
- Source: probabilistic wildfire risk components for historical and future climates.
- These are stored as zipped archives that the ingestion scripts expand into Icechunk stores.
climate_run_2011 = catalog.get_dataset('2011-climate-run-30m-4326').to_xarray()
climate_run_2047 = catalog.get_dataset('2047-climate-run-30m-4326').to_xarray()
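When several catalog entries are needed together, as with the two climate runs above, the calls can be wrapped in a small helper. The following is a minimal sketch and not part of ocr: load_datasets is a hypothetical name, and it assumes only that the catalog object exposes get_dataset(name) returning something with to_xarray(), as in the calls above. A stand-in catalog is included so the sketch runs without ocr installed or access to the data stores.

```python
from typing import Any, Dict, Iterable


def load_datasets(catalog: Any, names: Iterable[str]) -> Dict[str, Any]:
    """Load several catalog entries, keyed by dataset name.

    Hypothetical helper: relies only on catalog.get_dataset(name).to_xarray().
    """
    return {name: catalog.get_dataset(name).to_xarray() for name in names}


class _StubEntry:
    """Stand-in for a catalog entry; returns a placeholder instead of real data."""

    def __init__(self, name: str) -> None:
        self.name = name

    def to_xarray(self) -> str:
        return f"<placeholder for {self.name}>"


class _StubCatalog:
    """Stand-in for the ocr catalog, used only to make this sketch runnable."""

    def get_dataset(self, name: str) -> _StubEntry:
        return _StubEntry(name)


# Dataset names taken from the examples above.
CLIMATE_RUNS = ["2011-climate-run-30m-4326", "2047-climate-run-30m-4326"]

runs = load_datasets(_StubCatalog(), CLIMATE_RUNS)
for name, data in runs.items():
    print(name, "->", data)
```

With ocr installed, the real catalog can be passed in place of the stand-in: import it with from ocr import catalog and call load_datasets(catalog, CLIMATE_RUNS).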
Wind datasets
- Wind datasets and versions may change; if you add or switch wind sources, update the ingestion script under input-data/ and register the new dataset with the ocr catalog.
Vector data
Vector data include building footprints, administrative boundaries, and other GIS vector layers used for exposure and aggregation.
Overture buildings
- Source: Overture building datasets (see Overture docs)
- Ingested subset for CONUS in input-data/vector/overture_vector/
Ingestion notes
- All ingestion scripts live in input-data/. When adding a new dataset:
  - Add a script under input-data/ that downloads, preprocesses, and writes data to an Icechunk store or geoparquet.
  - Add a registration entry in the ocr catalog so catalog.get_dataset(name) returns a usable object.
  - Add a short How-to in docs/how-to/ describing provenance and any license constraints, and an explanatory note in docs/explanations/ if needed.
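The registration step in the checklist above can be pictured with a toy registry. This is a minimal sketch under stated assumptions, not the ocr catalog's implementation: ToyCatalog, ToyDataset, register, and the example dataset name and path are all hypothetical. It only illustrates why a dataset has to be registered before get_dataset(name) can return it.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class ToyDataset:
    """Hypothetical catalog entry: a name, a storage location, and a data kind."""

    name: str
    path: str
    kind: str  # "tensor" or "vector"


class ToyCatalog:
    """Hypothetical name-to-entry registry; not the real ocr catalog."""

    def __init__(self) -> None:
        self._entries: Dict[str, ToyDataset] = {}

    def register(self, dataset: ToyDataset) -> None:
        # Refuse silent overwrites so a name always maps to one dataset.
        if dataset.name in self._entries:
            raise ValueError(f"dataset already registered: {dataset.name}")
        self._entries[dataset.name] = dataset

    def get_dataset(self, name: str) -> ToyDataset:
        # Unregistered names fail with a message listing what is available.
        try:
            return self._entries[name]
        except KeyError:
            known = ", ".join(sorted(self._entries)) or "none"
            raise KeyError(f"unknown dataset {name!r}; registered: {known}") from None


toy_catalog = ToyCatalog()
# Hypothetical name and path, for illustration only.
toy_catalog.register(
    ToyDataset(name="example-wind", path="input-data/tensor/example_wind/", kind="tensor")
)
print(toy_catalog.get_dataset("example-wind").path)
```

Rejecting duplicate names and listing the registered names on a failed lookup are design choices of this sketch; the real catalog may behave differently.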
Contact the maintainers if you need access to private data buckets or credentials to download certain datasets.