Access OCR Output Data¶
This guide shows how to access OCR's wind-adjusted fire risk output data using Icechunk, a versioned data format for cloud-native geospatial data.
Prerequisites¶
- Python environment with
xarray,icechunk,duckdbandlonboardinstalled.
What you'll learn¶
- How to open and inspect the fire risk raster dataset
- How to work with specific variables and spatial subsets
- How to open and subset the raster sampled building vector dataset
Raster / Xarray¶
Import required libraries¶
import icechunk
import xarray as xr
Connect to the Icechunk repository¶
OCR's production fire risk data is stored in an Icechunk repository on S3. We'll connect to version v0.9.0 of the wind-adjusted fire risk output. For valid versions, check out the OCR's GitHub release page.
# Configure S3 storage for the Icechunk repository
version = 'v0.12.0'
storage = icechunk.s3_storage(
bucket='us-west-2.opendata.source.coop',
prefix=f'carbonplan/carbonplan-ocr/output/fire-risk/tensor/production/{version}/ocr.icechunk',
anonymous=True,
)
# Open the repository
repo = icechunk.Repository.open(storage)
# Create a read-only session on the main branch
session = repo.readonly_session('main')
Open the dataset with xarray¶
# Open the dataset
ds = xr.open_dataset(session.store, engine='zarr', chunks={})
ds
Explore the data variables¶
The dataset contains wind-adjusted fire risk metrics. Let's examine the available variables:
# List all data variables
print('Data variables:')
for var in ds.data_vars:
print(f' - {var}: {ds[var].attrs.get("long_name", "No description")}')
Select a spatial subset¶
Extract data for a specific geographic region using coordinate slicing:
# Example: Select data for California region
california_subset = ds.sel(
latitude=slice(42, 32), # Southern to Northern California
longitude=slice(-125, -114), # Western to Eastern California
)
california_subset
Vector building dataset¶
Next we will open and query the raster sampled building dataset stored in the geoparquet format.
Import required libraries¶
Here we'll import duckdb and load the spatial extension. The combination of duckdb spatial and geoparquet allows us to perform GIS queries that are beyond desktop GIS capabilities.
import duckdb
duckdb.sql("""INSTALL SPATIAL; LOAD SPATIAL; INSTALL HTTPFS; LOAD HTTPFS""")
dataset_uri = f's3://us-west-2.opendata.source.coop/carbonplan/carbonplan-ocr/output/fire-risk/vector/production/{version}/geoparquet/buildings.parquet/**'
Examine the dataset¶
- The SQL
describecommand shows us that the dataset contains raster sampled variables as well as geometry columns and regional identifiers.
duckdb.sql(f"""
DESCRIBE
SELECT
*
FROM
read_parquet('{dataset_uri}', hive_partitioning = TRUE)
""")
Load the first few rows¶
Using the SQL LIMIT command, we can get just the first few rows of the dataset.
duckdb.sql(f"""
SELECT
*
FROM
read_parquet('{dataset_uri}', hive_partitioning = TRUE) LIMIT 5""")
Subset by state and county¶
This geoparquet is partitioned by state and county using FIPS codes.
Let's select data in LA County in California.
CA_FIPS_CODE = '06'
LA_FIPS_CODE = '037'
Get a count of the number of records in LA County¶
Using the COUNT SQL syntax, we can get the total number of entires in our query.
duckdb.sql(f"""
SELECT
COUNT(*) AS LA_building_count
FROM
read_parquet('{dataset_uri}', hive_partitioning = TRUE)
WHERE
state_fips = '{CA_FIPS_CODE}'
AND county_fips = '{LA_FIPS_CODE}'""")
Subset by bounding box¶
Looks like there are about 3 million building polygons in our query. Let's subset that further.
We can get a bounding box (bbox) for an area around the Palisades fire.
palisades_bbox = (-118.761864, 34.026381, -118.466263, 34.152972)
palisades_query = duckdb.sql(f"""
SELECT
wind_risk_2011,
geometry
FROM
read_parquet('{dataset_uri}', hive_partitioning = TRUE)
WHERE
state_fips = '{CA_FIPS_CODE}'
AND county_fips = '{LA_FIPS_CODE}'
AND bbox.xmin BETWEEN {palisades_bbox[0]} AND {palisades_bbox[2]}
AND bbox.ymin BETWEEN {palisades_bbox[1]} AND {palisades_bbox[3]}""")
from lonboard import Map, PolygonLayer
from lonboard.colormap import apply_continuous_cmap
from matplotlib import colormaps
layer = PolygonLayer.from_duckdb(
palisades_query,
get_fill_color=apply_continuous_cmap(
palisades_query.fetchdf()['wind_risk_2011'], colormaps['YlOrRd']
),
pickable=True,
)
m = Map(layer, show_tooltip=True)
m