Defining a categorical scoring system¶
This notebook explores how we translate a continuous system of risk values into a 11-step categorical scoring system.
import duckdb
import matplotlib.pyplot as plt
import numpy as np
from scipy import stats
duckdb.sql("""INSTALL SPATIAL; LOAD SPATIAL; INSTALL HTTPFS; LOAD HTTPFS""")
%%time
version = '0.13.1'
dataset_uri = f's3://carbonplan-ocr/output/fire-risk/vector/production/v{version}/geoparquet/buildings.parquet/**'
df = duckdb.sql(f"""
SELECT wind_risk_2011, wind_risk_2047
FROM read_parquet('{dataset_uri}', hive_partitioning = TRUE)
""").df()
df.head()
FloatProgress(value=0.0, layout=Layout(width='auto'), style=ProgressStyle(bar_color='black'))
CPU times: user 3min 59s, sys: 14.3 s, total: 4min 13s Wall time: 26.1 s
| wind_risk_2011 | wind_risk_2047 | |
|---|---|---|
| 0 | 0.011663 | 0.011582 |
| 1 | 0.021331 | 0.011899 |
| 2 | 0.023750 | 0.032104 |
| 3 | 0.035560 | 0.017434 |
| 4 | 0.012157 | 0.012022 |
The distribution of risk values in this dataset is skewed with a very long tail at higher levels of risk. This skewed distribution can make it hard to distinguish both differences among risk levels at both higher and lower levels of risk. We overcome this, we design a scoring system with bins whose relative prevalence of buildings descend monotonically with increasing risk scores. We design the bins like this so that we can distinguish among risk values across the highly heterogeneous domain where risk values span multiple orders of magnitude.
Our approach is inspired by the WRC's increasingly fine percentile increments at higher levels of risk, as described in the supplemental materials for the Wildfire Risk to Communities project, specifically the WRC_V2_DataPercentiles.xlsx file found here. The WRC project uses "class breakpoints at 40th, 70th, 90th, and 95th percentile." According to their documentation, that translates to breakpoints at the following values of RPS: 0.0188485, 0.0940430, 0.4081965, 0.7222983. However, they based their percentiles and breakpoints on pixels as opposed to buildings, so they are in appropriate for us to use out-of-the-box. Given our focus on buildings, we will design our bins according to the distribuiton of risk values for buildings.
We began with drafting a set of round number percentile scale breakpoints which grow increasingly close together as they near 100: [75, 83, 88, 93, 96, 98, 99, 99.9, 99.99].
percentiles = np.array([75, 82.5, 88, 92.5, 96, 98, 99.5, 99.9, 99.99])
We then calculate the RPS value for each of these percentiles.
bins = np.percentile(df[['wind_risk_2011']].values, percentiles)
bins = np.insert(bins, 0, 0, axis=0)
bins = np.insert(bins, 10, 100, axis=0)
bins
array([0.00000000e+00, 1.12076905e-02, 2.12524687e-02, 3.51522379e-02,
5.99227641e-02, 1.12777308e-01, 2.04892129e-01, 5.50125688e-01,
1.25295286e+00, 2.93563206e+00, 1.00000000e+02])
All of this is probably easier shown than explained. When we use the bins calculated above to assign risk scores to the buildings in our dataset, the distribution of risk scores looks like below.
values = df['wind_risk_2011'].values
counts, edges = np.histogram(values, bins=bins)
x = np.arange(len(counts))
width = 0.8
fig, (ax1, ax2) = plt.subplots(2, 1, figsize=(10, 4))
ax1.bar(x, counts, width=width, align='edge', edgecolor='black')
ax2.bar(x, counts, width=width, align='edge', edgecolor='black')
ax1.set_ylim(1e8, 1.2e8) # outliers only
ax2.set_ylim(0, 0.2e8)
labels = [f'{bins[i]:.3f}\nScore {j}' for i, j in enumerate(range(len(bins) - 1))]
ax2.set_xticks(x, labels) # , ha='right')
d = 0.015 # how big to make the diagonal lines in axes coordinates
# arguments to pass to plot, just so we don't keep repeating them
kwargs = dict(transform=ax1.transAxes, color='k', clip_on=False)
ax1.plot((-d, +d), (-d, +d), **kwargs) # top-left diagonal
ax1.plot((1 - d, 1 + d), (-d, +d), **kwargs) # top-right diagonal
kwargs.update(transform=ax2.transAxes) # switch to the bottom axes
ax2.plot((-d, +d), (1 - d, 1 + d), **kwargs) # bottom-left diagonal
ax2.plot((1 - d, 1 + d), (1 - d, 1 + d), **kwargs) # bottom-right diagonal
ax1.xaxis.tick_top()
ax1.tick_params(labeltop=False) # don't put tick labels at the top
ax2.xaxis.tick_bottom()
# plt.ylim(0,0.2e8)
ax1.spines['bottom'].set_visible(False)
ax2.spines['top'].set_visible(False)
ax2.set_xlabel('Value range')
ax2.set_ylabel('Count')
ax1.set_title('Counts of buildings in each score bin')
# plt.tight_layout()
# plt.show()
Text(0.5, 1.0, 'Counts of buildings in each score bin')
Let's update the bin breakpoints to rounder numbers to support interpretability. Then, let's confirm that the rounding update still preserves general pattern of higher risk scores being less prevalent.
rounded_bins = [0.01, 0.02, 0.035, 0.06, 0.1, 0.2, 0.5, 1, 3]
We then assess the original percentiles from these rounded bins to ensure that the percentiles still decrease monotonically, with breakpoints that grow increasingly close together.
percentiles = stats.percentileofscore(df['wind_risk_2011'].values, rounded_bins)
print(percentiles)
[73.70565757 81.79529099 87.95673856 92.50876686 95.4481279 97.9411599 99.41502368 99.83280266 99.99057545]
bins = rounded_bins
bins = np.insert(bins, 0, 0, axis=0)
bins = np.insert(bins, 10, 100, axis=0)
As shown below, the bin breakpoints at higher risk levels are more spaced out (and more clustered at lower risk levels), while the corresponding percentiles are more clustered at higher risk levels and more spaced out at lower risk levels.
print(bins)
print(percentiles)
[0.0e+00 1.0e-02 2.0e-02 3.5e-02 6.0e-02 1.0e-01 2.0e-01 5.0e-01 1.0e+00 3.0e+00 1.0e+02] [73.70565757 81.79529099 87.95673856 92.50876686 95.4481279 97.9411599 99.41502368 99.83280266 99.99057545]
values = df['wind_risk_2011'].values
counts, edges = np.histogram(values, bins=bins)
x = np.arange(len(counts))
width = 0.8
fig, (ax1, ax2) = plt.subplots(2, 1, figsize=(10, 4))
ax1.bar(x, counts, width=width, align='edge', edgecolor='black')
ax2.bar(x, counts, width=width, align='edge', edgecolor='black')
ax1.set_ylim(1e8, 1.2e8) # outliers only
ax2.set_ylim(0, 0.2e8)
labels = [f'{bins[i]:.3f}\nScore {j}' for i, j in enumerate(range(len(bins) - 1))]
ax2.set_xticks(x, labels) # , ha='right')
d = 0.015 # how big to make the diagonal lines in axes coordinates
# arguments to pass to plot, just so we don't keep repeating them
kwargs = dict(transform=ax1.transAxes, color='k', clip_on=False)
ax1.plot((-d, +d), (-d, +d), **kwargs) # top-left diagonal
ax1.plot((1 - d, 1 + d), (-d, +d), **kwargs) # top-right diagonal
kwargs.update(transform=ax2.transAxes) # switch to the bottom axes
ax2.plot((-d, +d), (1 - d, 1 + d), **kwargs) # bottom-left diagonal
ax2.plot((1 - d, 1 + d), (1 - d, 1 + d), **kwargs) # bottom-right diagonal
ax1.xaxis.tick_top()
ax1.tick_params(labeltop=False) # don't put tick labels at the top
ax2.xaxis.tick_bottom()
# plt.ylim(0,0.2e8)
ax1.spines['bottom'].set_visible(False)
ax2.spines['top'].set_visible(False)
ax2.set_xlabel('Value range')
ax2.set_ylabel('Count')
ax1.set_title('Counts of buildings in each score bin')
# plt.tight_layout()
plt.show()
Then let's see what the risk values look like for the future scenario.
values = df['wind_risk_2047'].values
counts, edges = np.histogram(values, bins=bins)
x = np.arange(len(counts))
width = 0.8
fig, (ax1, ax2) = plt.subplots(2, 1, figsize=(10, 4))
ax1.bar(x, counts, width=width, align='edge', edgecolor='black')
ax2.bar(x, counts, width=width, align='edge', edgecolor='black')
ax1.set_ylim(1e8, 1.2e8) # outliers only
ax2.set_ylim(0, 0.2e8)
labels = [f'{bins[i]:.3f}\nScore {j}' for i, j in enumerate(range(len(bins) - 1))]
ax2.set_xticks(x, labels) # , ha='right')
d = 0.015 # how big to make the diagonal lines in axes coordinates
# arguments to pass to plot, just so we don't keep repeating them
kwargs = dict(transform=ax1.transAxes, color='k', clip_on=False)
ax1.plot((-d, +d), (-d, +d), **kwargs) # top-left diagonal
ax1.plot((1 - d, 1 + d), (-d, +d), **kwargs) # top-right diagonal
kwargs.update(transform=ax2.transAxes) # switch to the bottom axes
ax2.plot((-d, +d), (1 - d, 1 + d), **kwargs) # bottom-left diagonal
ax2.plot((1 - d, 1 + d), (1 - d, 1 + d), **kwargs) # bottom-right diagonal
ax1.xaxis.tick_top()
ax1.tick_params(labeltop=False) # don't put tick labels at the top
ax2.xaxis.tick_bottom()
# plt.ylim(0,0.2e8)
ax1.spines['bottom'].set_visible(False)
ax2.spines['top'].set_visible(False)
ax2.set_xlabel('Value range')
ax2.set_ylabel('Count')
ax1.set_title('Counts of buildings in each score bin')
# plt.tight_layout()
plt.show()