Project structure

This page documents the OCR repository layout and explains the purpose of key directories and files. Use this as a technical reference when contributing code, adding datasets, or extending documentation.


Repository overview

The OCR platform is organized into distinct layers: the core Python package (ocr/), supporting infrastructure (configuration, deployment, testing), input data management, exploratory research notebooks, and comprehensive documentation. The structure follows best practices for scientific Python projects, with an emphasis on reproducibility, modularity, and cloud-native execution.

```mermaid
graph TB
    subgraph Repository["OCR Repository"]
        subgraph Core["Core Package"]
            OCR[ocr/]
            OCR --> CONFIG[config.py - Configuration models]
            OCR --> TYPES[types.py - Type definitions]
            OCR --> DATASETS[datasets.py - Data catalog]
            OCR --> CONUS[conus404.py - Climate data]
            OCR --> UTILS[utils.py - Utilities]
            OCR --> TESTING[testing.py - Test helpers]

            OCR --> DEPLOY[deploy/]
            DEPLOY --> CLI[cli.py - CLI app]
            DEPLOY --> MANAGERS[managers.py - Orchestration]

            OCR --> PIPELINE[pipeline/]
            PIPELINE --> PROCESS[process_region.py]
            PIPELINE --> PARTITION[partition.py]
            PIPELINE --> STATS[fire_wind_risk_regional_aggregator.py]
            PIPELINE --> PMTILES[create_building_pmtiles.py + create_regional_pmtiles.py]
            PIPELINE --> WRITERS[write_*_analysis_files.py]

            OCR --> RISKS[risks/]
            RISKS --> FIRE[fire.py - Risk models]
        end

        subgraph Data["Data & Inputs"]
            INPUT[input-data/]
            INPUT --> TENSOR[tensor/ - CONUS404, USFS fire risk]
            INPUT --> VECTOR[vector/ - Buildings, regions, structures]
        end

        subgraph Research["Research & Exploration"]
            NOTEBOOKS[notebooks/]
            NOTEBOOKS --> NB1[Wind analysis notebooks]
            NOTEBOOKS --> NB2[Fire risk kernels]
            NOTEBOOKS --> NB3[Scaling experiments]
        end

        subgraph Docs["Documentation"]
            DOCSDIR[docs/]
            DOCSDIR --> HOWTO[how-to/ - Guides]
            DOCSDIR --> METHODS[methods/ - Science docs]
            DOCSDIR --> REFERENCE[reference/ - API & specs]
            DOCSDIR --> TUTORIALS[tutorials/ - Workflows]
        end

        subgraph Testing["Testing & QA"]
            TESTS[tests/]
            TESTS --> UNIT[Unit tests]
            TESTS --> INTEGRATION[Integration tests]
            TESTS --> SNAPSHOTS[Snapshot tests]
        end

        subgraph Config["Configuration & Build"]
            PYPROJECT[pyproject.toml - Package config]
            PIXI[pixi.lock - Environment lock]
            MKDOCS[mkdocs.yml - Docs config]
            ENV[ocr-*.env - Environment vars]
            GITHUB[.github/ - CI/CD workflows]
        end

        subgraph Infra["Infrastructure"]
            BUCKET[bucket_creation/ - S3 setup]
        end
    end

    style Core fill:#e1f5ff
    style Data fill:#fff4e1
    style Research fill:#f3e8ff
    style Docs fill:#e8f5e9
    style Testing fill:#ffebee
    style Config fill:#fafafa
    style Infra fill:#fff9c4
```
Hold "Alt" / "Option" to enable pan & zoom

Core package (ocr/)

Contains all production code organized into logical modules:

Top-level modules

| Module | Purpose |
| --- | --- |
| config.py | Pydantic models for storage, chunking, Coiled, and processing configuration |
| types.py | Type definitions and enums (Environment, Platform, RiskType, RegionType) |
| datasets.py | Catalog abstraction for Zarr and GeoParquet datasets in S3 storage |
| conus404.py | CONUS404 climate data helpers: load variables, compute humidity, wind transformations |
| utils.py | DuckDB utilities, S3 secrets, vector sampling, file transfer helpers |
| testing.py | Snapshot testing extensions for xarray and GeoPandas |
| console.py | Rich console instance for pretty terminal output |
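
As a rough illustration of the pattern the table above describes, the sketch below pairs a Pydantic settings model with a typed enum. Apart from the Environment name, which appears in the table, the classes, fields, and enum members are hypothetical stand-ins, not the actual ocr.config or ocr.types API.

```python
# Illustrative only: StorageConfig and its fields are assumptions; Environment
# mirrors the enum named above, but its members are guesses for this sketch.
from enum import Enum

from pydantic import BaseModel


class Environment(str, Enum):
    local = "local"
    staging = "staging"
    production = "production"


class StorageConfig(BaseModel):
    """Group storage settings so they can be validated and serialized together."""

    bucket: str
    prefix: str = "ocr"
    environment: Environment = Environment.local


# String values are validated and coerced to the enum on construction.
config = StorageConfig(bucket="example-bucket", environment="staging")
```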

Deployment (deploy/)

Orchestration layer for local and cloud execution (an illustrative sketch of the manager pattern follows the list):

  • cli.py - Typer-based CLI application (ocr command) with commands for processing regions, aggregation, PMTiles generation, and analysis file creation
  • managers.py - Abstract batch manager interface with CoiledBatchManager (cloud) and LocalBatchManager (local) implementations
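
The split between a local and a Coiled-backed manager can be pictured with a minimal abstract-base-class sketch. The CoiledBatchManager and LocalBatchManager names come from the list above, but the submit method and its signature are assumptions, not the real managers.py interface.

```python
# Sketch of the orchestration pattern; method names and signatures are assumed.
import subprocess
from abc import ABC, abstractmethod


class BatchManager(ABC):
    """Dispatch a pipeline step either locally or to a cloud batch backend."""

    @abstractmethod
    def submit(self, command: list[str]) -> None: ...


class LocalBatchManager(BatchManager):
    def submit(self, command: list[str]) -> None:
        # Run the step as a subprocess on the current machine.
        subprocess.run(command, check=True)


class CoiledBatchManager(BatchManager):
    def submit(self, command: list[str]) -> None:
        # The real implementation hands the command to Coiled; the exact call
        # is project-specific and omitted here.
        raise NotImplementedError
```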

Pipeline (pipeline/)

Internal processing modules coordinated by the CLI. These implement the data processing workflow; a sketch of the sampling step follows the list:

  • process_region.py - Sample risk values to building locations
  • partition.py - Partition GeoParquet by geographic regions
  • fire_wind_risk_regional_aggregator.py - Compute regional statistics with DuckDB
  • create_building_pmtiles.py - Generate PMTiles for building data visualization
  • create_regional_pmtiles.py - Generate region-specific PMTiles
  • write_aggregated_region_analysis_files.py - Write regional summary tables
  • write_per_region_analysis_files.py - Write per-region analysis outputs (CSV/JSON)
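
The first step (sampling gridded risk values onto building locations) can be sketched with xarray's vectorized nearest-neighbour selection. The function and column names below are hypothetical rather than the actual process_region.py implementation, and the sketch assumes the raster and the buildings share a coordinate reference system.

```python
# Illustrative sketch; names are hypothetical and CRSs are assumed to match.
import geopandas as gpd
import xarray as xr


def sample_risk_to_buildings(risk: xr.DataArray, buildings: gpd.GeoDataFrame) -> gpd.GeoDataFrame:
    """Attach the nearest grid-cell risk value to each building centroid."""
    points = buildings.geometry.centroid
    sampled = risk.sel(
        x=xr.DataArray(points.x.to_numpy(), dims="points"),
        y=xr.DataArray(points.y.to_numpy(), dims="points"),
        method="nearest",
    )
    out = buildings.copy()
    out["risk"] = sampled.to_numpy()
    return out
```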

Risk models (risks/)

Domain-specific risk calculation logic (an illustrative kernel sketch follows the list):

  • fire.py - Fire/wind risk kernels, wind classification, elliptical spread models
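
An elliptical, wind-elongated kernel can be illustrated with a short NumPy sketch: a Gaussian stretched along the wind direction and normalized to sum to one. This is a generic illustration of the idea, not the actual model implemented in fire.py; the stretch factor and width are assumptions.

```python
# Generic elliptical-kernel illustration; parameters are assumptions.
import numpy as np


def elliptical_kernel(size: int, wind_dir_deg: float, sigma: float = 3.0, elongation: float = 2.0) -> np.ndarray:
    """Gaussian kernel stretched along the wind direction, normalized to sum to 1."""
    half = size // 2
    y, x = np.mgrid[-half : half + 1, -half : half + 1]
    theta = np.deg2rad(wind_dir_deg)
    # Rotate grid offsets so u points along the wind and v across it.
    u = x * np.cos(theta) + y * np.sin(theta)
    v = -x * np.sin(theta) + y * np.cos(theta)
    kernel = np.exp(-0.5 * ((u / (sigma * elongation)) ** 2 + (v / sigma) ** 2))
    return kernel / kernel.sum()


kernel = elliptical_kernel(size=31, wind_dir_deg=45.0)  # 31x31 kernel elongated along a 45° wind
```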

Data management (input-data/)

Organized storage for input datasets and ingestion scripts:

Tensor data (tensor/)

  • conus404/ - CONUS404 climate reanalysis data
  • USFS_fire_risk/ - USFS fire risk rasters

Vector data (vector/)

  • aggregated_regions/ - Regional boundaries for aggregation
  • alexandre-2016/ - Historical fire perimeter data
  • calfire_stuctures_destroyed/ - Structure damage records
  • overture_vector/ - Building footprints from Overture Maps

Note

Raw data files are typically not committed. This directory contains ingestion scripts and metadata. Large datasets are stored on S3.


Research notebooks (notebooks/)

Exploratory Jupyter notebooks for prototyping and analysis:

  • conus404-winds.ipynb - Wind data exploration
  • elliptical_kernel.ipynb - Fire spread kernel development
  • fire-weather-wind-mode-reprojected.ipynb - Wind mode analysis
  • horizontal-scaling-via-spatial-chunking.ipynb - Performance optimization experiments
  • wind_spread.ipynb - Wind-driven fire spread modeling

Note

Convention: When a notebook reaches maturity and demonstrates stable workflows, consider converting it into a tutorial under docs/tutorials/.


Documentation (docs/)

Follows the Diátaxis framework for structured documentation:

```
docs/
├── how-to/              # Task-oriented guides
│   ├── installation.md
│   ├── getting-started.md
│   ├── snapshot-testing.md
│   ├── release-procedure.md
│   └── work-with-data.ipynb
├── tutorials/           # Learning-oriented lessons
│   ├── calculate-wind-adjusted-fire-risk-for-a-region.ipynb
│   └── data-pipeline.md
├── reference/           # Information-oriented technical specs
│   ├── api.md           # Auto-generated API docs
│   ├── data-schema.md
│   ├── data-downloads.md
│   ├── deployment.md
│   └── project-structure.md (this file)
├── methods/             # Explanation-oriented background
│   └── fire-risk/       # Scientific methodology
├── assets/              # Images, stylesheets, static files
└── index.md             # Documentation home page
```

Documentation is built with MkDocs Material and deployed automatically on merge to main.


Testing (tests/)

Comprehensive test suite with unit and integration tests:

| File | Purpose |
| --- | --- |
| conftest.py | Pytest fixtures and configuration |
| test_config.py | Configuration model validation |
| test_conus404.py | CONUS404 data loading and transformations |
| test_datasets.py | Dataset catalog and access patterns |
| test_managers.py | Batch manager orchestration logic |
| test_utils.py | Utility function tests |
| test_pipeline_snapshots.py | Snapshot-based integration tests for pipeline outputs |
| risks/ | Risk model tests |

Test execution:

```
pixi run tests              # Unit tests only
pixi run tests-integration  # Integration tests (may require S3 access)
```
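
The snapshot tests follow the usual pattern of comparing freshly computed outputs against stored references. A simplified stand-in looks roughly like the sketch below; the snapshot path and test name are hypothetical, and the real helpers in ocr/testing.py wrap this pattern for xarray and GeoPandas objects.

```python
# Simplified stand-in for the snapshot pattern; the snapshot path is hypothetical.
import xarray as xr


def test_wind_output_matches_snapshot():
    # Compute (or load) the pipeline output under test.
    result = xr.DataArray([1.0, 2.0, 3.0], dims="x", name="wind_speed")
    # Compare against a previously stored reference array.
    expected = xr.open_dataarray("tests/snapshots/wind_speed.nc")
    xr.testing.assert_allclose(result, expected)
```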

Configuration files

Package and environment

  • pyproject.toml - Project metadata, dependencies (managed by Pixi), build config, tool settings (ruff, pytest, coverage)
  • pixi.lock - Locked dependency versions for reproducible environments
  • environment.yaml - Conda environment export (auto-generated from Pixi for Coiled deployments)

Documentation

  • mkdocs.yml - MkDocs configuration: theme, plugins, navigation structure

Environment templates

  • ocr-local.env - Template for local development (uses local filesystem)
  • ocr-coiled-s3.env - Template for cloud execution (S3 backend)
  • ocr-coiled-s3-staging.env - Staging environment configuration
  • ocr-coiled-s3-production.env - Production environment configuration

Code quality

  • .pre-commit-config.yaml - Pre-commit hooks for linting and formatting
  • .prettierrc.json - Prettier configuration for Markdown/YAML formatting
  • codecov.yml - Code coverage reporting configuration

Infrastructure (bucket_creation/)

Helper scripts for cloud infrastructure setup (a minimal example follows the list):

  • create_s3_bucket.py - Script to create and configure S3 buckets with appropriate permissions and lifecycle policies
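
For orientation, bucket creation with a lifecycle rule typically looks like the boto3 sketch below. The bucket name, region, and lifecycle rule are placeholders, not the script's actual settings.

```python
# Placeholder bucket name, region, and lifecycle rule for illustration only.
import boto3

s3 = boto3.client("s3", region_name="us-west-2")

# Create the bucket in the chosen region.
s3.create_bucket(
    Bucket="example-ocr-bucket",
    CreateBucketConfiguration={"LocationConstraint": "us-west-2"},
)

# Expire scratch objects after 30 days to control storage costs.
s3.put_bucket_lifecycle_configuration(
    Bucket="example-ocr-bucket",
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "expire-scratch",
                "Filter": {"Prefix": "scratch/"},
                "Status": "Enabled",
                "Expiration": {"Days": 30},
            }
        ]
    },
)
```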

CI/CD (.github/)

GitHub Actions workflows for automated testing, building, and deployment:

  • workflows/ - CI/CD pipeline definitions (tests, linting, docs deployment, releases)
  • scripts/ - Helper scripts for environment export and Coiled software creation
  • copilot-instructions.md - Project-specific instructions for GitHub Copilot
  • dependabot.yaml - Automated dependency updates configuration
  • release-drafter.yml - Automated release notes generation

Development workflows

Adding new code

  1. Create a module under ocr/ (or in the appropriate subpackage)
  2. Add tests under tests/ (unit tests are required; add integration tests for complex scenarios)
  3. Update documentation:
    • Add a how-to guide if introducing a new user-facing workflow
    • Update the API reference if adding public functions or classes
    • Add a methods explanation if introducing a new scientific approach

Adding new datasets

  1. Create an ingestion script under input-data/tensor/ or input-data/vector/
  2. Register the dataset in the ocr.datasets catalog with metadata (see the sketch after this list)
  3. Document provenance under docs/methods/ with sources, processing steps, and limitations
  4. Add schema documentation to docs/reference/data-schema.md
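
What "registering" looks like depends on the catalog API in ocr/datasets.py; the dataclass and register() call below are purely hypothetical stand-ins meant only to convey the idea of a catalog entry with metadata.

```python
# Hypothetical catalog sketch; the real ocr.datasets API may differ.
from dataclasses import dataclass


@dataclass
class Dataset:
    name: str
    uri: str          # e.g. an s3:// path to Zarr or GeoParquet
    format: str       # "zarr" or "geoparquet"
    description: str


CATALOG: dict[str, Dataset] = {}


def register(ds: Dataset) -> None:
    CATALOG[ds.name] = ds


register(
    Dataset(
        name="example-buildings",
        uri="s3://example-bucket/vector/buildings.parquet",
        format="geoparquet",
        description="Building footprints used for risk sampling (placeholder entry).",
    )
)
```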

Updating documentation

  1. Choose the appropriate section based on the Diátaxis framework:
    • How-to guides: task-oriented, assume prior knowledge
    • Tutorials: learning-oriented, step-by-step for beginners
    • Reference: information-oriented, technical specifications
    • Methods: explanation-oriented, scientific background
  2. Update the navigation in mkdocs.yml if adding new pages
  3. Test locally: run pixi run docs-serve to preview changes
  4. Submit a PR: documentation builds are tested in CI

Release workflow

See docs/how-to/release-procedure.md for detailed release instructions.