# Project structure
This page documents the OCR repository layout and explains the purpose of key directories and files. Use this as a technical reference when contributing code, adding datasets, or extending documentation.
## Repository overview
The OCR platform is organized into distinct layers: the core Python package (`ocr/`), supporting infrastructure (configuration, deployment, testing), input data management, exploratory research notebooks, and comprehensive documentation. The structure follows best practices for scientific Python projects, with an emphasis on reproducibility, modularity, and cloud-native execution.
```mermaid
graph TB
    subgraph Repository["OCR Repository"]
        subgraph Core["Core Package"]
            OCR[ocr/]
            OCR --> CONFIG[config.py - Configuration models]
            OCR --> TYPES[types.py - Type definitions]
            OCR --> DATASETS[datasets.py - Data catalog]
            OCR --> CONUS[conus404.py - Climate data]
            OCR --> UTILS[utils.py - Utilities]
            OCR --> TESTING[testing.py - Test helpers]
            OCR --> DEPLOY[deploy/]
            DEPLOY --> CLI[cli.py - CLI app]
            DEPLOY --> MANAGERS[managers.py - Orchestration]
            OCR --> PIPELINE[pipeline/]
            PIPELINE --> PROCESS[process_region.py]
            PIPELINE --> PARTITION[partition.py]
            PIPELINE --> STATS[fire_wind_risk_regional_aggregator.py]
            PIPELINE --> PMTILES[create_building_pmtiles.py + create_regional_pmtiles.py]
            PIPELINE --> WRITERS[write_*_analysis_files.py]
            OCR --> RISKS[risks/]
            RISKS --> FIRE[fire.py - Risk models]
        end
        subgraph Data["Data & Inputs"]
            INPUT[input-data/]
            INPUT --> TENSOR[tensor/ - CONUS404, USFS fire risk]
            INPUT --> VECTOR[vector/ - Buildings, regions, structures]
        end
        subgraph Research["Research & Exploration"]
            NOTEBOOKS[notebooks/]
            NOTEBOOKS --> NB1[Wind analysis notebooks]
            NOTEBOOKS --> NB2[Fire risk kernels]
            NOTEBOOKS --> NB3[Scaling experiments]
        end
        subgraph Docs["Documentation"]
            DOCSDIR[docs/]
            DOCSDIR --> HOWTO[how-to/ - Guides]
            DOCSDIR --> METHODS[methods/ - Science docs]
            DOCSDIR --> REFERENCE[reference/ - API & specs]
            DOCSDIR --> TUTORIALS[tutorials/ - Workflows]
        end
        subgraph Testing["Testing & QA"]
            TESTS[tests/]
            TESTS --> UNIT[Unit tests]
            TESTS --> INTEGRATION[Integration tests]
            TESTS --> SNAPSHOTS[Snapshot tests]
        end
        subgraph Config["Configuration & Build"]
            PYPROJECT[pyproject.toml - Package config]
            PIXI[pixi.lock - Environment lock]
            MKDOCS[mkdocs.yml - Docs config]
            ENV[ocr-*.env - Environment vars]
            GITHUB[.github/ - CI/CD workflows]
        end
        subgraph Infra["Infrastructure"]
            BUCKET[bucket_creation/ - S3 setup]
        end
    end
    style Core fill:#e1f5ff
    style Data fill:#fff4e1
    style Research fill:#f3e8ff
    style Docs fill:#e8f5e9
    style Testing fill:#ffebee
    style Config fill:#fafafa
    style Infra fill:#fff9c4
```
## Core package (`ocr/`)
Contains all production code organized into logical modules:
### Top-level modules
| Module | Purpose |
|---|---|
| `config.py` | Pydantic models for storage, chunking, Coiled, and processing configuration |
| `types.py` | Type definitions and enums (`Environment`, `Platform`, `RiskType`, `RegionType`) |
| `datasets.py` | Catalog abstraction for Zarr and GeoParquet datasets in S3 storage |
| `conus404.py` | CONUS404 climate data helpers: load variables, compute humidity, wind transformations |
| `utils.py` | DuckDB utilities, S3 secrets, vector sampling, file transfer helpers |
| `testing.py` | Snapshot testing extensions for xarray and GeoPandas |
| `console.py` | Rich console instance for pretty terminal output |
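To make the table concrete, the sketch below shows the general Pydantic pattern that `config.py` builds on. All class and field names here are hypothetical illustrations; see the module itself for the actual models.

```python
# Illustrative only: the real models in ocr/config.py likely differ in
# names and fields. This shows the Pydantic pattern the module is built on.
from pydantic import BaseModel, Field


class StorageConfig(BaseModel):
    """Where inputs and outputs live (hypothetical fields)."""

    protocol: str = Field(default="s3", description="'s3' or 'file'")
    bucket: str = "example-ocr-bucket"


class ProcessingConfig(BaseModel):
    """Chunking and processing knobs (hypothetical fields)."""

    chunk_size: int = 1024
    storage: StorageConfig = StorageConfig()


config = ProcessingConfig()  # validated against the model's types and defaults
print(config.model_dump_json(indent=2))
```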
### Deployment (`deploy/`)
Orchestration layer for local and cloud execution:
- `cli.py` - Typer-based CLI application (the `ocr` command) with commands for processing regions, aggregation, PMTiles generation, and analysis file creation
- `managers.py` - Abstract batch manager interface with `CoiledBatchManager` (cloud) and `LocalBatchManager` (local) implementations
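The manager layer is essentially a strategy pattern: one interface with a local backend and a cloud backend. A minimal sketch of that shape (method names are hypothetical, not the actual `managers.py` API):

```python
# Minimal sketch of the batch-manager pattern; the real interface in
# ocr/deploy/managers.py may expose different methods and configuration.
from abc import ABC, abstractmethod


class BatchManager(ABC):
    """Common interface for running pipeline jobs."""

    @abstractmethod
    def submit(self, region: str) -> None:
        """Queue processing for a single region."""


class LocalBatchManager(BatchManager):
    def submit(self, region: str) -> None:
        print(f"processing {region} in-process")  # e.g., call the pipeline directly


class CoiledBatchManager(BatchManager):
    def submit(self, region: str) -> None:
        print(f"submitting {region} to a Coiled batch")  # e.g., launch a cloud job


def run(manager: BatchManager, regions: list[str]) -> None:
    for region in regions:
        manager.submit(region)


run(LocalBatchManager(), ["region-a", "region-b"])
```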
### Pipeline (`pipeline/`)
Internal processing modules coordinated by the CLI. These implement the data processing workflow:
- `process_region.py` - Sample risk values to building locations
- `partition.py` - Partition GeoParquet by geographic regions
- `fire_wind_risk_regional_aggregator.py` - Compute regional statistics with DuckDB
- `create_building_pmtiles.py` - Generate PMTiles for building data visualization
- `create_regional_pmtiles.py` - Generate region-specific PMTiles
- `write_aggregated_region_analysis_files.py` - Write regional summary tables
- `write_per_region_analysis_files.py` - Write per-region analysis outputs (CSV/JSON)
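Taken together, these modules form a rough sequence: sample risks per region, partition the outputs, aggregate statistics, then build tiles and analysis files. The sketch below illustrates that flow with hypothetical stand-in functions; in the real repository each step is a separate CLI command, not one Python function.

```python
# Hypothetical orchestration sketch; each stub names the pipeline module
# whose role it stands in for. Not how the ocr CLI actually wires steps.
def process_region(region: str) -> None: ...   # process_region.py
def partition_by_region() -> None: ...         # partition.py
def aggregate_regional_stats() -> None: ...    # fire_wind_risk_regional_aggregator.py
def build_pmtiles() -> None: ...               # create_*_pmtiles.py
def write_analysis_files() -> None: ...        # write_*_analysis_files.py


def run_pipeline(regions: list[str]) -> None:
    for region in regions:
        process_region(region)   # sample risk values to building locations
    partition_by_region()        # split GeoParquet outputs by region
    aggregate_regional_stats()   # DuckDB regional statistics
    build_pmtiles()              # map tiles for visualization
    write_analysis_files()       # CSV/JSON summaries


run_pipeline(["region-a", "region-b"])
```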
### Risk models (`risks/`)
Domain-specific risk calculation logic:
- `fire.py` - Fire/wind risk kernels, wind classification, elliptical spread models
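To give a flavor of what an elliptical spread model involves (a generic illustration, not the actual kernels in `fire.py`): such a kernel weights neighboring cells by an ellipse elongated along the wind direction, so spread is favored downwind.

```python
# Generic elliptical-kernel illustration, NOT the implementation in
# ocr/risks/fire.py. It marks cells inside an ellipse whose major axis
# follows the wind direction, a common shape for wind-driven spread.
import numpy as np


def elliptical_kernel(size: int, wind_deg: float, elongation: float) -> np.ndarray:
    """Binary kernel: 1 inside an ellipse rotated to the wind direction."""
    half = size // 2
    y, x = np.mgrid[-half : half + 1, -half : half + 1]
    theta = np.deg2rad(wind_deg)
    # Rotate coordinates so the ellipse's major axis follows the wind.
    u = x * np.cos(theta) + y * np.sin(theta)   # along-wind axis
    v = -x * np.sin(theta) + y * np.cos(theta)  # cross-wind axis
    a, b = half, half / elongation              # semi-major, semi-minor axes
    return ((u / a) ** 2 + (v / b) ** 2 <= 1.0).astype(float)


kernel = elliptical_kernel(size=9, wind_deg=45.0, elongation=2.0)
print(kernel)
```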
## Data management (`input-data/`)
Organized storage for input datasets and ingestion scripts:
### Tensor data (`tensor/`)
- `conus404/` - CONUS404 climate reanalysis data
- `USFS_fire_risk/` - USFS fire risk rasters
### Vector data (`vector/`)
- `aggregated_regions/` - Regional boundaries for aggregation
- `alexandre-2016/` - Historical fire perimeter data
- `calfire_stuctures_destroyed/` - Structure damage records
- `overture_vector/` - Building footprints from Overture Maps
!!! note
    Raw data files are typically not committed; this directory contains ingestion scripts and metadata. Large datasets are stored on S3.
## Research notebooks (`notebooks/`)
Exploratory Jupyter notebooks for prototyping and analysis:
- `conus404-winds.ipynb` - Wind data exploration
- `elliptical_kernel.ipynb` - Fire spread kernel development
- `fire-weather-wind-mode-reprojected.ipynb` - Wind mode analysis
- `horizontal-scaling-via-spatial-chunking.ipynb` - Performance optimization experiments
- `wind_spread.ipynb` - Wind-driven fire spread modeling
!!! note
    Convention: when a notebook reaches maturity and demonstrates stable workflows, consider converting it into a tutorial under `docs/tutorials/`.
## Documentation (`docs/`)
Follows the Diátaxis framework for structured documentation:
```text
docs/
├── how-to/                   # Task-oriented guides
│   ├── installation.md
│   ├── getting-started.md
│   ├── snapshot-testing.md
│   ├── release-procedure.md
│   └── work-with-data.ipynb
├── tutorials/                # Learning-oriented lessons
│   ├── calculate-wind-adjusted-fire-risk-for-a-region.ipynb
│   └── data-pipeline.md
├── reference/                # Information-oriented technical specs
│   ├── api.md                # Auto-generated API docs
│   ├── data-schema.md
│   ├── data-downloads.md
│   ├── deployment.md
│   └── project-structure.md  # This file
├── methods/                  # Explanation-oriented background
│   └── fire-risk/            # Scientific methodology
├── assets/                   # Images, stylesheets, static files
└── index.md                  # Documentation home page
```
Documentation is built with MkDocs Material and deployed automatically on merge to `main`.
## Testing (`tests/`)
Comprehensive test suite with unit and integration tests:
| File | Purpose |
|---|---|
| `conftest.py` | Pytest fixtures and configuration |
| `test_config.py` | Configuration model validation |
| `test_conus404.py` | CONUS404 data loading and transformations |
| `test_datasets.py` | Dataset catalog and access patterns |
| `test_managers.py` | Batch manager orchestration logic |
| `test_utils.py` | Utility function tests |
| `test_pipeline_snapshots.py` | Snapshot-based integration tests for pipeline outputs |
| `risks/` | Risk model tests |
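Snapshot tests compare a computed output against a stored reference and fail when the output drifts. The generic pattern looks like the sketch below; the project's real helpers in `ocr/testing.py` extend this idea to xarray and GeoPandas objects, so treat names here as illustrative.

```python
# Generic snapshot-test pattern, not the project's actual fixtures.
import json
from pathlib import Path


def assert_matches_snapshot(result: dict, snapshot_path: Path) -> None:
    if not snapshot_path.exists():  # first run: record the reference output
        snapshot_path.write_text(json.dumps(result, indent=2, sort_keys=True))
        return
    expected = json.loads(snapshot_path.read_text())
    assert result == expected, f"output drifted from {snapshot_path}"


def test_regional_summary(tmp_path: Path) -> None:
    summary = {"region": "demo", "mean_risk": 0.42}  # stand-in pipeline output
    assert_matches_snapshot(summary, tmp_path / "summary.json")
```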
Test execution:
```bash
pixi run tests              # Unit tests only
pixi run tests-integration  # Integration tests (may require S3 access)
```
## Configuration files
### Package and environment
- `pyproject.toml` - Project metadata, dependencies (managed by Pixi), build config, and tool settings (ruff, pytest, coverage)
- `pixi.lock` - Locked dependency versions for reproducible environments
- `environment.yaml` - Conda environment export (auto-generated from Pixi for Coiled deployments)
### Documentation
- `mkdocs.yml` - MkDocs configuration: theme, plugins, navigation structure
### Environment templates
- `ocr-local.env` - Template for local development (uses local filesystem)
- `ocr-coiled-s3.env` - Template for cloud execution (S3 backend)
- `ocr-coiled-s3-staging.env` - Staging environment configuration
- `ocr-coiled-s3-production.env` - Production environment configuration
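How these templates are loaded depends on your tooling: sourcing them in the shell works, as does something like the python-dotenv sketch below (the variable name shown is hypothetical, and the project may load these files differently).

```python
# One possible way to apply an environment template before running the CLI;
# shell `source` or Pixi tasks are equally valid. Variable name is made up.
import os

from dotenv import load_dotenv  # pip install python-dotenv

load_dotenv("ocr-local.env")
print(os.environ.get("OCR_ENVIRONMENT", "<unset>"))  # hypothetical variable
```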
### Code quality
- `.pre-commit-config.yaml` - Pre-commit hooks for linting and formatting
- `.prettierrc.json` - Prettier configuration for Markdown/YAML formatting
- `codecov.yml` - Code coverage reporting configuration
## Infrastructure (`bucket_creation/`)
Helper scripts for cloud infrastructure setup:
- `create_s3_bucket.py` - Script to create and configure S3 buckets with appropriate permissions and lifecycle policies
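In broad strokes, such a script pairs bucket creation with a lifecycle policy. A minimal boto3 sketch of the idea (bucket name, region, and rule are illustrative, not what the real script does):

```python
# Illustrative boto3 sketch of bucket setup with a lifecycle rule; not the
# actual bucket_creation/create_s3_bucket.py. Names and policies are made up.
import boto3

s3 = boto3.client("s3", region_name="us-west-2")
s3.create_bucket(
    Bucket="example-ocr-bucket",
    CreateBucketConfiguration={"LocationConstraint": "us-west-2"},
)
s3.put_bucket_lifecycle_configuration(
    Bucket="example-ocr-bucket",
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "expire-scratch-outputs",
                "Filter": {"Prefix": "scratch/"},
                "Status": "Enabled",
                "Expiration": {"Days": 30},  # clean up scratch data after 30 days
            }
        ]
    },
)
```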
## CI/CD (`.github/`)
GitHub Actions workflows for automated testing, building, and deployment:
- `workflows/` - CI/CD pipeline definitions (tests, linting, docs deployment, releases)
- `scripts/` - Helper scripts for environment export and Coiled software creation
- `copilot-instructions.md` - Project-specific instructions for GitHub Copilot
- `dependabot.yaml` - Automated dependency updates configuration
- `release-drafter.yml` - Automated release notes generation
## Development workflows
### Adding new code
- Create a module under `ocr/` (or in the appropriate subpackage)
- Add tests under `tests/`; unit tests are required, and integration tests for complex scenarios (a minimal test sketch follows this list)
- Update documentation:
    - Add a how-to guide if introducing a new user-facing workflow
    - Update the API reference if adding public functions/classes
    - Add a methods explanation if introducing a new scientific approach
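For the testing step, a unit test for a new module follows the standard pytest shape; a minimal, purely hypothetical illustration:

```python
# tests/test_example.py — hypothetical; shows only the expected shape of a
# unit test. In the real repo the function under test would live in ocr/
# and be imported from there.
def add_one(x: int) -> int:  # stand-in for a function in a new ocr module
    return x + 1


def test_add_one():
    assert add_one(1) == 2
```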
### Adding new datasets
- Create an ingestion script under `input-data/tensor/` or `input-data/vector/`
- Register the dataset in the `ocr.datasets` catalog with metadata (a registration sketch follows this list)
- Document provenance under `docs/methods/` with sources, processing steps, and limitations
- Add schema documentation to `docs/reference/data-schema.md`
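What registration looks like depends on the actual catalog API in `ocr.datasets`; the sketch below conveys only the kind of metadata a catalog entry carries (all names are hypothetical).

```python
# Hypothetical registration sketch; consult ocr/datasets.py for the real
# catalog API. Shown only to illustrate the metadata a dataset entry needs.
from dataclasses import dataclass


@dataclass
class DatasetEntry:
    name: str
    uri: str
    format: str  # e.g. "zarr" or "geoparquet"
    description: str


CATALOG: dict[str, DatasetEntry] = {}


def register(entry: DatasetEntry) -> None:
    CATALOG[entry.name] = entry


register(
    DatasetEntry(
        name="example-fire-risk",
        uri="s3://example-bucket/tensor/example_fire_risk.zarr",
        format="zarr",
        description="Example raster dataset (illustrative only).",
    )
)
```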
### Updating documentation
- Choose the appropriate section based on the Diátaxis framework:
    - How-to guides: task-oriented, assume prior knowledge
    - Tutorials: learning-oriented, step-by-step for beginners
    - Reference: information-oriented, technical specifications
    - Methods: explanation-oriented, scientific background
- Update navigation in `mkdocs.yml` if adding new pages
- Test locally with `pixi run docs-serve` to preview changes
- Submit a PR: documentation builds are tested in CI
### Release workflow
See `docs/how-to/release-procedure.md` for detailed release instructions.