feat: lazy dataset ingestion by lewisjared · Pull Request #515 · Climate-REF/climate-ref

lewisjared · 2026-02-05T11:47:38Z

Description

Implements lazy dataset ingestion to support millions of CMIP6 files on HPC parallel file systems without opening every netCDF at ingest time.

Two-phase approach:

Phase 1 (Bootstrap): The DRS parser extracts metadata from directory/filename structure only (finalised=False). A TABLE_ID_TO_FREQUENCY mapping infers frequency from table_id without file I/O.
Phase 2 (Lazy Finalization): At solve time, after filtering and grouping narrows candidates, only matched files are opened via the complete parser. Metadata is extracted, persisted to the DB, and cached so subsequent solves skip re-parsing.
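As a rough illustration of Phase 1, the sketch below derives metadata from a CMIP6 DRS path and infers frequency from `table_id` without touching the file. The mapping shown is a small illustrative subset, and `parse_drs_path` and `DRS_DIRECTORY_FACETS` are hypothetical names, not the parser added in this PR.

```python
from pathlib import PurePosixPath

# Illustrative subset only; the mapping added in this PR is not reproduced here.
TABLE_ID_TO_FREQUENCY = {
    "Amon": "mon",
    "Omon": "mon",
    "Lmon": "mon",
    "day": "day",
    "fx": "fx",
}

# CMIP6 DRS directory levels, from mip_era down to the version directory.
DRS_DIRECTORY_FACETS = (
    "mip_era", "activity_id", "institution_id", "source_id", "experiment_id",
    "member_id", "table_id", "variable_id", "grid_label", "version",
)


def parse_drs_path(path: str) -> dict:
    """Hypothetical helper: build metadata from the path alone (no file I/O)."""
    parts = PurePosixPath(path).parts
    n = len(DRS_DIRECTORY_FACETS)
    if len(parts) < n + 1:
        raise ValueError(f"path does not look like a CMIP6 DRS path: {path}")
    # The last element is the filename; the n elements before it are the facets.
    facets = dict(zip(DRS_DIRECTORY_FACETS, parts[-(n + 1):-1]))
    # Frequency comes from the lookup table; unknown tables yield None.
    facets["frequency"] = TABLE_ID_TO_FREQUENCY.get(facets["table_id"])
    # Mark the record as needing the full parser later.
    facets["finalised"] = False
    return facets
```

Fields that only exist inside the netCDF file are deliberately absent from the result; under this sketch they would be filled in during Phase 2.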

Key components:

FinaliseableDatasetAdapterMixin — base class for adapters supporting two-phase ingestion
CMIP6DatasetAdapter.finalise_datasets() — opens unfinalised netCDFs, extracts metadata, applies fixes, persists to DB
DataCatalog wrapper — replaces raw pd.DataFrame in the solver with lazy loading (to_frame()) and per-group finalization (finalise(subset))
Solver integration — finalization happens after filter+group_by but before constraint checking
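The wrapper idea can be sketched as below. This is a simplified stand-in, not the class in this PR: a list of dicts replaces the `pd.DataFrame`, and the `load_rows` and `finalise_rows` callables are hypothetical placeholders for the database query and the adapter's `finalise_datasets()`. Only the method names `to_frame()` and `finalise(subset)` come from the description above.

```python
from typing import Callable, Optional

Row = dict  # stand-in for one catalog row


class DataCatalog:
    """Simplified sketch: lazy loading plus per-subset finalisation."""

    def __init__(
        self,
        load_rows: Callable[[], list],          # hypothetical: reads rows from the DB
        finalise_rows: Callable[[list], list],  # hypothetical: opens files, returns extra metadata
    ) -> None:
        self._load_rows = load_rows
        self._finalise_rows = finalise_rows
        self._rows: Optional[list] = None

    def to_frame(self) -> list:
        # Load once on first access, then serve the cached rows.
        if self._rows is None:
            self._rows = self._load_rows()
        return self._rows

    def finalise(self, subset: list) -> list:
        # Only rows still marked unfinalised trigger file I/O.
        pending = [row for row in subset if not row["finalised"]]
        if pending:
            for row, extra in zip(pending, self._finalise_rows(pending)):
                row.update(extra)        # rows are shared with the cache
                row["finalised"] = True  # later calls skip this row
        return subset
```

Because the subset shares row objects with the cache in this sketch, a second `finalise` call on the same rows does no further work, which mirrors the "subsequent solves skip re-parsing" behaviour described in Phase 2.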

Also includes a refactor removing the unused config parameter from register_dataset.

Checklist

Please confirm that this pull request has done the following:

Tests added
Documentation added (where applicable)
Changelog item added to changelog/

…lver DataCatalog wraps the per-adapter catalog DataFrame with lazy loading from the database and per-group finalization of unfinalised datasets. The solver now finalizes datasets after filter+group_by but before constraint checking, so only matched candidates trigger file I/O.
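The ordering described here (filter, group, finalise, then check constraints) can be illustrated with a small stand-alone sketch. The function `solve_groups` and all of its parameters are hypothetical and use plain dicts; they are not the solver's real interface.

```python
from collections import defaultdict
from typing import Callable, Hashable


def solve_groups(
    rows: list,
    matches: Callable[[dict], bool],        # hypothetical filter predicate
    group_key: Callable[[dict], Hashable],  # hypothetical group_by key
    finalise: Callable[[list], list],       # hypothetical: opens files for one group
    satisfies: Callable[[list], bool],      # hypothetical constraint check
) -> dict:
    # 1. Filter using metadata already known from the path.
    candidates = [row for row in rows if matches(row)]

    # 2. Group the surviving candidates.
    groups = defaultdict(list)
    for row in candidates:
        groups[group_key(row)].append(row)

    # 3. Finalise each group, then 4. check constraints on complete metadata.
    results = {}
    for key, group in groups.items():
        finalised = finalise(group)
        if satisfies(finalised):
            results[key] = finalised
    return results
```

Rows rejected in step 1 never reach `finalise`, so in this sketch the number of files opened scales with the matched candidates rather than with the size of the archive.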

Add unit tests for DataCatalog (lazy loading, cache, finalization paths), CMIP6DatasetAdapter.finalise_datasets (DRS-to-complete round trip), and solver integration (extract_covered_datasets with DataCatalog). Fix a bug where finalise_datasets did not convert start_time/end_time strings from the complete parser to datetime objects before persisting.
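The kind of conversion involved in the start_time/end_time fix might look like the sketch below. It assumes the strings are ISO 8601 formatted, which the source does not state, and `coerce_time_fields` is a hypothetical helper rather than the code in this PR. Dates in non-standard model calendars may not be representable this way.

```python
from datetime import datetime


def coerce_time_fields(metadata: dict, fields=("start_time", "end_time")) -> dict:
    """Hypothetical helper: turn ISO 8601 strings into datetime objects."""
    converted = dict(metadata)  # leave the caller's dict untouched
    for field in fields:
        value = converted.get(field)
        if isinstance(value, str):
            # Assumption: values are ISO 8601 strings.
            converted[field] = datetime.fromisoformat(value)
    return converted
```

Values that are already `datetime` objects, or missing, pass through unchanged.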

lewisjared · 2026-02-05T12:29:15Z

Replaces #369

lewisjared added 6 commits February 5, 2026 22:23

refactor: remove unused config param from register_dataset

e9c1c40

The Config object was passed to register_dataset but never used inside the method. Removing it simplifies the API and makes the dependency on Database explicit.

feat: infer frequency from table_id in DRS parser

51b952d

Add a TABLE_ID_TO_FREQUENCY mapping so the DRS parser can determine frequency without opening netCDF files. This is a prerequisite for lazy dataset ingestion where full file I/O is deferred.

feat: add FinaliseableDatasetAdapterMixin and implement on CMIP6 adapter

45c0f88

Introduce a mixin for adapters that support two-phase ingestion. CMIP6DatasetAdapter.finalise_datasets() opens unfinalised netCDF files, extracts full metadata, applies fixes, and persists back to the database.

docs: add changelog for lazy dataset ingestion

142c629

lewisjared changed the title ~~feat: lazy dataset ingestion with two-phase finalization~~ feat: lazy dataset ingestion Feb 5, 2026

lewisjared added 2 commits February 5, 2026 23:52

docs: update ingestion documentation for lazy finalization

f9fba90

Update documentation across tutorials, background, and how-to guides to reflect the two-phase ingestion approach where CMIP6 DRS parsing extracts metadata from paths only, with full metadata completed lazily at solve time.

docs: use British English spelling conventions

df08c0f

lewisjared wants to merge 8 commits into cmip7-data-requirements from lazy-dataset-ingestion
