feat: lazy dataset ingestion #515
Open
lewisjared wants to merge 8 commits into cmip7-data-requirements from
The Config object was passed to register_dataset but never used inside the method. Removing it simplifies the API and makes the dependency on Database explicit.
Add a TABLE_ID_TO_FREQUENCY mapping so the DRS parser can determine frequency without opening netCDF files. This is a prerequisite for lazy dataset ingestion where full file I/O is deferred.
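As a rough illustration of the idea in this commit, frequency can be looked up from the `table_id` field of a CMIP6 file name without touching the file. The mapping below is a small illustrative subset, not the PR's actual `TABLE_ID_TO_FREQUENCY`, and `frequency_from_filename` is a hypothetical helper name.

```python
from pathlib import Path
from typing import Optional

# Illustrative subset only; the real mapping in the PR is not reproduced here.
TABLE_ID_TO_FREQUENCY = {
    "Amon": "mon",
    "Omon": "mon",
    "Lmon": "mon",
    "day": "day",
    "Oyr": "yr",
    "fx": "fx",
    "Ofx": "fx",
    "3hr": "3hr",
}


def frequency_from_filename(path: str) -> Optional[str]:
    """Infer frequency from a CMIP6-style file name without opening the file.

    CMIP6 file names start with <variable_id>_<table_id>_<source_id>_...,
    so table_id is the second underscore-separated field.
    """
    parts = Path(path).name.split("_")
    if len(parts) < 2:
        return None  # not a DRS-style file name
    # Unknown table_ids return None so the caller can defer to file I/O later.
    return TABLE_ID_TO_FREQUENCY.get(parts[1])
```

Returning `None` for an unknown `table_id` is a choice made for this sketch; the PR may handle unmapped tables differently.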
Introduce a mixin for adapters that support two-phase ingestion. CMIP6DatasetAdapter.finalise_datasets() opens unfinalised netCDF files, extracts full metadata, applies fixes, and persists back to the database.
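A minimal sketch of what a two-phase adapter contract could look like. `FinaliseableAdapterSketch`, `read_full_metadata`, and `FakeAdapter` are hypothetical names for illustration; records are plain dicts here, whereas the PR persists to a database, and the real mixin's interface is not shown in this conversation.

```python
from abc import ABC, abstractmethod


class FinaliseableAdapterSketch(ABC):
    """Hypothetical stand-in for the PR's mixin; not its real interface."""

    @abstractmethod
    def read_full_metadata(self, path: str) -> dict:
        """Phase 2: open the file and return complete metadata (expensive)."""

    def finalise_datasets(self, records: list) -> list:
        """Complete every record that is not yet marked as finalised."""
        finalised = []
        for record in records:
            if record.get("finalised"):
                finalised.append(record)  # already complete, no file I/O
                continue
            full = self.read_full_metadata(record["path"])
            # Full metadata overrides the path-derived values from phase 1.
            finalised.append({**record, **full, "finalised": True})
        return finalised


class FakeAdapter(FinaliseableAdapterSketch):
    """Test double that records which paths would have been opened."""

    def __init__(self) -> None:
        self.opened = []

    def read_full_metadata(self, path: str) -> dict:
        self.opened.append(path)
        return {"start_time": "1850-01-01", "end_time": "2014-12-31"}
```

With this shape, already-finalised records pass through untouched, so calling `finalise_datasets` repeatedly only pays the file-opening cost once per dataset.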
…lver DataCatalog wraps the per-adapter catalog DataFrame with lazy loading from the database and per-group finalization of unfinalised datasets. The solver now finalizes datasets after filter+group_by but before constraint checking, so only matched candidates trigger file I/O.
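The ordering described in this commit (filter, group, finalise, then check constraints) can be sketched as follows. This is a simplified stand-in under stated assumptions: `CatalogSketch` and `solve` are hypothetical names, and rows are plain dicts rather than the `pd.DataFrame` the real `DataCatalog` wraps.

```python
from collections import defaultdict
from typing import Callable, Optional


class CatalogSketch:
    """Hypothetical, simplified stand-in for DataCatalog."""

    def __init__(self, load: Callable[[], list], finalise_row: Callable[[dict], dict]):
        self._load = load
        self._finalise_row = finalise_row
        self._rows: Optional[list] = None  # nothing is read until first access

    def rows(self) -> list:
        """Load from the backing store once, then serve from the cache."""
        if self._rows is None:
            self._rows = self._load()
        return self._rows

    def finalise(self, subset: list) -> list:
        """Complete only the given rows; rows are updated in place in the cache."""
        for row in subset:
            if not row["finalised"]:
                row.update(self._finalise_row(row))
                row["finalised"] = True
        return subset


def solve(catalog: CatalogSketch, variable_id: str, constraint: Callable[[list], bool]) -> dict:
    # 1. Filter on metadata that is available from paths alone.
    matched = [r for r in catalog.rows() if r["variable_id"] == variable_id]
    # 2. Group the matched rows.
    groups = defaultdict(list)
    for row in matched:
        groups[row["source_id"]].append(row)
    # 3. Finalise per group, then 4. check constraints on complete metadata.
    result = {}
    for key, group in groups.items():
        group = catalog.finalise(group)  # file I/O only for matched candidates
        if constraint(group):
            result[key] = group
    return result
```

Because finalisation happens after filtering, rows that never match a filter are never opened, which is the point of deferring file I/O to solve time.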
Add unit tests for DataCatalog (lazy loading, cache, finalization paths), CMIP6DatasetAdapter.finalise_datasets (DRS-to-complete round trip), and solver integration (extract_covered_datasets with DataCatalog). Fix a bug where finalise_datasets did not convert start_time/end_time strings from the complete parser to datetime objects before persisting.
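The bug fix mentioned above amounts to coercing string timestamps before they are persisted. A minimal sketch of that step, assuming ISO-8601 strings and the standard library's `datetime`; `coerce_time_fields` is a hypothetical helper, and the PR's actual conversion (which may need to handle non-standard model calendars) is not shown here.

```python
from datetime import datetime


def coerce_time_fields(record: dict, fields=("start_time", "end_time")) -> dict:
    """Return a copy of record with string time fields parsed to datetime.

    Values that are already datetime objects, or None, are left untouched.
    """
    out = dict(record)
    for name in fields:
        value = out.get(name)
        if isinstance(value, str):
            # Assumes ISO-8601 input such as "1850-01-16 12:00:00".
            out[name] = datetime.fromisoformat(value)
    return out
```

Note that `datetime` cannot represent dates from calendars such as `360_day` (for example 30 February), so a real implementation would likely need a calendar-aware type for those datasets.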
Replaces #369
Update documentation across tutorials, background, and how-to guides to reflect the two-phase ingestion approach where CMIP6 DRS parsing extracts metadata from paths only, with full metadata completed lazily at solve time.
Description
Implements lazy dataset ingestion to support millions of CMIP6 files on HPC parallel file systems without opening every netCDF at ingest time.
Two-phase approach:
`finalised=False`). A `TABLE_ID_TO_FREQUENCY` mapping infers frequency from `table_id` without file I/O.
Key components:
- `FinaliseableDatasetAdapterMixin`: base class for adapters supporting two-phase ingestion
- `CMIP6DatasetAdapter.finalise_datasets()`: opens unfinalised netCDFs, extracts metadata, applies fixes, persists to DB
- `DataCatalog` wrapper: replaces raw `pd.DataFrame` in the solver with lazy loading (`to_frame()`) and per-group finalization (`finalise(subset)`)
Also includes a refactor removing the unused `config` parameter from `register_dataset`.
Checklist
Please confirm that this pull request has done the following:
changelog/