Description
Is your feature request related to a problem?
As I understand it, indexes are currently serialized by writing the full contents of their coords to disk. This has some admirable properties in terms of interoperability with other tools and backward compatibility, but also some pitfalls.
1. It places the burden on users of xarray to remember to re-install the index after loading, and there is no mechanism for remembering which coords had set_xindex called on them.
A simple example:
import xarray as xr
ds = xr.Dataset(
{"data": ("time", [0, 1, 2, 3])},
coords={"time": [0.1, 0.2, 0.3, 0.4], "time_metadata": ("time", [10, 15, 20, 25])},
).set_xindex("time_metadata")
ds
ds.to_zarr("extra_index.zarr", mode="w")
roundtripped = xr.open_zarr("extra_index.zarr")
assert len(ds.xindexes) == len(roundtripped.xindexes)
This fails with an assertion error.
2. For lazy indexes, it's possible to crash your computer or run out of disk space with a simple to_zarr.
For example, this snippet will create a 120 MiB zarr store:
import xarray as xr
step="1e-5"
idx1 = xr.indexes.RangeIndex.arange(0.0, 1000.0, float(step), dim="x")
ds1 = xr.Dataset(coords=xr.Coordinates.from_xindex(idx1))
ds1.to_zarr(f"float_range_step_{step}.zarr")
I also had to restart my laptop to get it unfrozen after running the snippet below (offending line commented out) with the step value from the example (https://xarray-indexes.readthedocs.io/builtin/range.html):
import xarray as xr
step="1e-9"
idx1 = xr.indexes.RangeIndex.arange(0.0, 1000.0, float(step), dim="x")
ds1 = xr.Dataset(coords=xr.Coordinates.from_xindex(idx1))
# ds1.to_zarr(f"float_range_step_{step}.zarr")
Describe the solution you'd like
Allow indexes more control over how they are serialized and support a mechanism for automatically re-creating them if they are available in the current environment, with a graceful fallback.
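To make the idea concrete, here is a rough sketch of what "index-controlled serialization" could look like for a range-like index. Nothing here is xarray API: `RangeSpec`, its methods, and the `index_type` key are hypothetical names made up for illustration. The point is only that the parameters of a lazy range fit in a few dozen bytes of metadata, while the materialized coordinate for the 1e-9 step would be on the order of terabytes.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class RangeSpec:
    """Hypothetical compact description of a range-like index (not an xarray class)."""

    start: float
    stop: float
    step: float
    dim: str

    @property
    def size(self) -> int:
        # Approximate number of points a materialized coordinate would hold.
        return round((self.stop - self.start) / self.step)

    def materialized_nbytes(self, itemsize: int = 8) -> int:
        # Bytes needed if every value were written out as float64.
        return self.size * itemsize

    def to_json(self) -> str:
        # What an index could write instead of its full coordinate values.
        return json.dumps({"index_type": "range", **asdict(self)})

    @classmethod
    def from_json(cls, text: str) -> "RangeSpec":
        payload = json.loads(text)
        if payload.pop("index_type", None) != "range":
            # A real loader would fall back to plain coords here instead of failing.
            raise ValueError("not a range index description")
        return cls(**payload)


spec = RangeSpec(start=0.0, stop=1000.0, step=1e-9, dim="x")
encoded = spec.to_json()
decoded = RangeSpec.from_json(encoded)

print(f"metadata size: {len(encoded)} bytes")
print(f"materialized size: {spec.materialized_nbytes() / 1e12:.0f} TB")
```

How the metadata would be attached to the store, and how an unknown `index_type` should degrade, are open design questions that this sketch does not try to answer.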
Describe alternatives you've considered
If writing a custom index you could write your own loading function, but that would not solve the coord-bomb behavior and places more friction on the user.
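For reference, a user-side loading helper along those lines might look roughly like the sketch below. The helper names and the `_extra_xindexes` attribute are my own inventions, and the sketch assumes every extra index covers a single coordinate and can be rebuilt with the default index class, which will not hold for custom or multi-coordinate indexes. The round trip is simulated in memory with drop_indexes so the snippet does not need a zarr store.

```python
import xarray as xr

# Hypothetical attribute name used to remember which coords had an index.
EXTRA_INDEX_ATTR = "_extra_xindexes"


def record_extra_indexes(ds: xr.Dataset) -> xr.Dataset:
    """Store the names of indexed non-dimension coords in the dataset attrs."""
    extra = [name for name in ds.xindexes if name not in ds.sizes]
    return ds.assign_attrs({EXTRA_INDEX_ATTR: extra})


def restore_extra_indexes(ds: xr.Dataset) -> xr.Dataset:
    """Re-install a default index on every recorded coord that lacks one."""
    for name in ds.attrs.get(EXTRA_INDEX_ATTR, []):
        if name in ds.coords and name not in ds.xindexes:
            ds = ds.set_xindex(name)
    return ds


ds = xr.Dataset(
    {"data": ("time", [0, 1, 2, 3])},
    coords={"time": [0.1, 0.2, 0.3, 0.4], "time_metadata": ("time", [10, 15, 20, 25])},
).set_xindex("time_metadata")

tagged = record_extra_indexes(ds)
# Simulate what a write/read round trip does today: the extra index is lost.
loaded = tagged.drop_indexes("time_metadata")
restored = restore_extra_indexes(loaded)
```

In practice one would call record_extra_indexes before to_zarr and restore_extra_indexes after open_zarr. Every user still has to know about both helpers, and nothing stops a lazy index from being fully written to disk, which is why this does not feel like a real solution.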
Additional context
No response