Feat: Add a mechanism for Serializing any index - to_zarr on a big RangeIndex causes a crash #11034

@ianhi

Is your feature request related to a problem?

As I understand it, indexes are currently serialized by writing the full contents of their coords to disk. This has some admirable properties in terms of interoperability with other tools and backward compatibility, but also some pitfalls.

1. It places the burden on users of xarray to remember to re-install the index after loading, and there is no mechanism for remembering which coords had set_xindex called on them.

A simple example:

import xarray as xr

ds = xr.Dataset(
    {"data": ("time", [0, 1, 2, 3])},
    coords={"time": [0.1, 0.2, 0.3, 0.4], "time_metadata": ("time", [10, 15, 20, 25])},
).set_xindex("time_metadata")
ds


ds.to_zarr("extra_index.zarr", mode="w")
roundtripped = xr.open_zarr("extra_index.zarr")

assert len(ds.xindexes) == len(roundtripped.xindexes)

The final assertion fails with an AssertionError: the roundtripped dataset has fewer indexes than the original.

2. For lazy indexes, it's possible to crash your computer or run out of disk space with a simple to_zarr call.

For example, this snippet creates a 120 MiB zarr store:

import xarray as xr
step = "1e-5"
idx1 = xr.indexes.RangeIndex.arange(0.0, 1000.0, float(step), dim="x")
ds1 = xr.Dataset(coords=xr.Coordinates.from_xindex(idx1))

ds1.to_zarr(f"float_range_step_{step}.zarr")

I had to restart my laptop to unfreeze it after running the snippet below (shown here with the offending line commented out), which uses the step value from the example at https://xarray-indexes.readthedocs.io/builtin/range.html:

import xarray as xr
step = "1e-9"
idx1 = xr.indexes.RangeIndex.arange(0.0, 1000.0, float(step), dim="x")
ds1 = xr.Dataset(coords=xr.Coordinates.from_xindex(idx1))

# ds1.to_zarr(f"float_range_step_{step}.zarr")
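
As a rough sanity check on why the second step value is so much worse, the number of values that would have to be materialized can be estimated without building anything. This is only a back-of-the-envelope sketch: `materialized_size` is a hypothetical helper, and it assumes 8-byte float64 values and ignores chunking and compression (so it will not match on-disk store sizes).

```python
from decimal import Decimal, ROUND_CEILING


def materialized_size(start, stop, step, itemsize=8):
    """Return (element count, bytes) for a fully materialized range.

    Arguments are decimal strings so the division is exact; itemsize=8
    assumes float64 values (an assumption, not read from any store).
    """
    span = Decimal(stop) - Decimal(start)
    # An arange-style range holds ceil((stop - start) / step) elements.
    n = int((span / Decimal(step)).to_integral_value(rounding=ROUND_CEILING))
    return n, n * itemsize


for step in ("1e-5", "1e-9"):
    n, nbytes = materialized_size("0.0", "1000.0", step)
    print(f"step={step}: {n:,} elements, {nbytes / 2**30:.2f} GiB uncompressed")
```

Under those assumptions, a step of 1e-5 means 10^8 values (roughly 0.75 GiB in memory), while a step of 1e-9 means 10^12 values (roughly 7.3 TiB), even though the index itself is fully described by three numbers and a dimension name.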

Describe the solution you'd like

Allow indexes more control over how they are serialized, and support a mechanism for automatically re-creating them on load when the index type is available in the current environment, with a graceful fallback when it is not.
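
To make the idea concrete, here is a minimal, library-independent sketch of what such a mechanism might look like. Everything in it is hypothetical and for illustration only: `ToyRangeIndex`, `INDEX_REGISTRY`, `serialize_index`, and `deserialize_index` are not xarray APIs, and the JSON payload is just a stand-in for whatever metadata a real store would hold.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class ToyRangeIndex:
    """Toy stand-in for a lazy range index (not xarray's RangeIndex)."""

    start: float
    stop: float
    step: float
    dim: str

    kind = "range"  # un-annotated class attribute, so not a dataclass field

    def to_metadata(self):
        # The index decides what to write: a few parameters, not its values.
        return {"kind": self.kind, **asdict(self)}

    @classmethod
    def from_metadata(cls, meta):
        return cls(**{k: v for k, v in meta.items() if k != "kind"})


# Hypothetical registry: index kinds available in the current environment.
INDEX_REGISTRY = {ToyRangeIndex.kind: ToyRangeIndex.from_metadata}


def serialize_index(index):
    return json.dumps(index.to_metadata())


def deserialize_index(payload, registry=INDEX_REGISTRY):
    meta = json.loads(payload)
    builder = registry.get(meta.get("kind"))
    if builder is None:
        # Graceful fallback: unknown kind, let the caller use its default.
        return None
    return builder(meta)


idx = ToyRangeIndex(0.0, 1000.0, 1e-9, "x")
payload = serialize_index(idx)
restored = deserialize_index(payload)
```

In this sketch the payload stays a few dozen bytes regardless of the step, and an environment that lacks the index type gets `None` back instead of an error, leaving the caller free to fall back to a default index.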

Describe alternatives you've considered

If you are writing a custom index, you could write your own loading function, but that would not solve the coord-bomb behavior and would place more friction on the user.

Additional context

No response
