zarr not respecting chunk size from xarray and reverting to original chunk size

Question

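I am opening a zarr file, rechunking it, and then writing it back out to a different zarr store. However, when I open the new store, it does not respect the chunk size I wrote. Here is the code and the output from jupyter. Any idea what I'm doing wrong here?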

import xarray as xr

bathy_ds = xr.open_zarr('data/bathy_store')
bathy_ds.elevation

bathy_ds.chunk(5000).elevation

bathy_ds.chunk(5000).to_zarr('data/elevation_store')
new_ds = xr.open_zarr('data/elevation_store')
new_ds.elevation

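It reverts to the original chunking, as if I'm not completely overwriting it or am failing to change some other setting that needs changing.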

Answer 1

This seems to be a known issue, and there's a fair bit of discussion going on within the issue's thread and a recently merged PR.

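Basically, the dataset carries its original chunking around in the .encoding attribute. So when you call the second write operation, the chunks defined in ds[var].encoding['chunks'] (if any) are used to write var to zarr, regardless of how the data is currently chunked in memory.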
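To check which variables still carry a stored chunk layout, one can look at .encoding directly. The following is a minimal sketch: the helper name stored_chunk_encoding is hypothetical, and the small in-memory dataset with a hand-set encoding only stands in for a dataset opened from a zarr store.

```python
import numpy as np
import xarray as xr


def stored_chunk_encoding(ds):
    """Map each variable name to the 'chunks' entry in its encoding (None if absent)."""
    return {name: var.encoding.get("chunks") for name, var in ds.variables.items()}


# In-memory stand-in for a dataset opened from a zarr store.
ds = xr.Dataset({"elevation": (("y", "x"), np.zeros((4, 6)))})
# Assumption: set by hand here to mimic the chunk encoding recorded on open.
ds["elevation"].encoding["chunks"] = (2, 3)

report = stored_chunk_encoding(ds)
print(report)
```

Any variable that reports a tuple here rather than None has a stored chunk layout that may be picked up on the next write.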

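According to the conversation in the GH issue, the best workaround for now is to manually delete the chunk encoding of the affected variables: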

for var in ds:
    del ds[var].encoding['chunks']
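The loop above iterates over data variables only, and del raises a KeyError for any variable that has no 'chunks' entry. A slightly more defensive variant might look like the sketch below. The helper name drop_chunk_encoding is hypothetical, and the dataset is again an in-memory stand-in with the encoding set by hand rather than one read from a real store.

```python
import numpy as np
import xarray as xr


def drop_chunk_encoding(ds):
    """Remove the stored 'chunks' entry from every variable's encoding, in place.

    Covers coordinate variables as well as data variables, and silently skips
    variables that have no 'chunks' entry.
    """
    for variable in ds.variables.values():
        variable.encoding.pop("chunks", None)
    return ds


# In-memory stand-in for a dataset opened from a zarr store.
ds = xr.Dataset(
    {"elevation": (("y", "x"), np.zeros((4, 6)))},
    coords={"y": np.arange(4), "x": np.arange(6)},
)
# Assumption: set by hand here to mimic the encoding recorded on open.
ds["elevation"].encoding["chunks"] = (2, 3)
ds["elevation"].encoding["dtype"] = np.dtype("float64")

drop_chunk_encoding(ds)
print(ds["elevation"].encoding)
```

Only the 'chunks' key is removed, so other encoding entries such as the dtype stay as they were.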

Note, however, that this seems to be an evolving situation, so it is worth checking on progress and adjusting the final solution accordingly.

Here's a small example that shows the problem and the solution:

import xarray as xr

# load data and write it with the initial chunking
x = xr.tutorial.load_dataset("air_temperature")
x.chunk({"time":500, "lat":-1, "lon":-1}).to_zarr("zarr1.zarr")

# display initial chunking
xr.open_zarr("zarr1.zarr/").air

# rechunk
y = xr.open_zarr("zarr1.zarr/").chunk({"time": -1})

# display
y.air

# write without modifying .encoding
y.to_zarr("zarr2.zarr")

# display
xr.open_zarr("zarr2.zarr/").air

# delete the chunk encoding and store
del y.air.encoding['chunks']
y.to_zarr("zarr3.zarr")

# display
xr.open_zarr("zarr3.zarr/").air


python

python-xarray

zarr