How to downsample xarray dataset using groupby?

Question

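I want to downsample an xarray dataset based on a particular group, so I am using groupby to select each group and then taking 10% of the samples within each group. I'm using the code below, but I get IndexError: index 1330 is out of bounds for axis 0 with size 1330. This suggests that my function is returning an empty array, but subset definitely has nonzero dimensions.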

I was using squeeze=True, which I thought would allow for new dimensions according to the GroupBy documentation, but that didn't help, so I changed it to squeeze=False.

Any idea what could be happening? Thanks!

# Set random seed for reproducibility
np.random.seed(0)

def select_random_cell_subset(x):
    size = int(0.1 * len(x.cell))
    random_cells = sorted(np.random.choice(x.cell, size=size, replace=False))
    print('number of random cells:', len(random_cells))
    print('\tsome random cells:', random_cells[:5])
    subset = x.sel(cell=random_cells)
    print('subset:', subset)
    return subset

# squeeze=False because the final dataset is smaller than the original
ds_subset = ds.groupby('group', squeeze=True).apply(select_random_cell_subset)
ds_subset

Here is the error:

---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
<ipython-input-44-39c7803e9e40> in <module>()
     12 
     13 # squeeze=False because the final dataset is smaller than the original
---> 14 ds_subset = ds.groupby('group', squeeze=True).apply(select_random_cell_subset)
     15 ds_subset

~/anaconda3/envs/cshl-sca-2017/lib/python3.6/site-packages/xarray/core/groupby.py in apply(self, func, **kwargs)
    615         kwargs.pop('shortcut', None)  # ignore shortcut if set (for now)
    616         applied = (func(ds, **kwargs) for ds in self._iter_grouped())
--> 617         return self._combine(applied)
    618 
    619     def _combine(self, applied):

~/anaconda3/envs/cshl-sca-2017/lib/python3.6/site-packages/xarray/core/groupby.py in _combine(self, applied)
    622         coord, dim, positions = self._infer_concat_args(applied_example)
    623         combined = concat(applied, dim)
--> 624         combined = _maybe_reorder(combined, dim, positions)
    625         if coord is not None:
    626             combined[coord.name] = coord

~/anaconda3/envs/cshl-sca-2017/lib/python3.6/site-packages/xarray/core/groupby.py in _maybe_reorder(xarray_obj, dim, positions)
    443         return xarray_obj
    444     else:
--> 445         return xarray_obj[{dim: order}]
    446 
    447 

~/anaconda3/envs/cshl-sca-2017/lib/python3.6/site-packages/xarray/core/dataset.py in __getitem__(self, key)
    716         """
    717         if utils.is_dict_like(key):
--> 718             return self.isel(**key)
    719 
    720         if hashable(key):

~/anaconda3/envs/cshl-sca-2017/lib/python3.6/site-packages/xarray/core/dataset.py in isel(self, drop, **indexers)
   1141         for name, var in iteritems(self._variables):
   1142             var_indexers = dict((k, v) for k, v in indexers if k in var.dims)
-> 1143             new_var = var.isel(**var_indexers)
   1144             if not (drop and name in var_indexers):
   1145                 variables[name] = new_var

~/anaconda3/envs/cshl-sca-2017/lib/python3.6/site-packages/xarray/core/variable.py in isel(self, **indexers)
    568             if dim in indexers:
    569                 key[i] = indexers[dim]
--> 570         return self[tuple(key)]
    571 
    572     def squeeze(self, dim=None):

~/anaconda3/envs/cshl-sca-2017/lib/python3.6/site-packages/xarray/core/variable.py in __getitem__(self, key)
    398         dims = tuple(dim for k, dim in zip(key, self.dims)
    399                      if not isinstance(k, integer_types))
--> 400         values = self._indexable_data[key]
    401         # orthogonal indexing should ensure the dimensionality is consistent
    402         if hasattr(values, 'ndim'):

~/anaconda3/envs/cshl-sca-2017/lib/python3.6/site-packages/xarray/core/indexing.py in __getitem__(self, key)
    476     def __getitem__(self, key):
    477         key = self._convert_key(key)
--> 478         return self._ensure_ndarray(self.array[key])
    479 
    480     def __setitem__(self, key, value):

IndexError: index 1330 is out of bounds for axis 0 with size 1330

Answer 1

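This is a totally sensible thing to do, but sadly it doesn't work yet. Xarray uses some heuristics to decide whether an apply operation is of the reduce type or the transform type, and in this case we mistakenly identify the grouped operation as a "transform" because the output reuses the original dimension name. I just filed a bug report, but unfortunately the fix in xarray is going to be somewhat involved.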
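
To make the failure mode concrete, here is a simplified NumPy sketch loosely modeled on the `_maybe_reorder` step visible in the traceback. It is not xarray's actual code: the helper `inverse_permutation` and the toy arrays are illustrative assumptions. The idea is that a "transform" is expected to return as many elements as it received, so the reordering index is built from the original group positions, while the combined result here is shorter than the original.

```python
import numpy as np

def inverse_permutation(indices):
    # Illustrative helper: inv[indices[i]] = i
    inv = np.empty(len(indices), dtype=int)
    inv[indices] = np.arange(len(indices))
    return inv

# Ten cells, alternating between two groups
groups = np.array([0, 1, 0, 1, 0, 1, 0, 1, 0, 1])
cells = np.arange(len(groups))

# Original positions of each group's members along the cell dimension
positions = [np.flatnonzero(groups == g) for g in np.unique(groups)]

# Each group returns only a subset of its cells (here the first 2 of 5),
# so the concatenated result has 4 elements instead of 10
combined = np.concatenate([cells[p][:2] for p in positions])

# The reordering index still covers all 10 original positions
order = inverse_permutation(np.concatenate(positions))

try:
    combined[order]
    message = None
except IndexError as err:
    # Same kind of error as in the question
    message = str(err)

print(message)
```

Under these assumptions, `order` contains indices up to 9 while `combined` has only 4 elements, which is the same mismatch the question's IndexError reports.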

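Probably the simplest workaround is to have the applied function return a boolean DataArray indicating the positions to keep. You can then use an indexing operation to select from the original object.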

Answer 2

Here is how I implemented it. As @shoyer suggested above, I returned a boolean xarray.DataArray for each group, then used that boolean array to subset my data.

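import numpy as np
import xarray as xr

# Set random seed for reproducibility
np.random.seed(0)

def select_random_cell_subset(x, threshold=0.1):
    # Boolean mask over this group's cells: each cell is kept independently
    # with probability `threshold`, so roughly 10% of each group survives
    random_bools = xr.DataArray(np.random.uniform(size=len(x.cell)) <= threshold,
                                coords=dict(cell=x.cell), dims=['cell'])
    return random_bools

# Build the mask group by group, then index the original dataset with it
subset_bools = ds.groupby('group').apply(select_random_cell_subset,
                                         threshold=0.1)
ds_subset = ds.sel(cell=subset_bools)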
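
Note that the uniform-threshold mask keeps each cell independently, so the retained fraction per group is only approximately 10%. If exactly int(0.1 * n) cells per group are wanted, as in the question, a mask along the following lines might work. This is a minimal sketch using only NumPy: the function name `per_group_mask`, its parameters, and the toy group labels are assumptions for illustration, not part of xarray.

```python
import numpy as np

def per_group_mask(groups, fraction=0.1, seed=0):
    """Boolean mask keeping int(fraction * n) randomly chosen members of each group."""
    rng = np.random.default_rng(seed)
    groups = np.asarray(groups)
    mask = np.zeros(groups.shape[0], dtype=bool)
    for label in np.unique(groups):
        # Positions of this group's members in the original array
        members = np.flatnonzero(groups == label)
        size = int(fraction * len(members))
        # Sample without replacement so no position is picked twice
        keep = rng.choice(members, size=size, replace=False)
        mask[keep] = True
    return mask

# Toy group labels: 50 cells in "a", 30 in "b", 20 in "c"
groups = np.repeat(["a", "b", "c"], [50, 30, 20])
mask = per_group_mask(groups, fraction=0.1)
print(mask.sum())
```

Assuming `ds` has a `group` coordinate along `cell`, something like `ds.isel(cell=per_group_mask(ds['group'].values))` should then select from the original dataset without going through groupby at all.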

python

python-xarray

xarray