How to use apply_ufunc with numpy.digitize for each image along time dimension of xarray.DataArray?

For clarity, I have substantially rewritten my earlier question. Following Ryan's suggestion on a separate channel, numpy.digitize looks like the right tool for my goal.

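I have an xarray.DataArray with dimensions x, y, and time. I'm trying to work out what values I should pass for the 'input_core_dims' and 'output_core_dims' arguments of apply_ufunc in order to apply numpy.digitize to each image in the time series.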

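Intuitively, I want the output dimensions to be ['time', 'x', 'y']. I think the input core dimensions should be x and y, because I want to broadcast numpy.digitize along the time dimension. However, this doesn't work. Applying numpy.digitize directly to the first numpy array in my time series gives the correct result: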

[84]

blues
<xarray.DataArray 'reflectance' (time: 44, y: 1082, x: 1084)>
dask.array<shape=(44, 1082, 1084), dtype=uint16, chunksize=(44, 1082, 1084)>
Coordinates:
    band     int64 1
  * y        (y) float64 9.705e+05 9.705e+05 9.705e+05 ... 9.673e+05 9.672e+05
  * x        (x) float64 4.889e+05 4.889e+05 4.889e+05 ... 4.922e+05 4.922e+05
  * time     (time) datetime64[ns] 2018-10-12 2018-10-16 ... 2019-05-26
Attributes:
    transform:   (3.0, 0.0, 488907.0, 0.0, -3.0, 970494.0)
    crs:         +init=epsg:32630
    res:         (3.0, 3.0)
    is_tiled:    1
    nodatavals:  (1.0, 1.0, 1.0, 1.0)
    scales:      (1.0, 1.0, 1.0, 1.0)
    offsets:     (0.0, 0.0, 0.0, 0.0)

[79]
#correct result
np.digitize(np.array(blues[0]), bin_arr)
array([[14, 15, 15, ..., 16, 17, 16],
       [14, 13, 14, ..., 16, 16, 15],
       [15, 14, 15, ..., 16, 16, 15],
       ...,
       [16, 18, 18, ..., 15, 16, 15],
       [17, 18, 18, ..., 16, 17, 16],
       [17, 17, 17, ..., 17, 18, 17]])

But my understanding of apply_ufunc is incorrect. Changing input_core_dims to [['x','y']] or [['time']] does not produce the correct digitized result:

bin_arr = np.linspace(configs.rmin, configs.rmax, 50)
blues = t_series['reflectance'].sel(band=1).chunk({'time':-1})
result = xr.apply_ufunc(partial(np.digitize, bins=bin_arr), blues, input_core_dims=[['time']], dask="parallelized", output_dtypes=[blues.dtype])

#wrong values, correct shape
np.array(result)[0]

array([[14, 16, 15, ..., 48, 18, 15],
       [15, 16, 16, ..., 49, 18, 15],
       [15, 16, 16, ..., 49, 18, 14],
       ...,
       [16, 21, 17, ..., 50, 19, 15],
       [17, 21, 17, ..., 50, 19, 16],
       [16, 21, 18, ..., 50, 20, 17]])
bin_arr = np.linspace(configs.rmin, configs.rmax, 50)
blues = t_series['reflectance'].sel(band=1).chunk({'time':-1})
result = xr.apply_ufunc(partial(np.digitize, bins=bin_arr), blues, input_core_dims=[['x','y']], dask="parallelized", output_dtypes=[blues.dtype])


#wrong values, correct shape
np.array(result)[0]

array([[14, 14, 15, ..., 16, 17, 17],
       [15, 13, 14, ..., 18, 18, 17],
       [15, 14, 15, ..., 18, 18, 17],
       ...,
       [16, 16, 16, ..., 15, 16, 17],
       [17, 16, 16, ..., 16, 17, 18],
       [16, 15, 15, ..., 15, 16, 17]])

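Each of these results has the correct shape but wrong values, which suggests that digitize is being applied along the wrong axis and the result is then reshaped to match the input.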

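It is also odd that the result of apply_ufunc, when displayed as an xarray object, has dropped the input core dimension. Internally, though, the dimension is still there once you convert the result to a numpy array: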

[85]

result
<xarray.DataArray 'reflectance' (y: 1082, x: 1084)>
dask.array<shape=(1082, 1084), dtype=uint16, chunksize=(1082, 1084)>
Coordinates:
    band     int64 1
  * y        (y) float64 9.705e+05 9.705e+05 9.705e+05 ... 9.673e+05 9.672e+05
  * x        (x) float64 4.889e+05 4.889e+05 4.889e+05 ... 4.922e+05 4.922e+05

[87]
# the shape of the xarray and numpy array do not match after apply_ufunc
np.array(result).shape
(1082, 1084, 44) 

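In addition, when I try to correct this by setting the output_core_dims argument to [['time', 'x', 'y']], I get an error. It looks like a dimension cannot be both an input core dimension and an output core dimension: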

[67]

bin_arr = np.linspace(configs.rmin, configs.rmax, 50)
blues = t_series['reflectance'].sel(band=1).chunk({'time':-1})
result = xr.apply_ufunc(partial(np.digitize, bins=bin_arr), blues, input_core_dims=[['time']], output_core_dims=[['time','x','y']], dask="parallelized", output_dtypes=[blues.dtype])
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
 in 
      5 bin_arr = np.linspace(configs.rmin, configs.rmax, 50)
      6 blues = t_series['reflectance'].sel(band=1).chunk({'time':-1})
----> 7 result = xr.apply_ufunc(partial(np.digitize, bins=bin_arr), blues, input_core_dims=[['time']], output_core_dims=[['time','x','y']], dask="parallelized", output_dtypes=[blues.dtype])

~/miniconda3/envs/pyatsa/lib/python3.7/site-packages/xarray/core/computation.py in apply_ufunc(func, input_core_dims, output_core_dims, exclude_dims, vectorize, join, dataset_join, dataset_fill_value, keep_attrs, kwargs, dask, output_dtypes, output_sizes, *args)
    967                                      join=join,
    968                                      exclude_dims=exclude_dims,
--> 969                                      keep_attrs=keep_attrs)
    970     elif any(isinstance(a, Variable) for a in args):
    971         return variables_vfunc(*args)

~/miniconda3/envs/pyatsa/lib/python3.7/site-packages/xarray/core/computation.py in apply_dataarray_vfunc(func, signature, join, exclude_dims, keep_attrs, *args)
    215 
    216     data_vars = [getattr(a, 'variable', a) for a in args]
--> 217     result_var = func(*data_vars)
    218 
    219     if signature.num_outputs > 1:

~/miniconda3/envs/pyatsa/lib/python3.7/site-packages/xarray/core/computation.py in apply_variable_ufunc(func, signature, exclude_dims, dask, output_dtypes, output_sizes, keep_attrs, *args)
    539                   if isinstance(arg, Variable)
    540                   else arg
--> 541                   for arg, core_dims in zip(args, signature.input_core_dims)]
    542 
    543     if any(isinstance(array, dask_array_type) for array in input_data):

~/miniconda3/envs/pyatsa/lib/python3.7/site-packages/xarray/core/computation.py in (.0)
    539                   if isinstance(arg, Variable)
    540                   else arg
--> 541                   for arg, core_dims in zip(args, signature.input_core_dims)]
    542 
    543     if any(isinstance(array, dask_array_type) for array in input_data):

~/miniconda3/envs/pyatsa/lib/python3.7/site-packages/xarray/core/computation.py in broadcast_compat_data(variable, broadcast_dims, core_dims)
    493                          'dimensions %r on an input variable: these are core '
    494                          'dimensions on other input or output variables'
--> 495                          % unexpected_dims)
    496 
    497     # for consistency with numpy, keep broadcast dimensions to the left

ValueError: operand to apply_ufunc encountered unexpected dimensions ['y', 'x'] on an input variable: these are core dimensions on other input or output variables

Any help is greatly appreciated. I would like to understand how I am misusing the input_core_dims and output_core_dims arguments.

This solution no longer applies to the question as it has since been edited!

You may want to consider the new xhistogram package.

Xhistogram makes it easier to calculate flexible, complex histograms with multi-dimensional data. It integrates (optionally) with Dask, in order to scale up to very large datasets and with Xarray, in order to consume and produce labelled, annotated data structures. It is useful for a wide range of scientific tasks.

It is designed to solve exactly the problem you are facing.

from xhistogram.xarray import histogram 
import numpy as np
import xarray as xr

# create example image timeseries
ny, nx = 100, 100
nt = 44
data_arr = xr.DataArray(np.random.randn(nt,ny,nx),
                        dims=['time', 'y', 'x'],
                        name='blue reflectance')

# calculate histogram over spatial dimensions
rmin, rmax, nbins = -4, 4, 50
bin_arr = np.linspace(rmin, rmax, nbins)
histogram(data_arr, bins=[bin_arr], dim=['x','y'])

The output looks like this:

<xarray.DataArray 'histogram_blue reflectance' (time: 44, blue reflectance_bin: 49)>
array([[0, 0, 3, ..., 1, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 3, 0, 0],
       ...,
       [0, 0, 1, ..., 1, 0, 0],
       [0, 1, 3, ..., 0, 1, 1],
       [0, 0, 3, ..., 2, 0, 1]])
Coordinates:
  * blue reflectance_bin  (blue reflectance_bin) float64 -3.918 -3.755 ... 3.918
Dimensions without coordinates: time
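
For intuition about what that call computes, here is a minimal plain-NumPy sketch of a per-image histogram over the spatial dimensions. It is not how xhistogram is implemented, and the names `images`, `bin_edges`, and `counts` are made up for illustration. It loops over time and counts pixels per bin for each image:

```python
import numpy as np

rng = np.random.default_rng(0)
images = rng.standard_normal((44, 100, 100))  # hypothetical data, axes (time, y, x)
bin_edges = np.linspace(-4, 4, 50)            # 50 edges define 49 bins

# One histogram per time step, counting over every pixel of that image.
counts = np.stack([np.histogram(img, bins=bin_edges)[0] for img in images])

print(counts.shape)  # one row per time step, one column per bin
```

The result has one row per time step and one column per bin, which matches the (time, bin) layout of the xhistogram output above.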

You want to apply digitize pointwise. This is the simplest use case for apply_ufunc. No special arguments are needed.

Numpy version

import numpy as np
import xarray as xr

ny, nx = 100, 100
nt = 44
data = xr.DataArray(np.random.randn(nt,ny,nx),
                        dims=['time', 'y', 'x'],
                        name='blue reflectance')

rmin, rmax, nbins = -4, 4, 50
bins = np.linspace(rmin, rmax, nbins)

data_digitized = xr.apply_ufunc(np.digitize, data, bins)

This returns a DataArray like:

<xarray.DataArray 'blue reflectance' (time: 44, y: 100, x: 100)>
array([[[34, 17, ..., 27, 15],
         ....
        [21, 24, ..., 23, 29]]])
Dimensions without coordinates: time, y, x

The values are bin indices, following the convention described in the numpy.digitize documentation.
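
As a sanity check, the sketch below compares the pointwise call against digitizing each image separately. It also shows what declaring 'time' as a core dimension does: xarray moves core dimensions to the last axis before calling the function, so indexing the raw result with `[0]` no longer selects the first image. The small array sizes and the variable names are my own choices, and `output_core_dims=[["time"]]` is added here so that the output dimensions are declared consistently.

```python
import numpy as np
import xarray as xr

rng = np.random.default_rng(0)
data = xr.DataArray(rng.standard_normal((4, 5, 6)), dims=["time", "y", "x"])
bins = np.linspace(-4, 4, 50)

# Pointwise application: no core dimensions, dims stay (time, y, x).
pointwise = xr.apply_ufunc(np.digitize, data, bins)

# Reference: digitize each image on its own and stack along time.
per_image = np.stack(
    [np.digitize(data.values[t], bins) for t in range(data.sizes["time"])]
)
same_as_per_image = np.array_equal(pointwise.values, per_image)

# Declaring 'time' as a core dimension moves it to the last axis.
moved = xr.apply_ufunc(
    np.digitize,
    data,
    kwargs={"bins": bins},
    input_core_dims=[["time"]],
    output_core_dims=[["time"]],
)

print(same_as_per_image)
print(moved.dims)
```

Since digitize is elementwise, `moved` holds the same bin indices as `pointwise`, only with the axes in a different order; `moved.transpose("time", "y", "x")` restores the original layout.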

Dask version

To make this run lazily on dask arrays, you have two options:

# create chunked dask version of data
data_chunked = data.chunk({'time': 1})

# use dask's version of digitize
import dask.array as da
xr.apply_ufunc(da.digitize, data_chunked, bins, dask='allowed')

# use xarray's built-in `parallelized` option on the numpy function
# (I needed to define a wrapper function to make this work,
# but I don't fully understand why.)
def wrap_digitize(data):
    return np.digitize(data, bins)
xr.apply_ufunc(wrap_digitize, data_chunked,
               dask='parallelized', output_dtypes=['i8'])
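
A possible alternative to the wrapper function is to pass the bins through apply_ufunc's `kwargs` argument, so that only the data array is treated as an array input. The sketch below uses a small numpy-backed array so that it runs without dask installed. I would expect the same call to stay lazy on `data_chunked`, but I have not verified that here.

```python
import numpy as np
import xarray as xr

data = xr.DataArray(np.random.randn(4, 10, 10), dims=["time", "y", "x"])
bins = np.linspace(-4, 4, 50)

digitized = xr.apply_ufunc(
    np.digitize,
    data,                   # for dask-backed input, call .chunk({"time": 1}) first
    kwargs={"bins": bins},  # forwarded to np.digitize as a keyword argument
    dask="parallelized",    # only takes effect when the input is dask-backed
    output_dtypes=[np.int64],
)

print(digitized.dims)
```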