How to use apply_ufunc with numpy.digitize for each image along time dimension of xarray.DataArray?
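I've substantially rewritten my earlier question for clarity. Following Ryan's suggestion on a separate channel, numpy.digitize looks like the right tool for my goal.
I have an xarray.DataArray with dimensions x, y, and time. I'm trying to work out what values to pass to the 'input_core_dims' and 'output_core_dims' arguments of apply_ufunc so that numpy.digitize is applied to each image in the time series.
Intuitively, I want the output dimensions to be ['time', 'x', 'y']. I thought the input core dimensions should be x and y, since I want to broadcast numpy.digitize along the time dimension. However, this doesn't work. Applying numpy.digitize to the first numpy array in my time series gives the correct result: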
[84]
blues
<xarray.DataArray 'reflectance' (time: 44, y: 1082, x: 1084)>
dask.array<shape=(44, 1082, 1084), dtype=uint16, chunksize=(44, 1082, 1084)>
Coordinates:
band int64 1
* y (y) float64 9.705e+05 9.705e+05 9.705e+05 ... 9.673e+05 9.672e+05
* x (x) float64 4.889e+05 4.889e+05 4.889e+05 ... 4.922e+05 4.922e+05
* time (time) datetime64[ns] 2018-10-12 2018-10-16 ... 2019-05-26
Attributes:
transform: (3.0, 0.0, 488907.0, 0.0, -3.0, 970494.0)
crs: +init=epsg:32630
res: (3.0, 3.0)
is_tiled: 1
nodatavals: (1.0, 1.0, 1.0, 1.0)
scales: (1.0, 1.0, 1.0, 1.0)
offsets: (0.0, 0.0, 0.0, 0.0)
[79]
#correct result
np.digitize(np.array(blues[0]), bin_arr)
array([[14, 15, 15, ..., 16, 17, 16],
[14, 13, 14, ..., 16, 16, 15],
[15, 14, 15, ..., 16, 16, 15],
...,
[16, 18, 18, ..., 15, 16, 15],
[17, 18, 18, ..., 16, 17, 16],
[17, 17, 17, ..., 17, 18, 17]])
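As a sanity check on toy data (the array and bin edges below are made up, not the reflectance series above), numpy.digitize works element by element, so calling it on the whole stack should match calling it on each image in turn:

```python
import numpy as np

# Toy stand-in for the image stack, laid out as (time, y, x). Values are invented.
rng = np.random.default_rng(0)
stack = rng.uniform(0.0, 1.0, size=(3, 4, 5))
toy_bins = np.linspace(0.0, 1.0, 11)

# Digitize the whole stack in one call ...
all_at_once = np.digitize(stack, toy_bins)

# ... and one image at a time, restacked along the time axis.
per_image = np.stack([np.digitize(img, toy_bins) for img in stack])

print(all_at_once.shape)                       # same shape as the input stack
print(np.array_equal(all_at_once, per_image))  # whether both routes agree
```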
But my understanding of apply_ufunc is incorrect. Changing input_core_dims to [['x','y']] or [['time']] does not produce the correct digitized result:
bin_arr = np.linspace(configs.rmin, configs.rmax, 50)
blues = t_series['reflectance'].sel(band=1).chunk({'time':-1})
result = xr.apply_ufunc(partial(np.digitize, bins=bin_arr), blues, input_core_dims=[['time']], dask="parallelized", output_dtypes=[blues.dtype])
#wrong values, correct shape
np.array(result)[0]
array([[14, 16, 15, ..., 48, 18, 15],
[15, 16, 16, ..., 49, 18, 15],
[15, 16, 16, ..., 49, 18, 14],
...,
[16, 21, 17, ..., 50, 19, 15],
[17, 21, 17, ..., 50, 19, 16],
[16, 21, 18, ..., 50, 20, 17]])
bin_arr = np.linspace(configs.rmin, configs.rmax, 50)
blues = t_series['reflectance'].sel(band=1).chunk({'time':-1})
result = xr.apply_ufunc(partial(np.digitize, bins=bin_arr), blues, input_core_dims=[['x','y']], dask="parallelized", output_dtypes=[blues.dtype])
#wrong values, correct shape
np.array(result)[0]
array([[14, 14, 15, ..., 16, 17, 17],
[15, 13, 14, ..., 18, 18, 17],
[15, 14, 15, ..., 18, 18, 17],
...,
[16, 16, 16, ..., 15, 16, 17],
[17, 16, 16, ..., 16, 17, 18],
[16, 15, 15, ..., 15, 16, 17]])
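Each of these results has the correct shape but wrong values, which suggests the digitize function is being applied along the wrong axis and the result is then reshaped to the input's shape.
It's also odd that the result of apply_ufunc drops the input_core_dim when displayed as an xarray object, yet internally the dimension is still there when you convert it to a numpy array: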
[85]
result
<xarray.DataArray 'reflectance' (y: 1082, x: 1084)>
dask.array<shape=(1082, 1084), dtype=uint16, chunksize=(1082, 1084)>
Coordinates:
band int64 1
* y (y) float64 9.705e+05 9.705e+05 9.705e+05 ... 9.673e+05 9.672e+05
* x (x) float64 4.889e+05 4.889e+05 4.889e+05 ... 4.922e+05 4.922e+05
[87]
# the shape of the xarray and numpy array do not match after apply_ufunc
np.array(result).shape
(1082, 1084, 44)
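A small sketch on toy data may help explain the mismatch. As far as I understand it, apply_ufunc moves every input core dimension to the last axis before calling the function, and it assumes the function consumes that dimension unless output_core_dims says otherwise. The recording function and the array below are hypothetical, made up for illustration:

```python
import numpy as np
import xarray as xr

seen = {}

def record_shape(block):
    # Hypothetical helper: remember the shape apply_ufunc hands over,
    # then return the block unchanged (shape-preserving, like np.digitize).
    seen["shape"] = block.shape
    return block

toy = xr.DataArray(np.zeros((4, 3, 2)), dims=["time", "y", "x"])

out = xr.apply_ufunc(
    record_shape,
    toy,
    input_core_dims=[["time"]],   # 'time' is moved to the last axis
    output_core_dims=[["time"]],  # declare that the function keeps it
)

print(seen["shape"])  # shape the function actually received
print(out.dims)       # labelled dimensions of the result
```

Here the function receives 'time' as its last axis, and because 'time' is also declared as an output core dimension, the labelled result keeps it (in last position) rather than dropping it.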
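In addition, when I try to fix this by specifying the output_core_dims argument as [['time', 'x', 'y']], I get an error; it looks like a dimension cannot be both an input core dimension and an output core dimension: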
[67]
bin_arr = np.linspace(configs.rmin, configs.rmax, 50)
blues = t_series['reflectance'].sel(band=1).chunk({'time':-1})
result = xr.apply_ufunc(partial(np.digitize, bins=bin_arr), blues, input_core_dims=[['time']], output_core_dims=[['time','x','y']], dask="parallelized", output_dtypes=[blues.dtype])
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
in
5 bin_arr = np.linspace(configs.rmin, configs.rmax, 50)
6 blues = t_series['reflectance'].sel(band=1).chunk({'time':-1})
----> 7 result = xr.apply_ufunc(partial(np.digitize, bins=bin_arr), blues, input_core_dims=[['time']], output_core_dims=[['time','x','y']], dask="parallelized", output_dtypes=[blues.dtype])
~/miniconda3/envs/pyatsa/lib/python3.7/site-packages/xarray/core/computation.py in apply_ufunc(func, input_core_dims, output_core_dims, exclude_dims, vectorize, join, dataset_join, dataset_fill_value, keep_attrs, kwargs, dask, output_dtypes, output_sizes, *args)
967 join=join,
968 exclude_dims=exclude_dims,
--> 969 keep_attrs=keep_attrs)
970 elif any(isinstance(a, Variable) for a in args):
971 return variables_vfunc(*args)
~/miniconda3/envs/pyatsa/lib/python3.7/site-packages/xarray/core/computation.py in apply_dataarray_vfunc(func, signature, join, exclude_dims, keep_attrs, *args)
215
216 data_vars = [getattr(a, 'variable', a) for a in args]
--> 217 result_var = func(*data_vars)
218
219 if signature.num_outputs > 1:
~/miniconda3/envs/pyatsa/lib/python3.7/site-packages/xarray/core/computation.py in apply_variable_ufunc(func, signature, exclude_dims, dask, output_dtypes, output_sizes, keep_attrs, *args)
539 if isinstance(arg, Variable)
540 else arg
--> 541 for arg, core_dims in zip(args, signature.input_core_dims)]
542
543 if any(isinstance(array, dask_array_type) for array in input_data):
~/miniconda3/envs/pyatsa/lib/python3.7/site-packages/xarray/core/computation.py in (.0)
539 if isinstance(arg, Variable)
540 else arg
--> 541 for arg, core_dims in zip(args, signature.input_core_dims)]
542
543 if any(isinstance(array, dask_array_type) for array in input_data):
~/miniconda3/envs/pyatsa/lib/python3.7/site-packages/xarray/core/computation.py in broadcast_compat_data(variable, broadcast_dims, core_dims)
493 'dimensions %r on an input variable: these are core '
494 'dimensions on other input or output variables'
--> 495 % unexpected_dims)
496
497 # for consistency with numpy, keep broadcast dimensions to the left
ValueError: operand to apply_ufunc encountered unexpected dimensions ['y', 'x'] on an input variable: these are core dimensions on other input or output variables
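The message appears to say that 'y' and 'x' were declared core on the output while the input still carries them as ordinary (non-core) dimensions. A minimal sketch on toy data, assuming the same names are listed as core on both sides and the bin edges are passed through apply_ufunc's kwargs argument, runs without that error:

```python
import numpy as np
import xarray as xr

# Toy data and bin edges, invented for illustration.
rng = np.random.default_rng(0)
toy = xr.DataArray(rng.uniform(0.0, 1.0, size=(4, 3, 2)),
                   dims=["time", "y", "x"])
toy_bins = np.linspace(0.0, 1.0, 6)

digitized = xr.apply_ufunc(
    np.digitize,
    toy,
    kwargs={"bins": toy_bins},       # forwarded to np.digitize as bins=...
    input_core_dims=[["y", "x"]],    # core on the input ...
    output_core_dims=[["y", "x"]],   # ... and the same names on the output
)

print(digitized.dims)
```

Because numpy.digitize is element-wise, the core dimensions here are optional rather than required; the sketch only shows a consistent way to declare them.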
Any help is greatly appreciated; I'd like to understand how I'm misusing the input_core_dim and output_core_dim arguments.
This solution no longer applies given how the question has been edited!
You may want to consider the new xhistogram package.
Xhistogram makes it easier to calculate flexible, complex histograms with multi-dimensional data. It integrates (optionally) with Dask, in order to scale up to very large datasets and with Xarray, in order to consume and produce labelled, annotated data structures. It is useful for a wide range of scientific tasks.
It is designed to solve exactly the problem you are facing.
from xhistogram.xarray import histogram
import numpy as np
import xarray as xr
# create example image timeseries
ny, nx = 100, 100
nt = 44
data_arr = xr.DataArray(np.random.randn(nt,ny,nx),
dims=['time', 'y', 'x'],
name='blue reflectance')
# calculate histogram over spatial dimensions
rmin, rmax, nbins = -4, 4, 50
bin_arr = np.linspace(rmin, rmax, nbins)
histogram(data_arr, bins=[bin_arr], dim=['x','y'])
The output looks like this:
<xarray.DataArray 'histogram_blue reflectance' (time: 44, blue reflectance_bin: 49)>
array([[0, 0, 3, ..., 1, 0, 0],
[0, 0, 0, ..., 0, 0, 0],
[0, 0, 0, ..., 3, 0, 0],
...,
[0, 0, 1, ..., 1, 0, 0],
[0, 1, 3, ..., 0, 1, 1],
[0, 0, 3, ..., 2, 0, 1]])
Coordinates:
* blue reflectance_bin (blue reflectance_bin) float64 -3.918 -3.755 ... 3.918
Dimensions without coordinates: time
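For comparison, a rough numpy-only sketch of the same per-image histogram (a plain loop over time with np.histogram, not how xhistogram is implemented) gives counts with the same (time, bin) layout:

```python
import numpy as np

# Toy image time series laid out as (time, y, x), mirroring the example above.
rng = np.random.default_rng(0)
images = rng.standard_normal((44, 100, 100))

# 50 bin edges define 49 bins.
bin_edges = np.linspace(-4, 4, 50)

# One histogram per image, stacked along the time axis.
counts = np.stack([np.histogram(img, bins=bin_edges)[0] for img in images])

print(counts.shape)  # (44, 49)
```

Values that fall outside [-4, 4] are not counted, so a row can sum to slightly less than the number of pixels in an image.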
You want to apply digitize point by point. This is the simplest use case for apply_ufunc; no special arguments are needed.
Numpy version
import numpy as np
import xarray as xr
ny, nx = 100, 100
nt = 44
data = xr.DataArray(np.random.randn(nt,ny,nx),
dims=['time', 'y', 'x'],
name='blue reflectance')
rmin, rmax, nbins = -4, 4, 50
bins = np.linspace(rmin, rmax, nbins)
data_digitized = xr.apply_ufunc(np.digitize, data, bins)
This returns a DataArray like
<xarray.DataArray 'blue reflectance' (time: 44, y: 100, x: 100)>
array([[[34, 17, ..., 27, 15],
....
[21, 24, ..., 23, 29]]])
Dimensions without coordinates: time, y, x
The values are bin indices, following the convention described in the numpy.digitize documentation.
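To make that convention concrete, here is a tiny sketch with made-up values. With the default right=False and increasing bins, index i means bins[i-1] <= x < bins[i]; 0 means below the first edge and len(bins) means at or above the last edge:

```python
import numpy as np

# Three edges; values chosen to hit every case.
edges = np.array([0.0, 1.0, 2.0])
values = np.array([-0.5, 0.0, 0.5, 1.0, 2.5])

idx = np.digitize(values, edges)
print(idx)  # [0 1 1 2 3]
```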
Dask version
To make this operate lazily on dask arrays, you have two options:
# create chunked dask version of data
data_chunked = data.chunk({'time': 1})
# use dask's version of digitize
import dask.array as da
xr.apply_ufunc(da.digitize, data_chunked, bins, dask='allowed')
# use xarray's built-in `parallelized` option on the numpy function
# (I needed to define a wrapper function to make this work,
# but I don't fully understand why.)
def wrap_digitize(data):
    return np.digitize(data, bins)
xr.apply_ufunc(wrap_digitize, data_chunked,
               dask='parallelized', output_dtypes=['i8'])
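A possible alternative to the wrapper, sketched under the assumption that dask is installed: apply_ufunc accepts a kwargs dictionary that is forwarded to the function, so the bin edges do not have to be passed as a second array argument. The toy data below is a smaller version of the example above:

```python
import numpy as np
import xarray as xr

# Small toy series; one dask chunk per time step (requires dask).
data = xr.DataArray(np.random.randn(4, 10, 10),
                    dims=["time", "y", "x"],
                    name="blue reflectance")
bins = np.linspace(-4, 4, 50)
data_chunked = data.chunk({"time": 1})

# bins is forwarded as a keyword argument instead of a positional array.
lazy = xr.apply_ufunc(
    np.digitize,
    data_chunked,
    kwargs={"bins": bins},
    dask="parallelized",
    output_dtypes=[np.int64],
)

result = lazy.compute()  # nothing is evaluated until this point
print(np.array_equal(result.values, np.digitize(data.values, bins)))
```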