How to align an xarray DataArray that has a length-1 dimension with a larger array?

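I want to take an xarray dataset whose time dimension has length 1 and simply duplicate the data so that the time dimension grows from 1 to N. What is the most efficient way to do this? I have tried several approaches, such as expand_dims and stack, but none of them seem to do what I want.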

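Ultimately I want to compute moc10_H11 - moc_ctrl_clim and get a result whose time dimension has the same length as that of moc10_H11 (35). Right now, when I do this, the output has a time dimension of length 1.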

For clarity, moc_ctrl_clim is:

Dimensions:  (time: 1, lat_aux_grid: 395, moc_z: 61)
Coordinates:
    time          (time) object 0001-01-01 00:00:00
    lat_aux_grid  (lat_aux_grid) float32 -79.49 -78.95 -78.42 ... 89.47 90.0
    moc_z         (moc_z) float32 0.0 1e+03 ... 5.25e+05 5.5e+05
Data variables:
    MOC           (time, moc_z, lat_aux_grid) float64 dask.array<chunksize=(1, 61, 395), meta=np.ndarray>

and moc10_H11 is:

Dimensions:  (time: 35, lat_aux_grid: 395, moc_z: 61)
Coordinates:
    time          (time) object 0001-01-01 00:00:00
    lat_aux_grid  (lat_aux_grid) float32 -79.49 -78.95 -78.42 ... 89.47 90.0
    moc_z         (moc_z) float32 0.0 1e+03 ... 5.25e+05 5.5e+05
Data variables:
    MOC           (time, moc_z, lat_aux_grid) float64 dask.array<chunksize=(1, 61, 395), meta=np.ndarray>

Short answer: squeeze out the length-1 time dimension so that xarray's automatic broadcasting rules kick in:

da = da.squeeze(dim='time', drop=True)

Now you can combine it with an array that is indexed by time, and the data will be broadcast automatically.
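As a rough sketch of how this applies to the question, the snippet below builds synthetic stand-ins for moc_ctrl_clim and moc10_H11. The constant fill values, the integer time labels, the helper name make_moc and the use of in-memory numpy arrays instead of dask are all assumptions made for illustration; only the dimension names and sizes are taken from the question.

```python
import numpy as np
import xarray as xr

# Coordinates shaped like those in the question (values are illustrative).
lat = np.linspace(-79.49, 90.0, 395).astype("float32")
z = np.linspace(0.0, 5.5e5, 61).astype("float32")


def make_moc(n_time, fill):
    """Hypothetical helper: a Dataset with a constant-valued MOC variable."""
    data = np.full((n_time, 61, 395), fill, dtype="float64")
    return xr.Dataset(
        {"MOC": (("time", "moc_z", "lat_aux_grid"), data)},
        coords={"time": np.arange(n_time), "moc_z": z, "lat_aux_grid": lat},
    )


moc_ctrl_clim = make_moc(1, fill=1.0)   # time: 1
moc10_H11 = make_moc(35, fill=35.0)     # time: 35

# Without squeezing, the two time indexes are intersected,
# so only the single shared time step survives.
naive = moc10_H11 - moc_ctrl_clim

# Squeeze out (and drop) the length-1 time dimension, then subtract.
anomaly = moc10_H11 - moc_ctrl_clim.squeeze(dim="time", drop=True)

print(naive.sizes["time"], anomaly.sizes["time"])
```

Dataset.squeeze takes the same dim and drop arguments as DataArray.squeeze, so the same call works whether you hold the whole dataset or just the MOC variable.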

Explanation

The reason behind this is the difference between numpy's broadcasting, which is based on shape, and xarray's broadcasting, which is based on dimension name.

Numpy broadcasts by shape

From the numpy docs:

When operating on two arrays, NumPy compares their shapes element-wise. It starts with the trailing (i.e. rightmost) dimensions and works its way left. Two dimensions are compatible when

  1. they are equal, or
  2. one of them is 1

For example, if the first dimension lines up, you can perform element-wise addition between a column vector and an array:

In [3]: col_vector = np.ones(shape=(3, 1))

In [4]: col_vector
Out[4]:
array([[1.],
       [1.],
       [1.]])

In [5]: array = np.arange(12).reshape(3, 4)

In [6]: array
Out[6]:
array([[ 0,  1,  2,  3],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11]])

In [7]: col_vector + array
Out[7]:
array([[ 1.,  2.,  3.,  4.],
       [ 5.,  6.,  7.,  8.],
       [ 9., 10., 11., 12.]])

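When col_vector is added to array, numpy recognizes that col_vector has length 1 along axis 1 while array has length 4, so col_vector is broadcast (tiled) along axis 1 to length 4 before the addition.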
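The shape rule can be checked directly. The following is a small sketch, assuming numpy 1.20 or later for np.broadcast_shapes; it reuses col_vector and array from the session above and adds a flat length-3 vector as a counterexample.

```python
import numpy as np

col_vector = np.ones(shape=(3, 1))
array = np.arange(12).reshape(3, 4)

# Shapes (3, 1) and (3, 4) are compatible: the length-1 axis is stretched.
result_shape = np.broadcast_shapes(col_vector.shape, array.shape)

# broadcast_to makes the implicit tiling explicit (as a read-only view).
tiled = np.broadcast_to(col_vector, array.shape)

# A flat vector has shape (3,), which numpy compares against the trailing
# axis of (3, 4). 3 != 4 and neither is 1, so broadcasting fails.
try:
    np.broadcast_shapes((3,), (3, 4))
    flat_ok = True
except ValueError:
    flat_ok = False

print(result_shape, tiled.shape, flat_ok)
```

This is why, in plain numpy, a length-3 vector has to be reshaped to (3, 1) before it can be added along the first axis of a (3, 4) array.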

xarray broadcasts by dimension name

From the xarray docs on computation:

DataArray objects automatically align themselves (“broadcasting” in the numpy parlance) by dimension name instead of axis order. With xarray, you do not need to transpose arrays or insert dimensions of length 1 to get array operations to work, as commonly done in numpy with numpy.reshape() or numpy.newaxis.

In addition, from the xarray docs on automatic alignment:

Xarray enforces alignment between index Coordinates (that is, coordinates with the same name as a dimension, marked by *) on objects used in binary operations. [...] If coordinate values for a dimension are missing on either argument, all matching dimensions must have the same size.

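Adapting the numpy example above requires not only assigning dimension names and coordinates, but also dropping the second dimension from the column vector: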

In [1]: import numpy as np, pandas as pd, xarray as xr

In [2]: vector = xr.DataArray(np.ones(shape=3), dims=['x'], coords=[[0, 1, 2]])

In [3]: vector
Out[3]:
<xarray.DataArray (x: 3)>
array([1., 1., 1.])
Coordinates:
  * x        (x) int64 0 1 2

In [4]: arr = xr.DataArray(
   ...:     np.arange(12).reshape(3, 4),
   ...:     dims=['x', 'time'],
   ...:     coords=[[0, 1, 2], pd.date_range('2020-01-01', periods=4, freq='D')],
   ...: )

In [5]: arr
Out[5]:
<xarray.DataArray (x: 3, time: 4)>
array([[ 0,  1,  2,  3],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11]])
Coordinates:
  * x        (x) int64 0 1 2
  * time     (time) datetime64[ns] 2020-01-01 2020-01-02 2020-01-03 2020-01-04

In [6]: vector + arr
Out[6]:
<xarray.DataArray (x: 3, time: 4)>
array([[ 1.,  2.,  3.,  4.],
       [ 5.,  6.,  7.,  8.],
       [ 9., 10., 11., 12.]])
Coordinates:
  * x        (x) int64 0 1 2
  * time     (time) datetime64[ns] 2020-01-01 2020-01-02 2020-01-03 2020-01-04
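To underline that the matching is by name rather than by axis position, here is a short sketch that reuses vector and arr from the session above. The transposed copy arr_t and the two unlabeled arrays a and b are additions made for illustration.

```python
import numpy as np
import pandas as pd
import xarray as xr

vector = xr.DataArray(np.ones(3), dims=["x"], coords=[[0, 1, 2]])
arr = xr.DataArray(
    np.arange(12).reshape(3, 4),
    dims=["x", "time"],
    coords=[[0, 1, 2], pd.date_range("2020-01-01", periods=4, freq="D")],
)

# Same data with the axes swapped: shape (4, 3), dims ("time", "x").
arr_t = arr.transpose("time", "x")

# xarray matches on the name "x", so the axis order of the
# second operand does not change the answer.
total = vector + arr
total_t = vector + arr_t
same = bool((total == total_t).all())

# Without coordinate values there is nothing to align on, so
# same-named dimensions must already have the same size.
a = xr.DataArray(np.zeros(3), dims=["x"])
b = xr.DataArray(np.zeros(4), dims=["x"])
try:
    a + b
    mismatch_ok = True
except ValueError:
    mismatch_ok = False

print(same, mismatch_ok)
```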

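Broadcasting a length-1 dimension against a longer dimension

In your question, you have an array with length 1 along the time dimension, and you want to broadcast it against another array with a longer time coordinate. In the example above, this is equivalent to having a "vector" with a length-1 time dimension: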

In [7]: vector = xr.DataArray(
   ...:     np.ones(shape=(3, 1)),
   ...:     dims=['x', 'time'],
   ...:     coords=[[0, 1, 2], pd.date_range('2020-01-01', periods=1, freq='D')],
   ...: )

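When this is combined with arr, whose time dimension has length 4, xarray aligns the two time indexes and keeps only their intersection: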

In [8]: vector + arr
Out[8]:
<xarray.DataArray (x: 3, time: 1)>
array([[1.],
       [5.],
       [9.]])
Coordinates:
  * time     (time) datetime64[ns] 2020-01-01
  * x        (x) int64 0 1 2
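The intersection behaviour comes from index alignment: arithmetic between xarray objects uses an inner join on shared indexes by default. The sketch below makes that step explicit with xr.align and also shows what happens when the single timestamp matches nothing in arr. The shifted array is a hypothetical variant added for illustration.

```python
import numpy as np
import pandas as pd
import xarray as xr

vector = xr.DataArray(
    np.ones(shape=(3, 1)),
    dims=["x", "time"],
    coords=[[0, 1, 2], pd.date_range("2020-01-01", periods=1, freq="D")],
)
arr = xr.DataArray(
    np.arange(12).reshape(3, 4),
    dims=["x", "time"],
    coords=[[0, 1, 2], pd.date_range("2020-01-01", periods=4, freq="D")],
)

# The alignment that arithmetic performs implicitly, written out.
v_aligned, a_aligned = xr.align(vector, arr, join="inner")

# Move the single timestamp to a date that arr does not contain.
shifted = vector.assign_coords(
    time=pd.date_range("2021-01-01", periods=1, freq="D")
)
empty = shifted + arr  # no shared timestamps, so nothing is left

print(a_aligned.sizes["time"], empty.sizes["time"])
```

So a length-1 time index is treated as a label to match, not as a dimension to stretch, which is why the result can even be empty.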

By first squeezing out and dropping the time dimension with da.squeeze, the data can be broadcast along time:

In [9]: vector.squeeze('time', drop=True)  + arr
Out[9]:
<xarray.DataArray (x: 3, time: 4)>
array([[ 1.,  2.,  3.,  4.],
       [ 5.,  6.,  7.,  8.],
       [ 9., 10., 11., 12.]])
Coordinates:
  * x        (x) int64 0 1 2
  * time     (time) datetime64[ns] 2020-01-01 2020-01-02 2020-01-03 2020-01-04

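Note that this approach discards the information in the first array's time coordinate and instead assumes that its data applies to every element of time in the second array. If that is what you are after, squeezing and dropping as shown here is the way to go.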
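If you do want the replicated array itself, with the time dimension expanded from 1 to N as the question asks, one possible sketch is to squeeze first and then re-expand against the target's time coordinate. Both broadcast_like and expand_dims are existing DataArray methods. Using them this way is a suggestion rather than something the arithmetic requires, and the variable names are illustrative.

```python
import numpy as np
import pandas as pd
import xarray as xr

vector = xr.DataArray(
    np.ones(shape=(3, 1)),
    dims=["x", "time"],
    coords=[[0, 1, 2], pd.date_range("2020-01-01", periods=1, freq="D")],
)
arr = xr.DataArray(
    np.arange(12).reshape(3, 4),
    dims=["x", "time"],
    coords=[[0, 1, 2], pd.date_range("2020-01-01", periods=4, freq="D")],
)

# Drop the length-1 time dimension together with its coordinate.
squeezed = vector.squeeze("time", drop=True)

# Option 1: broadcast against the target array.
replicated = squeezed.broadcast_like(arr)

# Option 2: add a new time dimension carrying the target's time values.
# expand_dims inserts the new axis first, so transpose to match arr.
expanded = squeezed.expand_dims(time=arr["time"].values).transpose("x", "time")

print(replicated.sizes["time"], expanded.sizes["time"])
```

For plain arithmetic this extra step is unnecessary, since squeezed + arr already broadcasts along time.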