xarray: assigning an individual value to one variable/DataArray ends up assigning it to all variables/DataArrays

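I have a script in which I create a large xarray Dataset filled with np.nan and then assign individual values in a loop using .loc (I also tried positional indexing).

The result looks odd to me.

Here is my minimal reproducible example: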

import xarray as xr
import numpy as np

levels = np.arange(0,3)
simNames = ['9airports_filter0dot7_v22']
airportList = ['Windhoek', 'Atlanta', 'Taipei']

emptyDA = xr.DataArray(np.nan, coords = [simNames, airportList, np.arange(0, 20428), levels], 
                       dims = ['simName', 'airport', 'profnum', 'level'])

ds = xr.Dataset({
    'iasi': emptyDA,
    'IM':   emptyDA,
    'IMS': emptyDA,
    'err': emptyDA,
    'sigma': emptyDA,
    'temp': emptyDA, 
    'dfs': emptyDA, 
    'ocf': emptyDA, 
    'rcf': emptyDA, 
    'time': emptyDA, 
    'surfPres': emptyDA })

ds = ds.assign_coords(time = ds.time) # pass time from variable to coord

ds['dfs'].loc['9airports_filter0dot7_v22', 'Windhoek', 0, 0] = 3

The scalar 3 ends up assigned in every data variable, not just 'dfs':

<xarray.Dataset>
Dimensions:   (simName: 1, airport: 3, profnum: 20428, level: 3)
Coordinates:
  * simName   (simName) <U25 '9airports_filter0dot7_v22'
  * airport   (airport) <U8 'Windhoek' 'Atlanta' 'Taipei'
  * profnum   (profnum) int64 0 1 2 3 4 5 ... 20423 20424 20425 20426 20427
  * level     (level) int64 0 1 2
Data variables:
    iasi      (simName, airport, profnum, level) float64 3.0 nan nan ... nan nan
    IM        (simName, airport, profnum, level) float64 3.0 nan nan ... nan nan
    IMS       (simName, airport, profnum, level) float64 3.0 nan nan ... nan nan
    err       (simName, airport, profnum, level) float64 3.0 nan nan ... nan nan
    sigma     (simName, airport, profnum, level) float64 3.0 nan nan ... nan nan
    temp      (simName, airport, profnum, level) float64 3.0 nan nan ... nan nan
    dfs       (simName, airport, profnum, level) float64 3.0 nan nan ... nan nan
    ocf       (simName, airport, profnum, level) float64 3.0 nan nan ... nan nan
    rcf       (simName, airport, profnum, level) float64 3.0 nan nan ... nan nan
    time      (simName, airport, profnum, level) float64 3.0 nan nan ... nan nan
    surfPres  (simName, airport, profnum, level) float64 3.0 nan nan ... nan nan 

whereas this simpler code works as expected:

import xarray as xr
import numpy as np

ds = xr.Dataset({'var1': (('x', 'y'), [[np.nan, np.nan],[np.nan, np.nan]]), 'var2': (('x', 'y'), [[np.nan, np.nan], [np.nan, np.nan]])})

ds['var1'].loc[0, 0] = 1

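OK, I understand my mistake: emptyDA was not copied for each new variable, so all of them pointed to the same object. Passing emptyDA.copy() instead of emptyDA fixes the problem. I had assumed that creating an xarray object would copy the data. Thanks for your help.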
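
For reference, here is a minimal sketch of that fix applied to the example above. It rests on a few assumptions that are not in the original script: the profnum axis is shrunk to 5 entries to keep it fast, the 'time' variable and the assign_coords step are left out for brevity, and the names `n_profiles` and `var_names` are made up for illustration.

```python
import numpy as np
import xarray as xr

levels = np.arange(0, 3)
sim_names = ["9airports_filter0dot7_v22"]
airports = ["Windhoek", "Atlanta", "Taipei"]
n_profiles = 5  # assumption: shrunk from 20428 so the sketch runs quickly

empty_da = xr.DataArray(
    np.nan,
    coords=[sim_names, airports, np.arange(n_profiles), levels],
    dims=["simName", "airport", "profnum", "level"],
)

# Hypothetical list of variable names ('time' omitted for brevity).
var_names = ["iasi", "IM", "IMS", "err", "sigma", "temp",
             "dfs", "ocf", "rcf", "surfPres"]

# DataArray.copy() is deep by default, so each variable gets its own buffer
# instead of sharing the one behind empty_da.
ds = xr.Dataset({name: empty_da.copy() for name in var_names})

ds["dfs"].loc["9airports_filter0dot7_v22", "Windhoek", 0, 0] = 3

# Only 'dfs' should now hold a non-NaN value.
for name in var_names:
    print(name, int(ds[name].notnull().sum()))
```

With the copies in place, the loop should report one non-NaN value for 'dfs' and none for the other variables.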

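This happens because when you initialize an xarray.Dataset with a dictionary of DataArrays, it makes shallow copies of those DataArrays. Each copy can carry its own metadata, but the underlying numpy array is not copied.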

You can see this behavior with a small example based on your question.

First, I'll create a new numpy array of all NaNs:

In [1]: import xarray as xr, numpy as np, pandas as pd

In [2]: np_arr = np.array([np.nan, np.nan, np.nan, np.nan])

In [3]: np_arr
Out[3]: array([nan, nan, nan, nan])

We can see the object's actual memory address (its id) here:

In [4]: hex(id(np_arr))
Out[4]: '0x1186570f0'

Remember this address - we'll come back to it: '0x1186570f0'

Next, we'll create a DataArray that wraps this numpy array:

In [5]: da = xr.DataArray(np_arr, dims=['x'], coords=[range(4)])

In [6]: da
Out[6]:
<xarray.DataArray (x: 4)>
array([nan, nan, nan, nan])
Coordinates:
  * x        (x) int64 0 1 2 3

The DataArray itself gets a new ID, but its underlying array still points to the same numpy object at '0x1186570f0':

In [7]: hex(id(da))
Out[7]: '0x118668460'

In [8]: hex(id(da.data))
Out[8]: '0x1186570f0'

When you initialize a Dataset with a dictionary of DataArrays, xarray makes shallow copies of the arrays. Note that the DataArray objects now have different addresses:

In [9]: ds = xr.Dataset({'var1': da, 'var2': da})

In [10]: hex(id(ds['var1']))
Out[10]: '0x1186d5340'

In [11]: hex(id(ds['var2']))
Out[11]: '0x1186e0fa0'

This allows each array to have different attributes/metadata:

In [12]: ds['var1'].name
Out[12]: 'var1'

In [13]: ds['var2'].name
Out[13]: 'var2'

However, the data still points to the original numpy address:

In [14]: hex(id(ds['var1'].data))
Out[14]: '0x1186570f0'

In [15]: hex(id(ds['var2'].data))
Out[15]: '0x1186570f0'
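
As a complement to comparing id() values, a sketch like the following checks directly whether two variables share a buffer and shows a write leaking from one to the other. It uses np.shares_memory from NumPy; the names `shared` and `leaked` are only for illustration, and the result assumes the same no-copy behavior shown in the session above.

```python
import numpy as np
import xarray as xr

np_arr = np.full(4, np.nan)
da = xr.DataArray(np_arr, dims=["x"], coords=[range(4)])

# Both variables are built from the same DataArray, without copying.
ds = xr.Dataset({"var1": da, "var2": da})

# True when the two variables are backed by overlapping memory.
shared = np.shares_memory(ds["var1"].values, ds["var2"].values)
print("var1 and var2 share memory:", shared)

# Writing through var1 is then visible through var2 as well.
ds["var1"].loc[0] = 1.0
leaked = float(ds["var2"].values[0])
print("var2[0] after writing to var1:", leaked)
```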

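This is a good thing, because it means xarray won't blow up your memory usage unless you tell it to. But if you do want the data copied, you have to ask for it explicitly.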

You can do this with a deep copy, which is what xarray.DataArray.copy does by default:

In [16]: ds = xr.Dataset({'var1': da.copy(), 'var2': da.copy()})

In [17]: hex(id(ds['var1'].data))
Out[17]: '0x1186b23f0'

In [18]: hex(id(ds['var2'].data))
Out[18]: '0x118660090'
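
To make the difference between the two copy modes explicit, here is a small sketch contrasting copy(deep=False) with the default deep copy. The variable names are illustrative only; `deep` is a documented parameter of xarray.DataArray.copy.

```python
import numpy as np
import xarray as xr

da = xr.DataArray(np.full(4, np.nan), dims=["x"], coords=[range(4)])

shallow = da.copy(deep=False)  # new DataArray object, same underlying buffer
deep = da.copy()               # deep=True is the default: data is duplicated

shallow_shares = np.shares_memory(shallow.values, da.values)
deep_shares = np.shares_memory(deep.values, da.values)
print("shallow copy shares memory:", shallow_shares)
print("deep copy shares memory:", deep_shares)

# A write to the deep copy leaves the original untouched.
deep.loc[0] = 1.0
print("original still all NaN:", bool(da.isnull().all()))
```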