How to set the coordinates of the output of xarray.assign?

Question

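I have been trying to create two new variables based on the latitude coordinate of the data points in an xarray Dataset. However, I only seem to be able to assign new coordinates. The dataset looks like this: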

<xarray.Dataset>
Dimensions:  (lon: 360, lat: 180, time: 412)
Coordinates:
  * lon      (lon) float64 0.5 1.5 2.5 3.5 4.5 ... 355.5 356.5 357.5 358.5 359.5
  * lat      (lat) float64 -89.5 -88.5 -87.5 -86.5 -85.5 ... 86.5 87.5 88.5 89.5
  * time     (time) datetime64[ns] 1981-09-01 1981-10-01 ... 2015-12-01
Data variables:
    evapr    (time, lat, lon) float32 ...
    lhtfl    (time, lat, lon) float32 ...
...

What I have tried so far is:

def get_latitude_band(latitude):
    return np.select(
        condlist=
        [abs(latitude) < 23.45,
         abs(latitude) < 35,
         abs(latitude) < 66.55],
        choicelist=
        ["tropical",
         "sub_tropical",
         "temperate"],
        
        default="frigid"
    )

def get_hemisphere(latitude):
    return np.select(
        [latitude > 0, latitude <=0],
        ["north", "south"]
    )

    
mhw_data = mhw_data \
    .assign(climate_zone=get_latitude_band(mhw_data.lat)) \
    .assign(hemisphere=get_hemisphere(mhw_data.lat)) \
    .reset_index(["hemisphere", "climate_zone"]) \
    .reset_coords()
            
print(mhw_data)

which gets me close:

<xarray.Dataset>
Dimensions:        (lon: 360, lat: 180, time: 412, hemisphere: 180, climate_zone: 180)
Coordinates:
  * lon            (lon) float64 0.5 1.5 2.5 3.5 4.5 ... 356.5 357.5 358.5 359.5
  * lat            (lat) float64 -89.5 -88.5 -87.5 -86.5 ... 86.5 87.5 88.5 89.5
  * time           (time) datetime64[ns] 1981-09-01 1981-10-01 ... 2015-12-01
Dimensions without coordinates: hemisphere, climate_zone
Data variables:
    evapr          (time, lat, lon) float32 ...
    lhtfl          (time, lat, lon) float32 ...
    ...
    hemisphere_    (hemisphere) object 'south' 'south' ... 'north' 'north'
    climate_zone_  (climate_zone) object 'frigid' 'frigid' ... 'frigid' 'frigid'
...

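However, I then want to stack the Dataset and convert it to a DataFrame. I am unable to do so, and I think this is because the new variables hemisphere_ and climate_zone_ do not have time, lat, lon coordinates: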

stacked = mhw_data[var].stack(dim=["lon", "lat", "time"]).to_pandas().T

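which raises a KeyError on "lon".

So my question is: how do I assign new variables to an xarray Dataset so that they keep the original time, lat and lon coordinates?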

Answer 1

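To assign a new variable or coordinate, xarray needs to know the names of its dimensions. There are several ways to define a DataArray or coordinate, but the one closest to what you are currently doing is to provide a tuple of (dim_names, array):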

mhw_data = mhw_data.assign_coords(
    climate_zone=(('lat', ), get_latitude_band(mhw_data.lat)),
    hemisphere=(('lat', ), get_hemisphere(mhw_data.lat)),
)

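Here I am using assign_coords, which defines climate_zone and hemisphere as non-dimension coordinates. You can think of these as additional metadata about the latitudes and your data, rather than data in their own right. This also means they are preserved when you send individual arrays to pandas.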
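As a minimal sketch of the above, the following builds a tiny synthetic stand-in for mhw_data (the sizes and values are made up, and `ds` is a hypothetical name) and attaches the two labels along lat. The hemisphere labels are computed with np.where here rather than the question's get_hemisphere, purely to keep the sketch short.

```python
import numpy as np
import xarray as xr

# Tiny synthetic stand-in for mhw_data; shapes and values are illustrative only.
ds = xr.Dataset(
    {"evapr": (("time", "lat", "lon"), np.zeros((3, 4, 2), dtype="float32"))},
    coords={
        "lon": [0.5, 1.5],
        "lat": [-80.0, -30.0, 10.0, 50.0],
        "time": np.array(
            ["2000-01-01", "2000-02-01", "2000-03-01"], dtype="datetime64[ns]"
        ),
    },
)


def get_latitude_band(latitude):
    # Same thresholds as in the question; the first matching condition wins.
    return np.select(
        [abs(latitude) < 23.45, abs(latitude) < 35, abs(latitude) < 66.55],
        ["tropical", "sub_tropical", "temperate"],
        default="frigid",
    )


lat_values = ds.lat.values
ds = ds.assign_coords(
    # The (dims, array) tuple tells xarray that each label belongs to "lat".
    climate_zone=(("lat",), get_latitude_band(lat_values)),
    hemisphere=(("lat",), np.where(lat_values > 0, "north", "south")),
)

print(ds)
```

The new entries appear under Coordinates with dimension (lat) only, and selecting a single array such as ds.evapr carries them along.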

As for stacking, converting to pandas stacks automatically. The following returns a DataFrame with the variables and non-dimension coordinates as columns and the dimensions as a MultiIndex:

stacked = mhw_data.to_dataframe()
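To illustrate what that DataFrame looks like, here is a small sketch on synthetic data (the dataset `ds` and its values are made up for the example). The hemisphere coordinate is defined only along lat, yet it shows up as a column repeated for every (time, lat, lon) row.

```python
import numpy as np
import xarray as xr

# Synthetic dataset with one data variable and one non-dimension coordinate.
ds = xr.Dataset(
    {
        "evapr": (
            ("time", "lat", "lon"),
            np.arange(12, dtype="float32").reshape(3, 2, 2),
        )
    },
    coords={
        "time": np.array(
            ["2000-01-01", "2000-02-01", "2000-03-01"], dtype="datetime64[ns]"
        ),
        "lat": [-10.0, 40.0],
        "lon": [0.5, 1.5],
        "hemisphere": (("lat",), ["south", "north"]),
    },
)

# One row per (time, lat, lon) combination; dimensions become the MultiIndex.
df = ds.to_dataframe()
# Flatten the MultiIndex into ordinary columns if that is easier to work with.
df_flat = df.reset_index()

print(df.head())
```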

Or, if you want a Series indexed by (lat, lon, time), you can always use expand_dims:

(
    mhw_data.climate_zone
    .expand_dims(lon=mhw_data.lon, time=mhw_data.time)
    .to_series()
)
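A self-contained sketch of the same idea on made-up values follows. The names `climate_zone`, `lon` and `time` below are stand-ins for the corresponding pieces of mhw_data, and plain arrays are passed to expand_dims instead of DataArrays.

```python
import numpy as np
import xarray as xr

# A 1-D label array along lat, standing in for mhw_data.climate_zone.
climate_zone = xr.DataArray(
    np.array(["tropical", "temperate"]),
    coords={"lat": [10.0, 50.0]},
    dims=("lat",),
    name="climate_zone",
)

lon = np.array([0.5, 1.5])
time = np.array(
    ["2000-01-01", "2000-02-01", "2000-03-01"], dtype="datetime64[ns]"
)

# expand_dims adds the new dimensions (with these coordinate values) and
# repeats the labels along them; to_series then yields a MultiIndexed Series.
series = climate_zone.expand_dims(lon=lon, time=time).to_series()

print(series.head())
```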

Answer 2

Two possible solutions that I came up with myself are as follows:

First, stack the xarray data into pandas DataFrames, then create the new columns:

df = None
variables = list(mhw_data.data_vars)

for var in tqdm(variables): 
    
    stacked = mhw_data[var].stack(dim=["lon", "lat", "time"]).to_pandas().T
    if df is None:
        df = stacked
    else:
        df = pd.concat([df, stacked], axis=1)

df.reset_index(inplace=True)
df.columns = list(mhw_data.variables)

df["climate_zone"] = df["lat"].swifter.apply(get_latitude_band)
df["hemisphere"] = df["lat"].swifter.apply(get_hemisphere)

Second, create a new xarray.DataArray for each variable you want to add, then add them to the dataset:

# calculate climate zone and hemisphere from latitude. 
latitudes = mhw_data.lat.values.reshape(-1, 1)

zones = get_latitude_band(latitudes)
hemispheres = get_hemisphere(latitudes)

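# Take advantage of numpy broadcasting to get our data to line up with the xarray shape.
# np.broadcast_to is used because string labels cannot be added to a numeric zeros array.
# This assumes the dataset's dimension order is (lon, lat, time), as in the repr above.
shape = tuple(mhw_data.sizes.values())

zones = np.broadcast_to(zones, shape)
hemispheres = np.broadcast_to(hemispheres, shape)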

# finally, create two new DataArrays and assign them as variables in the dataset. 
zone_xarray = xr.DataArray(data=zones, coords=mhw_data.coords, dims=mhw_data.dims)
hemi_xarray = xr.DataArray(data=hemispheres, coords=mhw_data.coords, dims=mhw_data.dims)

mhw_data["zone"] = zone_xarray
mhw_data["hemisphere"] = hemi_xarray

# ... call the code to stack and convert to pandas (shown in method 1) ...#

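My gut feeling is that method 1 is faster and more memory-efficient, since no repeated values need to be broadcast into a large 3-dimensional array. However, I have not tested this.

My gut feeling is also that there is a simpler, xarray-native way to achieve the same thing, but I could not find it.

One thing is certain: method 1 is more concise, since there is no need to create intermediate arrays or reshape the data.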
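One candidate for a more xarray-native route, as a sketch under assumptions rather than a tested replacement for either method: assign the labels as data variables along lat using a (dims, array) tuple, and let to_dataframe do the broadcasting. If a full 3-D variable is really needed, broadcast_like can expand a 1-D DataArray against an existing variable. The dataset `ds` and its values below are synthetic.

```python
import numpy as np
import xarray as xr

# Synthetic stand-in for mhw_data; sizes and values are illustrative only.
ds = xr.Dataset(
    {"evapr": (("time", "lat", "lon"), np.zeros((3, 2, 2), dtype="float32"))},
    coords={"time": [0, 1, 2], "lat": [-10.0, 40.0], "lon": [0.5, 1.5]},
)

labels = np.where(ds.lat.values > 0, "north", "south")

# Option A: a data variable that lives on lat only.
# to_dataframe repeats it across time and lon when flattening.
ds_a = ds.assign(hemisphere=(("lat",), labels))
df = ds_a.to_dataframe()

# Option B: an explicit (time, lat, lon) variable, built by broadcasting
# a 1-D DataArray against an existing 3-D variable.
hemisphere_1d = xr.DataArray(labels, coords={"lat": ds.lat.values}, dims=("lat",))
ds_b = ds.assign(hemisphere=hemisphere_1d.broadcast_like(ds.evapr))

print(df.head())
print(ds_b.hemisphere.dims)
```

Option A avoids materializing repeated labels inside the Dataset, which matters for a grid the size of the one in the question; Option B matches what method 2 produces.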
