xarray: best way to "insert" a time slice into a dataset or dataarray

Question

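I have a 3-dimensional xarray dataset with dimensions x, y, and time. Assuming I know that an observation is missing at time step n, what is the best way to insert a time slice filled with no-data values?

Here is a working example: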

import xarray as xr
import pandas as pd

x = xr.tutorial.load_dataset("air_temperature")

# assuming this is the missing point in time (currently not in the dataset)
missing = "2014-12-31T07:00:00"

# create an "empty" time slice with fillvalues
empty = xr.full_like(x.isel(time=0), -3000)

# fix the time coordinate of the timeslice
empty['time'] = pd.date_range(missing, periods=1)[0]

# before insertion
print(x.time[-5:].values)

# ['2014-12-30T18:00:00.000000000' '2014-12-31T00:00:00.000000000'
#  '2014-12-31T06:00:00.000000000' '2014-12-31T12:00:00.000000000'
#  '2014-12-31T18:00:00.000000000']

# concat and sort time
x2 = xr.concat([x, empty], "time").sortby("time")

# after insertion
print(x2.time[-5:].values)

# ['2014-12-31T00:00:00.000000000' '2014-12-31T06:00:00.000000000'
#  '2014-12-31T07:00:00.000000000' '2014-12-31T12:00:00.000000000'
#  '2014-12-31T18:00:00.000000000']

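The example works fine, but I am not sure whether this is the best (or even the correct) approach.

My concern is using this with larger datasets, especially dask-array-backed ones.

Is there a better way to fill in a missing 2D array? Would it be better to use a dask-backed "fill array" when inserting into a dask-backed dataset?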

Answer 1

For this, you might consider using xarray's reindex method with a constant fill_value:

import numpy as np
import xarray as xr

x = xr.tutorial.load_dataset("air_temperature")
missing_time = np.datetime64("2014-12-31T07:00:00")
missing_time_da = xr.DataArray([missing_time], dims=["time"], coords=[[missing_time]])
full_time = xr.concat([x.time, missing_time_da], dim="time")
full = x.reindex(time=full_time, fill_value=-3000.0).sortby("time")
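
A variation on the same idea, as a rough sketch only: build the target time index with a pandas index union, which comes back sorted, so the concat and sortby steps may not be needed. The small synthetic dataset below (the names ds, air, and the all-ones values) is made up for illustration, so that the snippet runs without downloading the tutorial data.

```python
import numpy as np
import pandas as pd
import xarray as xr

# Hypothetical stand-in for the tutorial dataset: four 6-hourly steps of ones.
times = pd.to_datetime(
    [
        "2014-12-31T00:00:00",
        "2014-12-31T06:00:00",
        "2014-12-31T12:00:00",
        "2014-12-31T18:00:00",
    ]
)
ds = xr.Dataset(
    {"air": (("time", "y", "x"), np.ones((4, 2, 3)))},
    coords={"time": times, "y": [0, 1], "x": [0, 1, 2]},
)

missing_time = pd.Timestamp("2014-12-31T07:00:00")

# Union of the existing time index and the missing timestamp (sorted by default).
full_time = ds.indexes["time"].union(pd.DatetimeIndex([missing_time]))

# Reindex onto the full index; the new time step is filled with the constant.
filled = ds.reindex(time=full_time, fill_value=-3000.0)

# The inserted slice holds only the fill value; existing slices are unchanged.
new_slice = filled["air"].sel(time=missing_time)
print(filled.time.values)
print(new_slice.values)
```

The design choice here is just to let pandas do the sorting up front, so that reindex receives an already ordered index.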

I think both your approach and the reindex approach will automatically use dask-backed arrays if x is dask-backed.
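
One way to check this yourself, as a minimal sketch: chunk a small dataset, apply both approaches, and look at the chunks attribute of the result, which is None for arrays that are not dask-backed. This assumes dask is installed; the synthetic dataset and the chunk size are made up for illustration.

```python
import numpy as np
import pandas as pd
import xarray as xr

# Hypothetical small dataset, chunked along time so it is dask-backed.
times = pd.to_datetime(
    [
        "2014-12-31T00:00:00",
        "2014-12-31T06:00:00",
        "2014-12-31T12:00:00",
        "2014-12-31T18:00:00",
    ]
)
ds = xr.Dataset(
    {"air": (("time", "y", "x"), np.ones((4, 2, 3)))},
    coords={"time": times, "y": [0, 1], "x": [0, 1, 2]},
).chunk({"time": 2})

missing_time = pd.Timestamp("2014-12-31T07:00:00")

# Approach from the question: fill slice via full_like, then concat and sort.
empty = xr.full_like(ds.isel(time=0), -3000.0).assign_coords(time=missing_time)
combined = xr.concat([ds, empty], "time").sortby("time")

# Approach from this answer: reindex onto the extended time index.
full_time = ds.indexes["time"].union(pd.DatetimeIndex([missing_time]))
filled = ds.reindex(time=full_time, fill_value=-3000.0)

# chunks is None for in-memory (NumPy) arrays and a tuple for dask arrays.
print(combined["air"].chunks is not None)
print(filled["air"].chunks is not None)

# Nothing is computed until explicitly requested.
print(filled["air"].sel(time=missing_time).compute().values)
```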
