1D netcdf to 2D lat lon using xarray and Dask

Question

I have a large netcdf dataset with two dimensions: 'time' and a single spatial dimension 'x'. Each 'x' also has a 'lat' and a 'lon' coordinate.

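This needs to be mapped onto a global half-degree 2D grid, so that the dimensions are 'time', 'lat' and 'lon'. Not every point on the global half-degree grid is in the original dataset, because the original only contains land points, so any point that is not in the original dataset should have the value 0.0.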

The original dataset looks like this:

import xarray as xa
original_dataset = xa.load_dataset('original_dataset.nc')
print(original_dataset['snow_blow_gb'])

<xarray.DataArray 'snow_blow_gb' (time: 60630, x: 25993)>
[1575955590 values with dtype=float64]
Coordinates:
  * time       (time) datetime64[ns] 1850-01-02 1850-01-03 ... 2016-01-01
    latitude   (x) float32 83.75 83.75 83.75 83.75 ... 50.25 50.25 50.25 50.25
    longitude  (x) float32 -36.75 -36.25 -35.75 -35.25 ... 155.2 155.8 156.2
Dimensions without coordinates: x
Attributes:
    long_name:     Fraction of gridbox snowfall redistributed
    units:         1
    cell_methods:  time : mean

The output file should look like this (but is currently empty):

new_dataset = xa.open_dataset('new_dataset.nc')
print(new_dataset)

<xarray.Dataset>
Dimensions:       (time: 60630, lon: 720, lat: 360)
Coordinates:
  * time          (time) datetime64[ns] 1850-01-02 1850-01-03 ... 2016-01-01
  * lon           (lon) float32 -179.8 -179.2 -178.8 ... 178.8 179.2 179.8
  * lat           (lat) float32 89.75 89.25 88.75 88.25 ... -88.75 -89.25 -89.75
Data variables:
    snow_blow_gb  (time, lon, lat) float64 ...

My first attempt:

new_dataset['snow_blow_gb'].loc[{'lat':original_dataset['snow_blow_gb'].latitude,
           'lon':original_dataset['snow_blow_gb'].longitude}] = original_dataset['snow_blow_gb']

However, there was not enough memory, so I tried using dask and first loaded the new dataset in chunks:

new_dataset = xa.open_dataset(f'new_dataset.nc', chunks = {'lat':36,'lon':72})
new_dataset['snow_blow_gb'].loc[{'lat':original_dataset['snow_blow_gb'].latitude,
           'lon':original_dataset['snow_blow_gb'].longitude}] = original_dataset['snow_blow_gb']

I then found that you cannot use assignment with multi-dimensional indexing in dask.

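Assigning each coordinate point in turn with a for loop would take a very long time, is inelegant, and would probably still run out of memory eventually.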

What should I do?

Answer 1

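Even though your data is gridded, it does not currently contain the full set of lat/lon combinations you want in the final grid, so you cannot simply reshape it; you first need to reindex it explicitly. This is easily done by assigning a MultiIndex in place of the stacked index and then unstacking. A major caveat is that this will blow up the size of your data, and you need to make sure your data is chunked along the (not-stacked) time dimension.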
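To make the idea concrete, here is a minimal toy sketch of the MultiIndex-then-unstack step on a tiny in-memory array. The grid, values and variable names are made up for illustration, and it builds the MultiIndex with `set_index`, which is an alternative route to the same index rather than the exact call used later in this answer.

```python
import numpy as np
import pandas as pd
import xarray as xr

# Three "land" points on a 2 x 2 grid; (latitude=1.0, longitude=1.0) is absent.
lats = np.array([0.0, 0.0, 1.0])
lons = np.array([0.0, 1.0, 0.0])

da = xr.DataArray(
    np.arange(6, dtype="float32").reshape(2, 3),
    dims=["time", "x"],
    coords={
        "time": pd.date_range("2000-01-01", periods=2),
        "latitude": ("x", lats),
        "longitude": ("x", lons),
    },
)

# Turn the two per-point coordinates into a MultiIndex on x, then unstack.
# Combinations that never occur along x come out as NaN.
grid = da.set_index(x=["latitude", "longitude"]).unstack("x")
print(grid.isel(time=0).values)
```

The missing grid cell is filled with NaN, which is why a fill step is still needed afterwards if zeros are wanted.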

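In this answer I assume you need the data as a normal (dense) output array. If you can work with sparse arrays, that would save a great deal of memory/storage, but it requires a different approach.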

Walkthrough

Setting up the MRE

import dask.array as dda
import numpy as np
import pandas as pd
import xarray as xr

# set up grid spec
x = np.arange(-179.75, 180, 0.5)
y = np.arange(-89.75, 90, 0.5)
xx, yy = np.meshgrid(x, y)

# filter points to simulate your sparse data structure
dummy_land_mask = (np.random.random(size=(xx.shape)) > 0.9)
x_in_data = xx.ravel()[dummy_land_mask.flat]
y_in_data = yy.ravel()[dummy_land_mask.flat]

# construct random dask array with chunks along time only
arr = dda.random.random(
    (60630, len(x_in_data)),
    chunks=(1000, len(x_in_data)),
).astype('float32')

# build xr.Dataset with dims (time, x) and additional (lat, lon) coords
ds = xr.Dataset({
    'snow_blow_gb': xr.DataArray(
        arr,
        dims=['time', 'x'],
        coords={
            'time': pd.date_range('1850-01-02', freq='D', periods=60630),
            'latitude': (('x',), y_in_data),
            'longitude': (('x',), x_in_data),
        },
     ),
})

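So this should look similar to your data. Note that I chunked the data along time only. It is important that the x dimension is not chunked when you read the data in, so that dask does not have to reshape across chunks: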

In [28]: ds
Out[28]:
<xarray.Dataset>
Dimensions:       (time: 60630, x: 25928)
Coordinates:
  * time          (time) datetime64[ns] 1850-01-02 1850-01-03 ... 2016-01-01
    latitude      (x) float64 -89.75 -89.75 -89.75 -89.75 ... 89.75 89.75 89.75
    longitude     (x) float64 -172.8 -172.2 -163.2 -160.2 ... 167.8 169.2 179.2
Dimensions without coordinates: x
Data variables:
    snow_blow_gb  (time, x) float32 dask.array<chunksize=(1000, 25928), meta=np.ndarray>
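As a rough sketch of how you might enforce that chunk layout on data that is already chunked some other way: the array below is a small made-up stand-in, and the chunk sizes are arbitrary. With the real file, something like `xr.open_dataset('original_dataset.nc', chunks={'time': 1000})` would be the analogous starting point, assuming that chunk size suits your memory budget.

```python
import dask.array as dda
import xarray as xr

# Small stand-in array that is (unhelpfully) chunked along both dimensions.
arr = dda.random.random((100, 50), chunks=(10, 5))
da = xr.DataArray(arr, dims=["time", "x"])

# Rechunk so that x is a single chunk (-1) and only time is split.
da = da.chunk({"time": 25, "x": -1})
print(da.chunks)
```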

You can re-assign the x coordinate to include both the x and y dimensions, using pd.MultiIndex.from_arrays to give the current lat, lon elements along the x dimension:

In [29]: lats = ds.latitude.values
    ...: lons = ds.longitude.values
    ...: ds = ds.drop_vars(['latitude', 'longitude'])
    ...: ds.coords['x'] = pd.MultiIndex.from_arrays([lats, lons], names=['latitude', 'longitude'])

In [30]: ds
Out[30]:
<xarray.Dataset>
Dimensions:       (time: 60630, x: 25928)
Coordinates:
  * time          (time) datetime64[ns] 1850-01-02 1850-01-03 ... 2016-01-01
  * x             (x) MultiIndex
  - latitude      (x) float64 -89.75 -89.75 -89.75 -89.75 ... 89.75 89.75 89.75
  - longitude     (x) float64 -172.8 -172.2 -163.2 -160.2 ... 167.8 169.2 179.2
Data variables:
    snow_blow_gb  (time, x) float32 dask.array<chunksize=(1000, 25928), meta=np.ndarray>

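Now you can unstack x to get the full-rank array. Note that this will produce a bunch of performance warnings. These are expected, because you are in fact reshaping to create a large chunk. You can suppress them (or ignore them), or use smaller chunks; it's up to you. The current chunking scheme will produce chunks of roughly 1 GB in the output dataset: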

In [31]: reshaped = ds.unstack('x')
/opt/miniconda3/envs/myenv/lib/python3.10/site-packages/xarray/core/indexing.py:1228: PerformanceWarning: Slicing is producing a large chunk. To accept the large
chunk and silence this warning, set the option
    >>> with dask.config.set(**{'array.slicing.split_large_chunks': False}):
    ...     array[indexer]

To avoid creating the large chunks, set the option
    >>> with dask.config.set(**{'array.slicing.split_large_chunks': True}):
    ...     array[indexer]
  return self.array[key]
/opt/miniconda3/envs/myenv/lib/python3.10/site-packages/xarray/core/dataset.py:4212: PerformanceWarning: Reshaping is producing a large chunk. To accept the large
chunk and silence this warning, set the option
    >>> with dask.config.set(**{'array.slicing.split_large_chunks': False}):
    ...     array.reshape(shape)

To avoid creating the large chunks, set the option
    >>> with dask.config.set(**{'array.slicing.split_large_chunks': True}):
    ...     array.reshape(shape)Explictly passing ``limit`` to ``reshape`` will also silence this warning
    >>> array.reshape(shape, limit='128 MiB')
  result = result._unstack_full_reindex(dim, fill_value, sparse)

In [32]: reshaped
Out[32]:
<xarray.Dataset>
Dimensions:       (time: 60630, latitude: 360, longitude: 720)
Coordinates:
  * time          (time) datetime64[ns] 1850-01-02 1850-01-03 ... 2016-01-01
  * latitude      (latitude) float64 -89.75 -89.25 -88.75 ... 88.75 89.25 89.75
  * longitude     (longitude) float64 -179.8 -179.2 -178.8 ... 178.8 179.2 179.8
Data variables:
    snow_blow_gb  (time, latitude, longitude) float32 dask.array<chunksize=(1000, 360, 720), meta=np.ndarray>
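If you would rather accept the large chunks and silence the warnings, one option is to wrap the unstack in the `dask.config.set` context that the warning text itself suggests. The sketch below uses a tiny made-up dataset (five points on a 3 x 2 grid, one cell missing) purely so that it runs quickly; the shapes and names are illustrative.

```python
import dask
import dask.array as dda
import numpy as np
import xarray as xr

# Five points on a 3 x 2 grid; (latitude=2.0, longitude=1.0) is missing.
lats = np.array([0.0, 0.0, 1.0, 1.0, 2.0])
lons = np.array([0.0, 1.0, 0.0, 1.0, 0.0])

data = dda.random.random((20, 5), chunks=(10, 5)).astype("float32")
small = xr.Dataset(
    {"snow_blow_gb": (("time", "x"), data)},
    coords={"latitude": ("x", lats), "longitude": ("x", lons)},
).set_index(x=["latitude", "longitude"])

# Accept large chunks for the duration of the unstack only.
with dask.config.set(**{"array.slicing.split_large_chunks": False}):
    small_reshaped = small.unstack("x")

print(small_reshaped["snow_blow_gb"].shape)
```

The result is still lazy; nothing is computed until you write it out or call `.compute()`.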

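At this point you can do cleanup, such as filling NaNs, expanding the dimensions to the full set of x, y coordinates, and so on. The important thing is that the data is still chunked along time with a chunk size of 1000. Note that this MRE will produce an unstacked dataset of over 58 GB (if computed), of which only the original 6 GB is actual data (the rest will be NaNs, which you can fill with zeros). If your model accepts sparse arrays as input, that would definitely be a more efficient approach.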
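The cleanup could look something like the following sketch. It uses a small in-memory stand-in for the unstacked dataset (called `partial` here, a made-up name) that only covers two latitudes and two longitudes, reindexes it onto the full half-degree grid, fills the gaps with 0.0 as the question asks, and renames the dimensions to lat/lon. The same calls should stay lazy on dask-backed data, but treat this as an outline rather than a tested recipe for the full-size file.

```python
import numpy as np
import xarray as xr

# Stand-in for the unstacked result: 2 x 2 patch with one missing cell per step.
values = np.array(
    [[[1.0, np.nan], [2.0, 3.0]],
     [[4.0, np.nan], [5.0, 6.0]]],
    dtype="float32",
)
partial = xr.Dataset(
    {"snow_blow_gb": (("time", "latitude", "longitude"), values)},
    coords={"latitude": [50.25, 50.75], "longitude": [-0.25, 0.25]},
)

# Full global half-degree grid (cell centres), north to south as in the target file.
full_lat = np.arange(89.75, -90, -0.5)
full_lon = np.arange(-179.75, 180, 0.5)

out = (
    partial
    .reindex(latitude=full_lat, longitude=full_lon)  # add the missing rows/columns as NaN
    .fillna(0.0)                                     # ocean / missing cells become 0.0
    .rename({"latitude": "lat", "longitude": "lon"})
)
print(dict(out.sizes))
```

If the target layout has to be (time, lon, lat), a final `out.transpose('time', 'lon', 'lat')` would reorder the dimensions.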

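Also, I'm using float32, whereas your data looks like float64. If you really need that precision, you should double all of my size estimates (so the final product would be about 116 GB). If not, I'd recommend converting with ds['snow_blow_gb'] = ds['snow_blow_gb'].astype('float32') before unstacking.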
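The size estimates above are simple arithmetic on the array shapes, which you can redo for your own dimensions. The sketch below uses the MRE's shapes (25928 points along x, chunks of 1000 time steps); the helper name `size_gib` is made up for this example, and it reports binary gigabytes (GiB).

```python
import numpy as np

n_time, n_lat, n_lon, n_x = 60630, 360, 720, 25928
time_chunk = 1000


def size_gib(n_values, dtype):
    """Size in GiB of n_values elements of the given dtype."""
    return n_values * np.dtype(dtype).itemsize / 2**30


stacked_f32 = size_gib(n_time * n_x, "float32")            # original (time, x) data
dense_f32 = size_gib(n_time * n_lat * n_lon, "float32")    # unstacked grid
dense_f64 = size_gib(n_time * n_lat * n_lon, "float64")    # same grid at double precision
chunk_f32 = size_gib(time_chunk * n_lat * n_lon, "float32")  # one output chunk

print(f"stacked float32: {stacked_f32:.1f} GiB")
print(f"dense float32:   {dense_f32:.1f} GiB")
print(f"dense float64:   {dense_f64:.1f} GiB")
print(f"one chunk:       {chunk_f32:.2f} GiB")
```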
