Identifying the inefficiency and bottleneck in my approach with Dask and Xarray

Question

Please help me identify the bottleneck in my approach, described below:

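I am working with 2 datasets that hold daily values of about 10 variables on a 10x10 km grid. The datasets span 65 years and are about 140 GB when fully loaded.
Without Dask workers and without parallel=True in xr.open_mfdataset, the datasets take about 1 hour to load on the HPC. With 10 Dask workers with 10 CPUs and parallel=True in xr.open_mfdataset, loading takes about 15 minutes.
I am trying to plot the variables for a given lat/lon and time slice, and this is where things slow down again. After slicing and extracting the dataset for a given location, plotting 7 variables takes about 7 minutes on average. And I have to do this iteratively for about 900 locations.
On top of that, the time taken per plot keeps increasing with every iteration, and so does the memory footprint. For example, I generated 22 plots in 7 hours with 5 CPUs on the cluster.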

Here is a sample of my code, run after setting client=Client(n_workers=10):

```python
import matplotlib.pyplot as plt
import xarray as xr
from dask.distributed import Client

client = Client(n_workers=10)

# Open both multi-file datasets lazily, reading file metadata in parallel.
lsmdat = xr.open_mfdataset('../SURFACEMODEL/*/*HIST*', combine='by_coords', parallel=True)
routedat = xr.open_mfdataset('../ROUTING/*/*HIST*', combine='by_coords', parallel=True)

# Restrict to the period of interest, then pick the grid cell nearest the location.
routedat_time = routedat.sel(time=slice('1980-01-01', '2014-12-31'))
routedat_loc = routedat_time.sel(lat=13.98,
                                  lon=75.69,
                                  method='nearest')
lsmdat_time = lsmdat.sel(time=slice('1980-01-01', '2014-12-31'))
lsmdat_loc = lsmdat_time.sel(lat=13.98,
                             lon=75.69,
                             method='nearest')


def subplotCreator(ncdat, ylab, plotnumber, label):
    # Draw one time series on the given subplot row and set its y-axis label.
    ncdat.plot(ax=axarr[plotnumber], label=label)
    axarr[plotnumber].set_title('')
    axarr[plotnumber].set_xlabel('')
    axarr[plotnumber].set_ylabel(ylab, rotation='vertical', size=12)
    axarr[plotnumber].legend()
    axarr[plotnumber].yaxis.set_label_coords(-0.12, 0.5)


fig, axarr = plt.subplots(nrows=7, figsize=(10, 20), sharex=True)

subplotCreator(lsmdat_loc['TotalPrecip_tavg'], 'Precipitation \n(kgm-2s-1)', 0, '')
subplotCreator(lsmdat_loc['Evap_tavg'], 'Surface Water Storage \n (mm)', 1, '')
# The four soil moisture layers share subplot row 2.
subplotCreator(lsmdat_loc['SoilMoist_tavg'].sel(SoilMoist_profiles=0), 'Soil Moisture \n (kg m-2)', 2, 'Layer 1')
subplotCreator(lsmdat_loc['SoilMoist_tavg'].sel(SoilMoist_profiles=1), 'Soil Moisture \n (kg m-2)', 2, 'Layer 2')
subplotCreator(lsmdat_loc['SoilMoist_tavg'].sel(SoilMoist_profiles=2), 'Soil Moisture \n (kg m-2)', 2, 'Layer 3')
subplotCreator(lsmdat_loc['SoilMoist_tavg'].sel(SoilMoist_profiles=3), 'Soil Moisture \n (kg m-2)', 2, 'Layer 4')
subplotCreator(routedat_loc['FloodedFrac_tavg'], 'FloodFraction \n (-)', 3, '')
subplotCreator(routedat_loc['RiverDepth_tavg'], 'RiverDepth \n (m)', 4, '')
subplotCreator(routedat_loc['SWS_tavg'], 'Surface Water Storage \n (m)', 5, '')
subplotCreator(routedat_loc['Streamflow_tavg'], '', 6, 'Simulated')
fig.tight_layout(rect=[0, 0, 1, 1])
plt.show()
```

Answer 1

To help understand the performance of a Dask application, I recommend looking at this documentation page: https://docs.dask.org/en/latest/understanding-performance.html
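Before reaching for Dask-specific tooling, it may help to confirm which step of each iteration is actually growing. The snippet below is only a rough sketch using the standard library: `stage` and `run_iteration` are hypothetical helpers, not part of Dask or Xarray, and the list comprehension and `sum` call are placeholders for the real select/compute and plotting steps. Note that `tracemalloc` only sees allocations made through Python's allocator on the client process, so it will not reflect memory held by Dask workers.

```python
import time
import tracemalloc
from contextlib import contextmanager


@contextmanager
def stage(name, log):
    """Record wall time and peak traced memory for one step (hypothetical helper)."""
    tracemalloc.start()
    start = time.perf_counter()
    try:
        yield
    finally:
        elapsed = time.perf_counter() - start
        _, peak = tracemalloc.get_traced_memory()
        tracemalloc.stop()
        log.append({"stage": name, "seconds": elapsed, "peak_bytes": peak})


def run_iteration(index, log):
    """Stand-in for one location; swap the placeholders for the real calls."""
    with stage(f"compute[{index}]", log):
        # Placeholder for pulling the point data into memory,
        # e.g. lsmdat_loc.load() in the question's code.
        data = [float(i) for i in range(10_000)]
    with stage(f"plot[{index}]", log):
        # Placeholder for the subplotCreator(...) calls and figure output.
        total = sum(data)
    return total


log = []
for i in range(3):
    run_iteration(i, log)

# If "seconds" or "peak_bytes" climbs from one iteration to the next for a
# given stage, that stage is the one to investigate first.
for entry in log:
    print(f"{entry['stage']:<12} {entry['seconds']:.4f} s  peak {entry['peak_bytes']} bytes")
```

If the compute stage dominates, the Dask dashboard (available through the `Client`) is the natural next place to look; if the plot stage keeps growing, figures that are never closed are a common suspect, though that is a guess rather than something the question confirms.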

Tags: python, netcdf, dask, python-xarray