How to use xarray to group by time and then run a bin function on the groups?
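I have a multidimensional netCDF dataset of 'mean direction of total ocean swell' (mdts). Its dimensions are time (hourly), latitude, and longitude. I simply want to group the hourly data by day and then, for each day and each lat/lon grid cell, determine which of 16 predefined directional bins contains the most hours (at most 24). For each lat/lon grid cell, the direction value associated with the bin holding the most hours is then assigned as that cell's direction for that day. I am applying a custom function to the groupby command, and that is where the error occurs. I don't think I understand what is being passed to the function.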
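For a single grid cell and a single day, the calculation described above can be sketched in plain NumPy. The hourly values below are made up for illustration, and the half-bin shift used to find the bin number is one possible way to handle the north bin that wraps around 0 degrees; it is not taken from the original code.

```python
import numpy as np

STEP = 360.0 / 16                 # width of each directional bin (22.5 degrees)
CENTERS = np.arange(16) * STEP    # 0.0, 22.5, ..., 337.5

# Hypothetical hourly directions (degrees) for one grid cell on one day.
hourly = np.array(
    [350.0, 355.0, 358.0, 359.0, 1.0, 2.0, 5.0, 8.0, 10.0, 11.0]  # north bin
    + [100.0] * 6                                                 # 90-degree bin
    + [180.0] * 8                                                 # 180-degree bin
)

# Shifting by half a bin turns the wrapped north bin (348.75 to 11.25)
# into the contiguous range 0 to 22.5, so one integer division finds the bin.
bin_index = (((hourly + STEP / 2) % 360) // STEP).astype(int)

# Number of hours falling in each of the 16 bins.
counts = np.bincount(bin_index, minlength=16)

# Direction assigned to the day: centre of the bin with the most hours.
daily_direction = CENTERS[counts.argmax()]
print(counts, daily_direction)
```

If two bins tie, `argmax` picks the one with the lower index; whether that is acceptable depends on the application.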
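Note: each netCDF file holds one calendar month for the years 1979-2019. I therefore use groupby rather than resample, because resample adds the other 11 months, which are not in the file. I also first convert all times to 00:00 so that groupby can group by day.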
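The difference between the two approaches can be reproduced on synthetic data. The sketch below builds a made-up hourly series covering only two Novembers (a stand-in for the real file, which is not used here) and counts how many daily groups each approach produces.

```python
import numpy as np
import xarray as xr

# Synthetic hourly time axis: November 1979 and November 1980 only.
nov79 = np.arange("1979-11-01", "1979-12-01", dtype="datetime64[h]")
nov80 = np.arange("1980-11-01", "1980-12-01", dtype="datetime64[h]")
time = np.concatenate([nov79, nov80]).astype("datetime64[ns]")

rng = np.random.default_rng(0)
da = xr.DataArray(rng.uniform(0.0, 360.0, time.size),
                  coords={"time": time}, dims="time", name="mdts")

# groupby on the floored timestamps: only days that exist in the data.
floored = da.assign_coords(time=da.time.dt.floor("1D"))
n_groupby = floored.groupby("time").count().sizes["time"]

# resample: a regular daily grid from the first to the last timestamp,
# including every day of the eleven months that have no data.
n_resample = da.resample(time="1D").count().sizes["time"]

print(n_groupby, n_resample)
```

The groupby result has one entry per day actually present, while the resample result also contains the empty days in between.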
Note: my actual code loops over multiple netCDF files. I have simplified it to a single file here.
My simplified code:
import numpy as np
import xarray as xr

ifile = 'mean_direction_total_swell_Nov_1979_2019_hourly.nc'

# min, max, and center values of angle direction bins
min = [348.75, 11.25, 33.75, 56.25, 78.75, 101.25, 123.75, 146.25, 168.75, 191.25, 213.75, 236.25, 258.75, 281.25, 303.75, 326.25]
max = [ 11.25, 33.75, 56.25, 78.75, 101.25, 123.75, 146.25, 168.75, 191.25, 213.75, 236.25, 258.75, 281.25, 303.75, 326.25, 348.75]
dir = [ 0.0, 22.5, 45.0, 67.5, 90.0, 112.5, 135.0, 157.5, 180.0, 202.5, 225.0, 247.5, 270.0, 292.5, 315.0, 337.5]

# custom function that I think is causing the problem
def bins(x):
    bins_n = np.zeros([16], dtype=int)
    # North bin requires 'or' statement
    if(x >= min[0] or x < max[0]): bins_n[0] = bins_n[0] + 1
    # other bins require 'and' statement
    for i in range(1,16,1):  # bins
        if(x >= min[i] and x < max[i]):
            bins_n[i] = bins_n[i] + 1
            break
    slot = np.argmax(bins_n)
    return dir[slot]

idatanc = xr.open_dataset(ifile)
idata = idatanc['mdts']
idata.coords['time'] = idata.time.dt.floor('1D')  # setting all hourly values to 0000
idata_dy = idata.groupby("time").apply(bins)
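What is returned. Note: this error comes from a program that loops over multiple netCDF files, so it may not correspond exactly to the code above. The error is the same.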
Traceback (most recent call last):
  File "<ipython-input-216-82adffe45690>", line 9, in <module>
    idata_dy = idata.groupby("time").apply(bins)
  File "C:\Users\TWHawk\Anaconda3\envs\tim_python36\lib\site-packages\xarray\core\groupby.py", line 815, in apply
    return self.map(func, shortcut=shortcut, args=args, **kwargs)
  File "C:\Users\TWHawk\Anaconda3\envs\tim_python36\lib\site-packages\xarray\core\groupby.py", line 800, in map
    return self._combine(applied, shortcut=shortcut)
  File "C:\Users\TWHawk\Anaconda3\envs\tim_python36\lib\site-packages\xarray\core\groupby.py", line 819, in _combine
    applied_example, applied = peek_at(applied)
  File "C:\Users\TWHawk\Anaconda3\envs\tim_python36\lib\site-packages\xarray\core\utils.py", line 183, in peek_at
    peek = next(gen)
  File "C:\Users\TWHawk\Anaconda3\envs\tim_python36\lib\site-packages\xarray\core\groupby.py", line 799, in <genexpr>
    applied = (maybe_wrap_array(arr, func(arr, *args, **kwargs)) for arr in grouped)
  File "<ipython-input-215-3d060f71ca15>", line 6, in bins
    if(x >= min[0] or x < max[0]): bins_n[0] = bins_n[0] + 1
  File "C:\Users\TWHawk\Anaconda3\envs\tim_python36\lib\site-packages\xarray\core\common.py", line 119, in __bool__
    return bool(self.values)
ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()
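The traceback shows the function being called once per group (`func(arr, ...) for arr in grouped`), and the failure comes from converting a multi-element array to a single boolean. The following sketch reproduces that with plain NumPy, assuming one daily group has the shape (24 hours, lat, lon); the array sizes and values are made up for illustration.

```python
import numpy as np

# Hypothetical stand-in for one daily group: 24 hours x 2 lat x 3 lon.
rng = np.random.default_rng(0)
group = rng.uniform(0.0, 360.0, size=(24, 2, 3))

# `if` and `or` need a single True/False, but the comparison yields
# one boolean per element, so Python cannot decide and raises.
try:
    if group >= 348.75 or group < 11.25:
        pass
    raised = False
except ValueError as err:
    raised = True
    print("ValueError:", err)

# Element-wise alternative: use | (or &) on the boolean arrays,
# then count along the hour axis to get hours per grid cell.
in_north_bin = (group >= 348.75) | (group < 11.25)
hours_in_north_bin = in_north_bin.sum(axis=0)
print(hours_in_north_bin.shape)
```

The same applies to the `and` tests for the other bins, which would need `&` when `x` is an array rather than a scalar.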
I haven't checked the results thoroughly, but I think the code below does what you need:
import numpy as np
import xarray as xr
from scipy import stats

def func(x, axis):
    mode, count = np.apply_along_axis(stats.mode, axis, x)
    return mode.squeeze()

infile = 'mean_direction_total_swell_Nov_1979_2019_hourly.nc'
ds = xr.open_dataset(infile)

# make sure range is 0 <= x < 360
ds['mdts'] = np.mod(ds['mdts'], 360)

# bin the data in 16 directions (17 actually, North appears as the first and
# last bin)
step = 360 / 16
centers = np.r_[np.r_[0: 360: step], 0]
edges = np.r_[0, np.r_[step / 2: 360: step], 360]
ds['mdts_binned_idx'] = (ds['mdts'].dims, np.digitize(ds['mdts'], edges))
ds['mdts_binned'] = (ds['mdts'].dims, centers[ds['mdts_binned_idx'] - 1])

# apply stats.mode to get the modal (most common) value in each day
ds2 = xr.Dataset()
ds2['mdts_mode_1d'] = ds['mdts_binned'].resample(time='1D').reduce(func)
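As a way to check the idea without the real file, here is a minimal sketch that runs the daily modal-bin calculation on a tiny synthetic array using groupby on floored timestamps. It is an alternative formulation, not the code above: the helper name `modal_direction`, the half-bin shift, the coordinate values, and the hard-coded dimension name `time` are all assumptions made for illustration, and missing values (NaN) are not handled.

```python
import numpy as np
import xarray as xr

STEP = 360.0 / 16
CENTERS = np.arange(16) * STEP    # 0.0, 22.5, ..., 337.5

def modal_direction(group):
    """For one day (time, lat, lon), return the busiest bin's centre per cell."""
    # Half-bin shift makes the wrapped north bin contiguous; % 16 is a guard.
    idx = np.floor(((group + STEP / 2) % 360) / STEP).astype(int) % 16
    # Hours per bin: compare against every bin number along a new "bin" dim.
    bin_numbers = xr.DataArray(np.arange(16), dims="bin")
    counts = (idx == bin_numbers).sum("time")
    winner = counts.argmax("bin")   # first bin wins a tie
    return xr.DataArray(CENTERS[winner.values],
                        coords=winner.coords, dims=winner.dims)

# Two synthetic days, one latitude, two longitudes, with known answers.
time = np.arange("1979-11-01", "1979-11-03",
                 dtype="datetime64[h]").astype("datetime64[ns]")
values = np.empty((48, 1, 2))
values[:10, 0, 0] = 355.0     # day 1, cell 0: 20 hours in the north bin
values[10:20, 0, 0] = 5.0
values[20:24, 0, 0] = 90.0
values[:24, 0, 1] = 100.0     # day 1, cell 1: all hours in the 90 bin
values[24:, 0, 0] = 200.0     # day 2, cell 0: all hours in the 202.5 bin
values[24:37, 0, 1] = 300.0   # day 2, cell 1: 13 hours in the 292.5 bin
values[37:, 0, 1] = 10.0

mdts = xr.DataArray(
    values,
    coords={"time": time, "latitude": [10.0], "longitude": [20.0, 20.5]},
    dims=("time", "latitude", "longitude"),
    name="mdts",
)

# Floor timestamps to the day, then apply the function once per day.
mdts = mdts.assign_coords(time=mdts.time.dt.floor("1D"))
daily = mdts.groupby("time").map(modal_direction)
print(daily)
```

Because the function works on the whole (time, lat, lon) block at once, it avoids scalar `if` tests on arrays, and grouping on floored timestamps only produces days that are present in the data.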