Python xarray's open_mfdataset works in an unexpected way
I have 365 daily netCDF files for the year 1980. They sit in a folder that holds data for many years (1979-2013).
When I open the 1980 files with
import glob
import xarray
files = glob.glob("/mnt/nfs/home/solomon/Data/CFSR/NetCDFs_1979-2013/Subset/data_1980*")
ds = xarray.open_mfdataset(files, engine="netcdf4")
the timestamps appear to be wrong. Printing the times gives:
ds.time.sortby("time")
Out[28]:
<xarray.DataArray 'time' (time: 3286)>
array(['1979-01-07T00:00:00.000000000', '1979-01-07T03:00:00.000000000',
'1979-01-07T06:00:00.000000000', ..., '2013-12-23T18:00:00.000000000',
'2013-12-23T21:00:00.000000000', '2013-12-24T00:00:00.000000000'], dtype='datetime64[ns]')
Coordinates:
* time (time) datetime64[ns] 1979-01-07 1979-01-07T03:00:00 ...
Attributes:
standard_name: time
axis: T
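To check whether other files in the folder were being read, I changed the folder's contents (i.e., I removed the 2012 files), but I still got the same time series as before. I'm not sure what is going wrong!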
Out[29]:
<xarray.DataArray 'time' (time: 3286)>
array(['1979-01-07T00:00:00.000000000', '1979-01-07T03:00:00.000000000',
'1979-01-07T06:00:00.000000000', ..., '2013-12-23T18:00:00.000000000',
'2013-12-23T21:00:00.000000000', '2013-12-24T00:00:00.000000000'], dtype='datetime64[ns]')
Coordinates:
* time (time) datetime64[ns] 1979-01-07 1979-01-07T03:00:00 ...
Attributes:
standard_name: time
axis: T
The netCDF files have the following metadata (from ncdump -h):
svimal@lettenmaierlab06:/mnt/nfs/home/solomon/Data/CFSR/NetCDFs_1979-2013/Subset$ ncdump -h data_19800530.nc
netcdf data_19800530 {
dimensions:
lon = 503 ;
lat = 170 ;
time = UNLIMITED ; // (1 currently)
variables:
double lon(lon) ;
lon:standard_name = "longitude" ;
lon:long_name = "longitude" ;
lon:units = "degrees_east" ;
lon:axis = "X" ;
double lat(lat) ;
lat:standard_name = "latitude" ;
lat:long_name = "latitude" ;
lat:units = "degrees_north" ;
lat:axis = "Y" ;
double time(time) ;
time:standard_name = "time" ;
time:units = "hours since 1999-5-16 00:00:00" ;
time:calendar = "standard" ;
time:axis = "T" ;
float air_temp(time, lat, lon) ;
air_temp:long_name = "air temperuature (C)" ;
air_temp:_FillValue = -9.99e+08f ;
air_temp:missing_value = -9.99e+08f ;
float vp(time, lat, lon) ;
vp:long_name = "vapor pressure (kPa)" ;
vp:_FillValue = -9.99e+08f ;
vp:missing_value = -9.99e+08f ;
float pressure(time, lat, lon) ;
pressure:long_name = "pressure (kPa)" ;
pressure:_FillValue = -9.99e+08f ;
pressure:missing_value = -9.99e+08f ;
float windspd(time, lat, lon) ;
windspd:long_name = "wind (m/s)" ;
windspd:_FillValue = -9.99e+08f ;
windspd:missing_value = -9.99e+08f ;
float shortwave(time, lat, lon) ;
shortwave:long_name = "downward shortwave (W/m^2)" ;
shortwave:_FillValue = -9.99e+08f ;
shortwave:missing_value = -9.99e+08f ;
float longwave(time, lat, lon) ;
longwave:long_name = "downward longwave (W/m^2)" ;
longwave:_FillValue = -9.99e+08f ;
longwave:missing_value = -9.99e+08f ;
float precip(time, lat, lon) ;
precip:long_name = "precipitation (mm/hr)" ;
precip:_FillValue = -9.99e+08f ;
precip:missing_value = -9.99e+08f ;
// global attributes:
:CDI = "Climate Data Interface version ?? (http://mpimet.mpg.de/cdi)" ;
:Conventions = "CF-1.4" ;
:history = "Tue Mar 20 14:36:48 2018: ncea -d lat,41.375,83.625 -d lon,181.375,306.875 /mnt/nfs/home/solomon/Data/CFSR/NetCDFs_1979-2013/data_19800530.nc /mnt/nfs/home/solomon/Data/CFSR/NetCDFs_1979-2013/Subset/data_19800530.nc\n",
"Tue Mar 20 14:36:44 2018: cdo -f nc import_binary /mnt/nfs/home/solomon/Data/CFSR/CFSR-LAND_Global_0.25deg_data_changed.ctl /mnt/nfs/home/solomon/Data/CFSR/NetCDFs_1979-2013/data_19800530.nc" ;
:CDO = "Climate Data Operators version 1.7.0 (http://mpimet.mpg.de/cdo)" ;
:NCO = "\"4.5.4\"" ;
:nco_openmp_thread_number = 1 ;
}
The time dimension is listed as
time = UNLIMITED ; // (1 currently)
I'm not sure what this means. Could this be the problem?
Are you sure that your call to glob.glob() returns only netCDF files with times in 1980?
I suggest spot-checking with an explicit loop (skipping open_mfdataset):
import glob
import xarray
files = glob.glob("/mnt/nfs/home/solomon/Data/CFSR/NetCDFs_1979-2013/Subset/data_1980*")
for path in files:
    # open each file on its own and print the time values it contains
    ds = xarray.open_dataset(path, engine="netcdf4")
    print(path, ds.time.values)
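For a full year of files, reading the printed output by eye gets tedious. As a rough sketch (not part of the original answer), the same loop can collect only the files whose decoded years disagree with the year in the filename. The helper names `year_from_filename`, `read_years`, and `files_with_wrong_year` are made up for illustration, and the `data_YYYYMMDD` naming pattern is assumed from the paths in the question.

```python
import os
import re


def year_from_filename(path):
    """Return the year encoded in a name such as data_19800530.nc."""
    match = re.search(r"data_(\d{4})\d{4}", os.path.basename(path))
    if match is None:
        raise ValueError(f"no YYYYMMDD stamp found in {path!r}")
    return int(match.group(1))


def read_years(path):
    """Return the sorted, distinct years in one file's time coordinate."""
    import xarray  # imported lazily so the other helpers work without xarray

    with xarray.open_dataset(path, engine="netcdf4") as ds:
        return sorted({int(year) for year in ds["time"].dt.year.values})


def files_with_wrong_year(paths, reader=read_years):
    """Map each suspicious path to the years it actually contains."""
    suspicious = {}
    for path in sorted(paths):
        years = reader(path)
        # A clean file holds exactly one year: the one in its name.
        if years != [year_from_filename(path)]:
            suspicious[path] = years
    return suspicious
```

Calling `files_with_wrong_year(files)` on the list from the question should return an empty dict if every file really holds only 1980 data.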
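Side note: it is better to pass the glob string directly to open_mfdataset() than to call glob.glob() explicitly. It is a little more concise, and xarray also calls sorted() on the glob string it resolves rather than relying on the platform-specific order of the list returned by glob.glob().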
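A minimal sketch of that side note, assuming the same directory layout as in the question. `resolve_pattern` and `open_year` are hypothetical helper names: the first only shows what a deterministic ordering of the glob result looks like, and the second passes the pattern string straight to open_mfdataset().

```python
import glob


def resolve_pattern(pattern):
    """Expand a glob pattern into a deterministically ordered list of paths."""
    return sorted(glob.glob(pattern))


def open_year(pattern):
    """Open every file matching the pattern as one dataset."""
    import xarray  # imported lazily so resolve_pattern works without xarray

    # The pattern string is handed over as-is; no explicit glob.glob() call.
    return xarray.open_mfdataset(pattern, engine="netcdf4")
```

For the question's data this would be `open_year("/mnt/nfs/home/solomon/Data/CFSR/NetCDFs_1979-2013/Subset/data_1980*")`.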
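Thanks @shoyer! The problem was with my files. Files whose names start with "1980" contained data from other years. This happened because I was modifying the same input control file to create multiple netCDF files in parallel, using: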
cdo -f nc import_binary CFSR-LAND_Global_0.25deg_data_changed.ctl data_19800530.nc
Creating a unique ctl file for each parallel thread solved the problem.
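One way to give each job its own control file is sketched below. This is an illustration under assumptions, not the script actually used: the `${binary}` placeholder, the one-line template in the usage note, and the helper names `write_job_ctl` and `cdo_command` are all hypothetical. Only the cdo argument order is taken from the command above.

```python
import os
import tempfile
from string import Template


def write_job_ctl(template_text, job_name, fields, workdir=None):
    """Render the template into a control file that only this job uses."""
    if workdir is None:
        # Each call without a workdir gets a fresh private directory.
        workdir = tempfile.mkdtemp(prefix="ctl_")
    ctl_path = os.path.join(workdir, f"{job_name}.ctl")
    with open(ctl_path, "w") as handle:
        handle.write(Template(template_text).substitute(fields))
    return ctl_path


def cdo_command(ctl_path, out_path):
    """Build the import command as an argument list for subprocess.run()."""
    return ["cdo", "-f", "nc", "import_binary", ctl_path, out_path]
```

With a template such as `"DSET ^${binary}\n"`, each worker would call `write_job_ctl(...)` once and then run `subprocess.run(cdo_command(ctl_path, out_path), check=True)`, so no job ever reads a control file that another job is editing.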