How to read large NetCDF data sets without using a for loop - Python

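Good morning. I am having trouble reading a large netCDF file containing meteorological data in Python. I have to walk through the file to assemble the records and then insert them into a database, but walking through it and assembling the records takes far too long. I know there must be a more efficient way to do the same thing. At the moment I access the data with the for loops in the code below: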

 import netCDF4 as nc

 content = nc.Dataset(pathFile + file)
 XLONG, XLAT = content.variables["XLONG"], content.variables["XLAT"]
 Times = content.variables["Times"]    # Hours, in b'...' (bytes) format
 RAINC = content.variables["RAINC"]    # Rainfall
 Q2 = content.variables["Q2"]          # Specific humidity
 T2 = content.variables["T2"]          # Temperature
 U10 = content.variables["U10"]        # Zonal wind
 V10 = content.variables["V10"]        # Meridional wind
 SWDOWN = content.variables["SWDOWN"]  # Incident radiation
 PSFC = content.variables["PSFC"]      # Surface pressure
 SST = content.variables["SST"]        # Sea surface temperature
 CLDFRA = content.variables["CLDFRA"]  # Cloud fraction

 for c2 in range(len(XLONG[0])):
     for c3 in range(len(XLONG[0][c2])):
         position += 1
         for hour in range(len(Times)):
             dateH = getDatetimeInit(dateFormatFile.hour) if hour == 0 else getDatetimeForHour(hour, dateFormatFile.hour)
             hourUTC = getHourUTC(hour)

             RAINH = str(RAINC[hour][0][c2][c3])
             Q2H = str(Q2[hour][0][c2][c3])
             T2H = str(convertKelvinToCelsius(T2[hour][0][c2][c3]))
             U10H = str(U10[hour][0][c2][c3])
             V10H = str(V10[hour][0][c2][c3])
             SWDOWNH = str(SWDOWN[hour][0][c2][c3])
             PSFCH = str(PSFC[hour][0][c2][c3])
             SSTH = str(SST[hour][0][c2][c3])
             CLDFRAH = str(CLDFRA[hour][0][c2][c3])

             rowData = [idRun, functions.IDMODEL, idTime, position,
                        dateH.year, dateH.month, dateH.day, dateH.hour,
                        RAINH, Q2H, T2H, U10H, V10H, SWDOWNH, PSFCH, SSTH, CLDFRAH]
             dataProcess.append(rowData)

I would use NumPy. Let us assume you have a netCDF file with 2 variables, "t2" and "slp". Then you could use the following code to vectorize your data:

#!/usr/bin/env ipython
# ---------------------
import numpy as np
from netCDF4 import Dataset
# ---------------------
filein = 'test.nc'
ncin = Dataset(filein)
tair = ncin.variables['t2'][:]   # [:] reads the whole variable into memory
slp = ncin.variables['slp'][:]
ncin.close()
# -------------------------
# flatten each variable into a single column
tairseries = np.reshape(tair, (np.size(tair), 1))
slpseries = np.reshape(slp, (np.size(slp), 1))
# --------------------------
## if you want characters:
#tairseries = np.array([str(val) for val in tairseries])
#slpseries = np.array([str(val) for val in slpseries])
# --------------------------
# one row per data point, one column per variable
rowdata = np.concatenate((tairseries, slpseries), axis=1)
# if you want characters, do this in the end:
row_asstrings = [[str(vv) for vv in val] for val in rowdata]
# ---------------------------

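That said, I do not think using strings is a good idea. In my example the conversion from a numeric array to strings took quite a long time, which is why I did not do it before the concatenation.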
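
If strings really are required, one option is to let NumPy convert the whole array in a single call instead of using a nested list comprehension. The following is only a minimal sketch: the arrays are synthetic stand-ins for the file contents, and whether this is fast enough for a real data set is something you would need to measure.

```python
import numpy as np

# Synthetic stand-ins for two variables of shape (time, lat, lon);
# in practice these would come from the netCDF file.
tair = np.arange(24, dtype=np.float64).reshape(2, 3, 4)
slp = tair + 1000.0

# Flatten each variable to one column and stack the columns side by side.
rowdata = np.column_stack((tair.ravel(), slp.ravel()))

# Convert every element to a string in one vectorized call.
row_asstrings = rowdata.astype(str)
```

The result is a 2-D array of strings with the same shape as `rowdata`; call `row_asstrings.tolist()` if a list of lists is needed for the database driver.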

If you also need some time/location information, you could do it like this:

#!/usr/bin/env ipython
# ---------------------
import numpy as np
from netCDF4 import Dataset
# ---------------------
filein = 'test.nc'
ncin = Dataset(filein)
xin = ncin.variables['lon'][:]
yin = ncin.variables['lat'][:]
timein = ncin.variables['time'][:]
tair = ncin.variables['t2'][:]
slp = ncin.variables['slp'][:]
ncin.close()
# -------------------------
# flatten each variable into a single column
tairseries = np.reshape(tair, (np.size(tair), 1))
slpseries = np.reshape(slp, (np.size(slp), 1))
# --------------------------
## if you want characters:
#tairseries = np.array([str(val) for val in tairseries])
#slpseries = np.array([str(val) for val in slpseries])
# --------------------------
rowdata = np.concatenate((tairseries, slpseries), axis=1)
# if you want characters, do this in the end:
#row_asstrings = [[str(vv) for vv in val] for val in rowdata]
# ---------------------------
# =========================================================
nx = np.size(xin)
ny = np.size(yin)
ntime = np.size(timein)
# expand the 1-D axes to full (time, lat, lon) grids
xm, ym = np.meshgrid(xin, yin)
xmt = np.tile(xm, (ntime, 1, 1))
ymt = np.tile(ym, (ntime, 1, 1))
timem = np.tile(timein[:, np.newaxis, np.newaxis], (1, ny, nx))
# to make sure that the array sizes match, I am using the size of one of the variables
xvec = np.reshape(xmt, (np.size(tair), 1))
yvec = np.reshape(ymt, (np.size(tair), 1))
timevec = np.reshape(timem, (np.size(tair), 1))
rowdata = np.concatenate((xvec, yvec, timevec, tairseries, slpseries), axis=1)
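
The coordinate columns can also be built with broadcasting instead of `meshgrid` and `tile`. This is a swapped-in alternative, not the code from the answer above, and the tiny axes below are made-up values used only to check that the coordinates line up with the flattened data. It assumes 1-D `lon`, `lat` and `time` axes and a variable laid out as (time, lat, lon).

```python
import numpy as np

# Tiny synthetic axes standing in for the 'lon', 'lat' and 'time' variables.
xin = np.array([10.0, 20.0, 30.0])   # nx = 3
yin = np.array([-5.0, 5.0])          # ny = 2
timein = np.array([0.0, 1.0])        # ntime = 2
shape = (timein.size, yin.size, xin.size)

# Synthetic data variable laid out as (time, lat, lon).
tair = np.arange(np.prod(shape), dtype=np.float64).reshape(shape)

# Broadcast each axis to the full grid, then flatten in the same
# row-major order that is used for the data variable.
xvec = np.broadcast_to(xin[np.newaxis, np.newaxis, :], shape).ravel()
yvec = np.broadcast_to(yin[np.newaxis, :, np.newaxis], shape).ravel()
timevec = np.broadcast_to(timein[:, np.newaxis, np.newaxis], shape).ravel()

# One row per (time, lat, lon) point: lon, lat, time, value.
rowdata = np.column_stack((xvec, yvec, timevec, tair.ravel()))
```

`np.broadcast_to` returns a read-only view, so the full-size copy is only made once, when `ravel()` flattens it.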

In any case, for variables of shape (744, 150, 150), vectorizing the 2 variables took less than 2 seconds.
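
To reproduce the row order of the loops in the question (grid point outermost, hour innermost), the time axis can be moved to the end before flattening. The sketch below rests on several assumptions: the variable has the layout (time, level, y, x) implied by the `[hour][0][c2][c3]` indexing, `position` starts at 0 before the loops, and the Kelvin to Celsius conversion is a plain subtraction of 273.15. The array is synthetic.

```python
import numpy as np

# Synthetic stand-in for T2 in Kelvin, laid out as (time, level, y, x);
# only level 0 is used, as in the question's loop.
ntime, ny, nx = 3, 2, 4
T2 = 273.15 + np.arange(ntime * ny * nx, dtype=np.float64).reshape(ntime, 1, ny, nx)

# Select level 0, convert to Celsius, and move time to the last axis so
# that flattening walks the data as the loops do: y, then x, then hour.
t2_celsius = (T2[:, 0] - 273.15).transpose(1, 2, 0).ravel()

# Index columns in the same order.
# Assumes `position` is 0 before the loops, so the first grid point is 1.
position = np.repeat(np.arange(1, ny * nx + 1), ntime)
hour = np.tile(np.arange(ntime), ny * nx)

# One row per (grid point, hour): position, hour, temperature in Celsius.
rows = np.column_stack((position, hour, t2_celsius))
```

The other variables (RAINC, Q2, and so on) could be flattened the same way and appended as extra columns, and the date fields derived from the `hour` column.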