Best way to combine a list of netCDF files into one data frame in R: nested for loops or mapply?
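I am trying to combine multiple netCDF files with multiple variables:
- 6 types of parameters
- 36 years
- 12 months
- 31 days
- 6 Y coordinates
- 5 X coordinates
Each netCDF file contains one month of one year of data for one parameter, so there are 432 * 6 = 2592 files.
How can I best combine all of this into one data frame? In the end it has to produce something like this: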
rowID Date year month day coord.X coord.Y par1 par2 par3 par4 par5 par6
1 1979-01-01 1979 01 01 176 428 3.2 0.005 233.5 0.1 12.2 4.4
..................... 402568 rows in between.................
402570 2014-12-31 2014 12 31 180 433 1.7 0.006 235.7 0.2 0.0 2.7
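How can I best combine it? I have been struggling with this for a while...
Forgive me for not knowing how to make this question reproducible, but there are too many factors involved.
This is where my files come from:
ftp://rfdata:forceDATA@ftp.iiasa.ac.at/WFDEI/
This is what I have so far; I think this is what they call a nested loop?
I usually just try and try until I eventually succeed, but I am finding this one hard work. Any suggestions on the first step are welcome.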
require(ncdf4)
directory  <- "C:/folder/"   # general folder
parameter  <- c("par1", "par2", "par3", "par4", "par5", "par6")  # names of 6 parameters
directory2 <- "_folder2/"    # parameter specific folder
directory3 <- "name"         # last part of folder name
years  <- c("1979", "otheryears", "2014")  # years which are also part of netcdf file name
months <- c("01", "othermonths", "12")     # months which are also part of netcdf file name
x <- 176:180  # X-coordinates
y <- 428:433  # Y-coordinates
require(plyr)
for (p in parameter) {
  assign(paste0(p, "list"), list())
  for (i in years) {
    for (j in months) {
      for (k in x) {
        for (l in y) {
          fileloc <- paste(directory, p, directory2, p, directory3, i, j, ".nc", sep = "")  # location to open
          ncin <- nc_open(fileloc)
          assign(p, ncvar_get(ncin, p))  # extract the desired parameter from the netcdf list "ncin" and store it under the name of the parameter
          day <- ncvar_get(ncin, "day")  # extract the day of month from the netcdf list "ncin"
          par.coord <- paste(p, "[", y, ",", x, ",", "]", sep = "")  # string with function to select coordinates
          temp <- data.frame(i, j, day, p = par.coord)  # store day and parameter in dataframe
          temp <- cbind(date = as.Date(with(temp, paste(i, j, day, sep = "-")), "%Y-%m-%d"), temp, Y = y, X = x)  # add date and coordinates to df
          assign(paste0(p, "list"), list(temp))  # store multiple data frames in a list.. I think?
        }
        assign(paste0(p, "list"), do.call(rbind, data))  # something to bind the dataframes by row in a list
      }
    }
  }
}
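There are many ways to skin a cat like this. If you are new to R, nested loops may be easier to debug. One question to ask yourself is whether the files are primary or whether your conceptual structure is primary. That is, if your conceptual structure specifies a location for which there is no file, what do you want your code to do? If you just want to parse the files that exist, I find it useful to use list.files(, full.names = TRUE, recursive = TRUE) to find the files I want to parse, and then to write a function that parses a single file (and its name) into the data structure I want. From there, it is an lapply or purrr::map.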
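A minimal sketch of that "list the files, parse one file, map over all" pattern, under stated assumptions: the file names follow a hypothetical <parameter>_<YYYY><MM>.nc layout, demo_dir and parse_one are illustrative names, and only the file names are parsed here (empty placeholder files stand in for real netCDF files, so ncdf4 is not needed to run it).

```r
# Assumption: file names look like <parameter>_<YYYY><MM>.nc (hypothetical layout).
demo_dir <- file.path(tempdir(), "nc_demo")
dir.create(demo_dir, showWarnings = FALSE)
demo_names <- c("par1_197901.nc", "par1_197902.nc", "par2_197901.nc")
invisible(file.create(file.path(demo_dir, demo_names)))  # empty placeholder files

# Step 1: find the files that actually exist.
files <- list.files(demo_dir, pattern = "\\.nc$", full.names = TRUE, recursive = TRUE)

# Step 2: a function that handles exactly one file (here only its name).
parse_one <- function(path) {
  stem <- sub("\\.nc$", "", basename(path))
  n <- nchar(stem)
  data.frame(
    file      = basename(path),
    parameter = sub("_.*$", "", stem),                 # text before the first underscore
    year      = as.integer(substr(stem, n - 5, n - 2)),
    month     = as.integer(substr(stem, n - 1, n)),
    stringsAsFactors = FALSE
  )
}

# Step 3: apply it to every file and stack the results by row.
index <- do.call(rbind, lapply(files, parse_one))
print(index)
```

In a real run, parse_one would also open the file and return its values; a year/month combination with no file simply produces no rows, which is the "files are primary" choice described above.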
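To extract all of these netCDF files and group them into one data frame:
- 6 parameters
- 36 years
- 12 months
- 31 days
- 6 Y coordinates
- 5 X coordinates
First, I made sure all *.nc files are in one folder.
Second, I reduced the multiple for loops to a single one, because the year, month and parameter can be read from the file name.
The day, X coordinate and Y coordinate can be extracted together as one array.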
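As a small illustration of that last point, here is a base R sketch of flattening a 3-D array into a data frame. The values are made up, and the dimension order (lon, lat, day) is an assumption; a real file may store its dimensions in a different order.

```r
xcoord <- 176:180   # X-coordinates of interest
ycoord <- 428:433   # Y-coordinates of interest
n_days <- 3         # made-up number of days

# Made-up array standing in for the values read from one file.
arr <- array(seq_len(length(ycoord) * length(xcoord) * n_days),
             dim = c(length(ycoord), length(xcoord), n_days))

# expand.grid varies its first argument fastest, which matches the
# column-major order in which as.vector() unrolls the array.
grid <- expand.grid(lon = ycoord, lat = xcoord, day = seq_len(n_days))
df <- data.frame(grid, value = as.vector(arr))
head(df)
```

The loop that follows does the same job on real files with arrayhelpers::array2df.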
require(arrayhelpers); require(stringr); require(plyr); require(ncdf4)
# store all files from ftp://rfdata:forceDATA@ftp.iiasa.ac.at/WFDEI/ in the following folder:
setwd("C:/folder")
temp <- list.files(pattern = "\\.nc$")          # list all the file names
param <- gsub("_\\S+", "", temp, perl = TRUE)   # extract parameter from file name (text before the first underscore)
temp_year  <- str_sub(temp, -9, -6)             # characters 9 to 6 counted from the end of the file name: the year
temp_month <- str_sub(temp, -5, -4)             # characters 5 to 4 counted from the end of the file name: the month
xcoord <- seq(176, 180, by = 1)                 # the X-coordinates you are interested in
ycoord <- seq(428, 433, by = 1)                 # the Y-coordinates you are interested in
list_var <- list()                              # make an empty list
for (t in seq_along(temp)) {
  temp_netcdf <- nc_open(temp[t])
  n_days <- length(ncvar_get(temp_netcdf, "day"))  # number of days in this file
  temp_day <- rep(seq_len(n_days), length(xcoord) * length(ycoord))  # day numbers, as many as there are values
  dim.order <- sapply(temp_netcdf[["var"]][[param[t]]][["dim"]], function(x) x$name)  # name of each dimension of the array, in file order
  start <- c(lon = 428, lat = 176, tstep = 1)    # starting index along each dimension
  count <- c(lon = 6, lat = 5, tstep = n_days)   # how many values to read along each dimension, counted from start
  tempstore <- ncvar_get(temp_netcdf, param[t], start = start[dim.order], count = count[dim.order])  # array with parameter values
  df_temp <- array2df(tempstore, levels = list(lon = ycoord, lat = xcoord, day = NA), label.x = "value")  # convert array to data frame
  Add_date <- sort(as.Date(paste(temp_year[t], "-", temp_month[t], "-", temp_day, sep = ""), "%Y-%m-%d"), decreasing = FALSE)  # make vector with the dates
  list_var[[t]] <- data.frame(Add_date, df_temp, parameter = param[t])  # add dates to the data frame and store it in the list of all outputs
  nc_close(temp_netcdf)  # close the nc file to avoid errors when working with a lot of files
}
All_NetCDF_var_in1df <- do.call(rbind, list_var)
#### If you want to take a look at the netcdf files first, use:
list2env(
  lapply(setNames(temp, make.names(gsub("\\.nc$", "", temp))), nc_open),
  envir = .GlobalEnv)  # import every opened file as an object in the global environment
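The data frame built by the loop above is in long format (one row per value, with a parameter column), whereas the question asks for one column per parameter. A minimal sketch of that last reshaping step with base R reshape(), using made-up values and only two parameters; the column names Add_date, lon, lat, parameter and value are assumed to match what the loop produces.

```r
# Made-up long-format data: 2 dates x 2 lon x 1 lat x 2 parameters.
long <- expand.grid(
  Add_date  = as.Date(c("1979-01-01", "1979-01-02")),
  lon       = c(428, 429),
  lat       = 176,
  parameter = c("par1", "par2"),
  stringsAsFactors = FALSE
)
long$value <- seq_len(nrow(long))  # placeholder values 1..8

# One row per date and grid cell, one column per parameter.
wide <- reshape(long,
                idvar     = c("Add_date", "lon", "lat"),
                timevar   = "parameter",
                direction = "wide")
names(wide) <- sub("^value\\.", "", names(wide))  # value.par1 -> par1
print(wide)
```

Year, month and day columns can then be derived from Add_date with format(), for example format(wide$Add_date, "%Y").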