Loop over a subset, source a file and save results in a dataframe

Question

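Similar questions have been asked before, but none of them solves my specific problem. I have an .R file ("Mycalculus.R") containing many basic calculations that I need to apply to subsets of a data frame: one subset per year, where the levels of "year" are factors (yearA, yearB, yearC), not numeric values. The file generates a new data frame that I need to save in an Rda file. Here is what I expect the code to look like with a for loop (this one obviously does not work):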

id <- identif(unlist(df$year))
for (i in 1:length(id)){
    data <- subset(df, year == id[i])
    source ("Mycalculus.R", echo=TRUE)
    save(content_df1,file="myresults.Rda")
}

Here is the main data.frame df:

obs    year    income    gender   ageclass    weight
 1     yearA    1000       F         1          10
 2     yearA    1200       M         2          25
 3     yearB    1400       M         2           5
 4     yearB    1350       M         1          11

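And here is what the sourced file "Mycalculus.R" does: it applies a lot of basic calculations to the columns of the data frame called "data" and creates two new data frames, df1 and df2 (df2 is derived from df1). Here is an extract: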

data <- data %>% 
   group_by(gender) %>% 
   mutate(Income_gender = weighted.mean(income, weight))
data <- data %>% 
   group_by(ageclass) %>% 
   mutate(Income_ageclass = weighted.mean(income, weight))

library(GiniWegNeg)
gini=c(Gini_RSV(data$Income_gender, data$weight), Gini_RSV(data$Income_ageclass, data$weight))

df1=data.frame(gini)
colnames(df1) <- c("Income_gender","Income_ageclass")
rownames(df1) <- c("content_df1")

df2=data.frame((1/5)*df1$Income_gender+df1$Income_ageclass)
colnames(df2) <- c("myresult")
rownames(df2) <- c("content_df2")

So in the end, I get two data frames like this:

                    Income_Gender  Income_Ageclass    
content_df1           ....             ....

And for df2:

                    myresult      
content_df2           ....

But I need to save df1 and df2 as Rda files in which the row names content_df1 and content_df2 are given per subset, something like this:

                    Income_Gender  Income_Ageclass    
content_df1_yearA     ....             ....     
content_df1_yearB     ....             ....     
content_df1_yearC     ....             ....

and

                    myresult
content_df2_yearA     ....   
content_df2_yearB     ....    
content_df2_yearC     ....

Currently, my program does not use any loop. It does the job, but it is messy: the code is more than 2500 lines long. (Please don't throw tomatoes at me.)

Could anyone help me with this specific request? Thank you in advance.

Answer 1

If you turn the steps into functions, you can create a workflow like the following:

calcFunc <- function(df) {
  ## Do something to the df, then return it
  df
}

processFunc <- function(fname) {
  ## Read in your table
  x <- read.table(fname)

  ## Do the calculation
  x <- calcFunc(x)

  ## Make a new file name (remember to change the file extension)
  new_fname <- sub("something", "else", fname)

  ## Write the .RData file
  save(x, file = new_fname)
}

### Your workflow
## Generate a vector of files
my_files <- list.files()

## Do the work
res <- lapply(my_files, processFunc)

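Alternatively, don't save the files. Omit the save call in processFunc and return the data.frame, so that lapply gives you a list of data.frame objects. Then use data.table::rbindlist(res) or do.call(rbind, res) to make one big data.frame object.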
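
As a rough illustration of that alternative, here is a minimal base R sketch. It assumes the data are already in memory and split by year rather than read from files; the toy data frame and the body of calcFunc (a weighted mean of income) are made up for the example and are not the asker's actual calculations.

```r
# Toy data mirroring the layout shown in the question (values are illustrative)
df <- data.frame(
  year   = factor(c("yearA", "yearA", "yearB", "yearB")),
  income = c(1000, 1200, 1400, 1350),
  weight = c(10, 25, 5, 11)
)

# Hypothetical per-subset calculation: returns one summary row per subset
calcFunc <- function(d) {
  data.frame(
    year        = as.character(d$year[1]),
    mean_income = weighted.mean(d$income, d$weight)
  )
}

# split() yields one data frame per year; lapply() returns a list of results
res <- lapply(split(df, df$year), calcFunc)

# Row-bind the list into a single data frame
combined <- do.call(rbind, res)
print(combined)
```

Nothing is written to disk here; the combined data frame could be saved once at the end if a file is needed.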

Answer 2

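Consider combining everything into one script that defines functions taking the required arguments, and calling them with lapply(). lapply then returns a list of data frames that you can row-bind into one final df.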

library(dplyr)
library(GiniWegNeg)

runIncomeCalc <- function(data, y){      
  data <- data %>% 
    group_by(gender) %>% 
    mutate(Income_gender = weighted.mean(income, weight))
  data <- data %>% 
    group_by(ageclass) %>% 
    mutate(Income_ageclass = weighted.mean(income, weight))      

  gini <- c(Gini_RSV(data$Income_gender, data$weight), Gini_RSV(data$Income_ageclass, data$weight))

  df1 <- data.frame(gini)
  colnames(df1) <- c("Income_gender","Income_ageclass")
  rownames(df1) <- c(paste0("content_df1_", y))

  return(df1)
}

runResultsCalc <- function(df, y){
  # Wrap the result in a data frame so that colnames()/rownames() can be set
  df2 <- data.frame((1/5) * df$Income_gender + df$Income_ageclass)
  colnames(df2) <- c("myresult")
  rownames(df2) <- c(paste0("content_df2_", y))

  return(df2)
}

dfIncList <- lapply(unique(df$year), function(i) {      
  yeardata <- subset(df, year == i)
  runIncomeCalc(yeardata, i)      
})

dfResList <- lapply(unique(df$year), function(i) {      
  yeardata <- subset(df, year == i)
  df <- runIncomeCalc(yeardata, i) 
  runResultsCalc(df, i)      
})

df1 <- do.call(rbind, dfIncList)
df2 <- do.call(rbind, dfResList)
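
Note that the second lapply() above runs runIncomeCalc again for every year. One possible variation, sketched below, computes the first-stage result once and passes it on with Map(). To keep the sketch runnable without GiniWegNeg, incomeStub and resultStub are hypothetical stand-ins for runIncomeCalc and runResultsCalc, and they use plain and weighted means rather than Gini coefficients.

```r
# Toy data mirroring the layout shown in the question
df <- data.frame(
  year     = factor(c("yearA", "yearA", "yearB", "yearB")),
  income   = c(1000, 1200, 1400, 1350),
  gender   = c("F", "M", "M", "M"),
  ageclass = c(1, 2, 2, 1),
  weight   = c(10, 25, 5, 11)
)

# Hypothetical stand-in for runIncomeCalc(): plain and weighted mean income
incomeStub <- function(data, y) {
  out <- data.frame(
    mean_income  = mean(data$income),
    wmean_income = weighted.mean(data$income, data$weight)
  )
  rownames(out) <- paste0("content_df1_", y)
  out
}

# Hypothetical stand-in for runResultsCalc(): combines the two columns
resultStub <- function(d1, y) {
  out <- data.frame(myresult = (1/5) * d1$mean_income + d1$wmean_income)
  rownames(out) <- paste0("content_df2_", y)
  out
}

years <- as.character(unique(df$year))

# Compute the first-stage result once per year ...
incList <- lapply(years, function(y) incomeStub(subset(df, year == y), y))
# ... and feed each result to the second stage without recomputing it
resList <- Map(resultStub, incList, years)

df1 <- do.call(rbind, incList)
df2 <- do.call(rbind, resList)
print(df1)
print(df2)
```

The row names carry the year suffix because each function sets them before the pieces are row-bound.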

Now, if you need to source the code from another script, define the same two functions, runIncomeCalc and runResultsCalc, in Mycalculus.R and then call each of them from the other script:

library(dplyr)
library(GiniWegNeg)

if(!exists("runIncomeCalc", mode="function")) source("Mycalculus.R")

dfIncList <- lapply(unique(df$year), function(i) {      
  yeardata <- subset(df, year == i)
  runIncomeCalc(yeardata, i)      
})

dfResList <- lapply(unique(df$year), function(i) {      
  yeardata <- subset(df, year == i)
  df <- runIncomeCalc(yeardata, i) 
  runResultsCalc(df, i)      
})

df1 <- do.call(rbind, dfIncList)
df2 <- do.call(rbind, dfResList)
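
The question also asks for the results to end up in an Rda file, which the code above stops short of. A minimal sketch of that last step follows, assuming df1 and df2 have the shape described in the question; the numbers are placeholders, not real results, and the file is written to a temporary directory purely for illustration.

```r
# Placeholder results with the row-name scheme requested in the question
df1 <- data.frame(
  Income_gender   = c(0.10, 0.20),
  Income_ageclass = c(0.30, 0.40),
  row.names = c("content_df1_yearA", "content_df1_yearB")
)
df2 <- data.frame(
  myresult = (1/5) * df1$Income_gender + df1$Income_ageclass,
  row.names = c("content_df2_yearA", "content_df2_yearB")
)

# Save both objects once, after all years have been combined
rda_file <- file.path(tempdir(), "myresults.Rda")
save(df1, df2, file = rda_file)

# load() restores the objects under their original names;
# a separate environment avoids overwriting anything in the workspace
restored <- new.env()
loaded_names <- load(rda_file, envir = restored)
print(loaded_names)
print(restored$df1)
```

Saving once after the row-bind avoids the problem in the original for loop, where each iteration would overwrite the same file.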
