R: if statement inside function (lapply)

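I have a large list of data frames containing environmental variables from different localities. For each data frame in the list, I want to summarize the values across localities (= group measurements from the same locality together), using the name of the data frame as the condition for which variable to summarize. For instance, for a data frame named 'salinity' I want to summarize only salinity, not the other environmental variables. Note that the different data frames contain data from different localities, so I cannot simply merge them into one data frame.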

Let's do this with a dummy data set:

library(dplyr)  # provides %>%, group_by() and summarise() used below

#create list of dataframes
df1 = data.frame(locality = c(1, 2, 2, 5, 7, 7, 9),
                 Temp = c(14, 15, 16, 18, 20, 18, 21),
                 Sal = c(16, NA, NA, 12, NA, NA, 9))

df2 = data.frame(locality = c(1, 1, 3, 6, 8, 9, 9),
                 Temp = c(1, 2, 4, 5, 0, 2, -1),
                 Sal = c(18, NA, NA, NA, 36, NA, NA))

df3 = data.frame(locality = c(1, 3, 4, 4, 5, 5, 9),
                 Temp = c(14, NA, NA, NA, 17, 18, 21),
                 Sal = c(16, 8, 24, 23, 11, 12, 9))

df4 = data.frame(locality = c(1, 1, 1, 4, 7, 8, 10),
                 Temp = c(1, NA, NA, NA, NA, 0, 2),
                 Sal = c(18, 17, 13, 16, 20, 36, 30))

df_list = list(df1, df2, df3, df4)
names(df_list) = c("Summer_temperature", "Winter_temperature",
                   "Summer_salinity", "Winter_salinity")

Next, I summarized the environmental variables using lapply:

#select only those dataframes in the list that have either 'salinity' or 'temperature' in the dataframe names
df_sal = df_list[grep("salinity", names(df_list))]  
df_temp = df_list[grep("temperature", names(df_list))]  

#use lapply to summarize salinity or temperature values in each dataframe
##salinity
df_sal2 = lapply(df_sal, function(x) {
      x %>%
        group_by(locality) %>% 
        summarise(Sal = mean(Sal, na.rm = TRUE)) 
    })
        
##temperature
df_temp2 = lapply(df_temp, function(x) {
      x %>%
        group_by(locality) %>% 
        summarise(Temp = mean(Temp, na.rm = TRUE)) 
    })

Now, this code is repetitive, so I would like to condense it by combining everything into one function. This is what I tried:

df_env = lapply(df_list, function(x) {
  if (grepl("salinity", names(x)) == TRUE) {x %>% group_by(locality) %>% summarise(Sal = mean(Sal, na.rm = TRUE))}
  if (grepl("temperature", names(x)) == TRUE) {x %>% group_by(locality) %>% summarise(Temp = mean(Temp, na.rm = TRUE))}
  })

But I get the following output:

$Summer_temperature
NULL

$Winter_temperature
NULL

$Summer_salinity
NULL

$Winter_salinity
NULL

along with the following warning messages:

Warning messages:
1: In if (grepl("salinity", names(x)) == TRUE) { :
  the condition has length > 1 and only the first element will be used
2: In if (grepl("temperature", names(x)) == TRUE) { :
  the condition has length > 1 and only the first element will be used
3: In if (grepl("salinity", names(x)) == TRUE) { :
  the condition has length > 1 and only the first element will be used
4: In if (grepl("temperature", names(x)) == TRUE) { :
  the condition has length > 1 and only the first element will be used
5: In if (grepl("salinity", names(x)) == TRUE) { :
  the condition has length > 1 and only the first element will be used
6: In if (grepl("temperature", names(x)) == TRUE) { :
  the condition has length > 1 and only the first element will be used
7: In if (grepl("salinity", names(x)) == TRUE) { :
  the condition has length > 1 and only the first element will be used
8: In if (grepl("temperature", names(x)) == TRUE) { :
  the condition has length > 1 and only the first element will be used

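Now, I have read that this warning message can be resolved by using ifelse. However, in the final data set I will have more than two environmental variables, so I would have to add many more if statements, and therefore I don't think ifelse is the solution here. Does anyone have an elegant solution to my problem? I am new to using both functions and lapply, and would appreciate any help you can give me.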

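Edit:

I tried the else if option suggested in one of the answers, but this still returns NULL values. I also tried using return and assigning the output to x, but both have the same problem as the code below. Any ideas?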

#else if
df_env = lapply(df_list, function(x) {
  if (grepl("salinity", names(x)) == TRUE) {
    x %>% group_by(locality) %>% 
      summarise(Sal = mean(Sal, na.rm = TRUE))}
  else if (grepl("temperature", names(x)) == TRUE) {
    x %>% group_by(locality) %>% 
      summarise(Temp = mean(Temp, na.rm = TRUE))}
})
df_env

I think what is happening is that my if condition is not being passed to the summarise function, so nothing gets summarised.

A few things are going on here, including:

  1. As akrun said, an if statement must have a condition of length 1. Yours does not.

    grepl("locality", names(df1))
    # [1]  TRUE FALSE FALSE
    

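    It must be reduced so that it is always exactly length 1. Frankly, grepl is the wrong tool here, since technically a column named notlocality would also match, and then things would go wrong. I suggest you change it to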

    "locality" %in% names(df1)
    # [1] TRUE
    
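  2. You need to return something. Always. You moved from if ...; if ...; to if ... else if ..., which is a good start, but if neither condition is met, nothing is returned (the function gives back NULL). I suggest one of the following: add one more } else x, or reassign as if (..) { x <- x %>% ...; } else if (..) { x <- x %>% ... ; }, and then end the anon-func with just x (to return it).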
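
Both points can be checked in a few lines of base R. This is only an illustrative sketch: the vector cols mirrors the column names of the example frames, and f_two_ifs and f_else are hypothetical helper names that are not part of the question's code.

```r
cols <- c("locality", "Temp", "Sal")   # column names as in the example frames

# grepl() is vectorised: one logical per column name, so it cannot go into if() as-is
cond_vec <- grepl("locality", cols)
length(cond_vec)                        # 3

# %in% with a single value on the left always yields a length-1 logical
cond_one <- "locality" %in% cols
length(cond_one)                        # 1

# partial matching: a column called "notlocality" also matches the pattern
grepl("locality", "notlocality")        # TRUE
"locality" %in% "notlocality"           # FALSE

# An if without else evaluates to NULL when its condition is FALSE,
# and a function returns the value of its last expression.
f_two_ifs <- function(flag) {
  if (flag == "a") "first"
  if (flag == "b") "second"   # only this last if determines the return value
}
f_else <- function(flag) {
  if (flag == "a") "first" else if (flag == "b") "second" else "unchanged"
}

is.null(f_two_ifs("a"))   # TRUE: the result of the first if is discarded
f_else("a")               # "first"
f_else("z")               # "unchanged"
```

With two separate if statements, only the last one decides what the function returns, which is why chaining them with else (and ending with a fallback) matters.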

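However, I think the ultimate problem is that you are looking for "temperature" or "salinity", which are in the names of the list object, not in the frames themselves. For instance, your reference to names(x) is returning c("locality", "Temp", "Sal"), the names of the frame x itself.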
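
To see the difference between the list names and the column names, here is a small base R sketch. The list demo_list is a hypothetical miniature of df_list, made up purely for illustration.

```r
# Hypothetical miniature of df_list: two named one-row frames
demo_list <- list(
  Summer_salinity    = data.frame(locality = 1, Temp = 14, Sal = 16),
  Winter_temperature = data.frame(locality = 1, Temp = 1,  Sal = 18)
)

# Inside lapply(), x is just the data frame, so names(x) gives its column names
seen_inside <- lapply(demo_list, function(x) names(x))
seen_inside$Summer_salinity          # "locality" "Temp" "Sal"

# The list names are only visible from outside the function ...
names(demo_list)                     # "Summer_salinity" "Winter_temperature"

# ... so pass them in explicitly as a second argument, e.g. with Map()
seen_with_name <- Map(function(x, nm) nm, demo_list, names(demo_list))
seen_with_name$Winter_temperature    # "Winter_temperature"
```

lapply hands each element to the function without its name, so any test on the list name has to receive that name as a separate argument.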

I think this is what you want?

Map(function(x, nm) {
  if (grepl("salinity", nm)) {
    x %>%
      group_by(locality) %>%
      summarize(Sal = mean(Sal, na.rm = TRUE))
  } else if (grepl("temperature", nm)) {
    x %>%
      group_by(locality) %>%
      summarize(Temp = mean(Temp, na.rm = TRUE))
  } else x
}, df_list, names(df_list))
# $Summer_temperature
# # A tibble: 5 x 2
#   locality  Temp
#      <dbl> <dbl>
# 1        1  14  
# 2        2  15.5
# 3        5  18  
# 4        7  19  
# 5        9  21  
# $Winter_temperature
# # A tibble: 5 x 2
#   locality  Temp
#      <dbl> <dbl>
# 1        1   1.5
# 2        3   4  
# 3        6   5  
# 4        8   0  
# 5        9   0.5
# $Summer_salinity
# # A tibble: 5 x 2
#   locality   Sal
#      <dbl> <dbl>
# 1        1  16  
# 2        3   8  
# 3        4  23.5
# 4        5  11.5
# 5        9   9  
# $Winter_salinity
# # A tibble: 5 x 2
#   locality   Sal
#      <dbl> <dbl>
# 1        1    16
# 2        4    16
# 3        7    20
# 4        8    36
# 5       10    30
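
Since the question mentions that the real data will have more than two environmental variables, one possible way to avoid a growing if / else if chain is a lookup from name pattern to column. The following is a sketch under assumptions, not part of the answer above: var_lookup, summarise_by_name and demo_list are hypothetical names, the demo data are shortened for illustration, and it assumes a dplyr version that provides across() (1.0.0 or later).

```r
library(dplyr)

# Hypothetical lookup: pattern expected in the list name -> column to summarise.
# Add one entry per additional environmental variable.
var_lookup <- c(salinity = "Sal", temperature = "Temp")

summarise_by_name <- function(x, nm) {
  # which patterns occur in this list name? (one logical per pattern)
  found <- vapply(names(var_lookup), grepl, logical(1), x = nm, fixed = TRUE)
  hit <- names(var_lookup)[found]
  if (length(hit) != 1) return(x)   # no match or ambiguous: return the frame unchanged
  col <- var_lookup[[hit]]
  x %>%
    group_by(locality) %>%
    summarise(across(all_of(col), ~ mean(.x, na.rm = TRUE)))
}

# Shortened, made-up demo data in the same shape as df_list
demo_list <- list(
  Summer_temperature = data.frame(locality = c(1, 2, 2),
                                  Temp = c(14, 15, 16),
                                  Sal = c(16, NA, NA)),
  Winter_salinity    = data.frame(locality = c(1, 1, 4),
                                  Temp = c(1, NA, NA),
                                  Sal = c(18, 17, 16))
)

res <- Map(summarise_by_name, demo_list, names(demo_list))
res$Summer_temperature   # Temp: 14 for locality 1, 15.5 for locality 2
res$Winter_salinity      # Sal: 17.5 for locality 1, 16 for locality 4
```

The if inside the helper still tests a single length-1 condition, and every branch returns something, so both issues described above are avoided while the per-variable logic lives in one place.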