R: if statement inside function (lapply)
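I have a large list of data frames containing environmental variables from different localities. For each data frame in the list, I want to summarize the values across localities (= group measurements from the same locality), using the name of the data frame as the condition for which variable needs to be summarized. For example, for a data frame whose name contains 'salinity', I only want to summarize salinity, not the other environmental variables. Note that the different data frames contain data from different localities, so I can't simply merge them into one data frame.
Let's do this with a dummy dataset: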
# dplyr provides %>%, group_by() and summarise() used below
library(dplyr)
# create list of data frames
df1 = data.frame(locality = c(1, 2, 2, 5, 7, 7, 9),
                 Temp = c(14, 15, 16, 18, 20, 18, 21),
                 Sal = c(16, NA, NA, 12, NA, NA, 9))
df2 = data.frame(locality = c(1, 1, 3, 6, 8, 9, 9),
                 Temp = c(1, 2, 4, 5, 0, 2, -1),
                 Sal = c(18, NA, NA, NA, 36, NA, NA))
df3 = data.frame(locality = c(1, 3, 4, 4, 5, 5, 9),
                 Temp = c(14, NA, NA, NA, 17, 18, 21),
                 Sal = c(16, 8, 24, 23, 11, 12, 9))
df4 = data.frame(locality = c(1, 1, 1, 4, 7, 8, 10),
                 Temp = c(1, NA, NA, NA, NA, 0, 2),
                 Sal = c(18, 17, 13, 16, 20, 36, 30))
df_list = list(df1, df2, df3, df4)
names(df_list) = c("Summer_temperature", "Winter_temperature",
                   "Summer_salinity", "Winter_salinity")
Next, I summarized the environmental variables using lapply:
# select only those data frames in the list that have either 'salinity' or 'temperature' in their names
df_sal = df_list[grep("salinity", names(df_list))]
df_temp = df_list[grep("temperature", names(df_list))]
# use lapply to summarize salinity or temperature values in each data frame
## salinity
df_sal2 = lapply(df_sal, function(x) {
  x %>%
    group_by(locality) %>%
    summarise(Sal = mean(Sal, na.rm = TRUE))
})
## temperature
df_temp2 = lapply(df_temp, function(x) {
  x %>%
    group_by(locality) %>%
    summarise(Temp = mean(Temp, na.rm = TRUE))
})
Now, this code is repetitive, so I wanted to condense it by combining everything into one function. This is what I tried:
df_env = lapply(df_list, function(x) {
if (grepl("salinity", names(x)) == TRUE) {x %>% group_by(locality) %>% summarise(Sal = mean(Sal, na.rm = TRUE))}
if (grepl("temperature", names(x)) == TRUE) {x %>% group_by(locality) %>% summarise(Temp = mean(Temp, na.rm = TRUE))}
})
But I get the following output:
$Summer_temperature
NULL
$Winter_temperature
NULL
$Summer_salinity
NULL
$Winter_salinity
NULL
along with the following warning messages:
Warning messages:
1: In if (grepl("salinity", names(x)) == TRUE) { :
the condition has length > 1 and only the first element will be used
2: In if (grepl("temperature", names(x)) == TRUE) { :
the condition has length > 1 and only the first element will be used
3: In if (grepl("salinity", names(x)) == TRUE) { :
the condition has length > 1 and only the first element will be used
4: In if (grepl("temperature", names(x)) == TRUE) { :
the condition has length > 1 and only the first element will be used
5: In if (grepl("salinity", names(x)) == TRUE) { :
the condition has length > 1 and only the first element will be used
6: In if (grepl("temperature", names(x)) == TRUE) { :
the condition has length > 1 and only the first element will be used
7: In if (grepl("salinity", names(x)) == TRUE) { :
the condition has length > 1 and only the first element will be used
8: In if (grepl("temperature", names(x)) == TRUE) { :
the condition has length > 1 and only the first element will be used
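Now, I have read that ifelse can be used to resolve this warning message. However, in the final dataset I will have more than two environmental variables, so I would have to add more if statements; therefore I don't think ifelse is the solution here. Does anyone have an elegant solution to my problem? I'm new to both functions and lapply, and would appreciate any help you can give me.
Edit:
I tried the else if option suggested in one of the answers, but this still returns NULL values. I also tried return and assigning the output to x, but both have the same problem as the code below. Any ideas?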
#else if
df_env = lapply(df_list, function(x) {
if (grepl("salinity", names(x)) == TRUE) {
x %>% group_by(locality) %>%
summarise(Sal = mean(Sal, na.rm = TRUE))}
else if (grepl("temperature", names(x)) == TRUE) {
x %>% group_by(locality) %>%
summarise(Temp = mean(Temp, na.rm = TRUE))}
})
df_env
I think what is happening is that my if argument isn't passed to the summarize function, so nothing is being summarized.
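There are a couple of things going on here, including:
1. As akrun said, an if statement must have a condition of length 1. Yours does not.
grepl("locality", names(df1))
# [1]  TRUE FALSE FALSE
That must be reduced so that it is always exactly length 1. Frankly, grepl is the wrong tool here, since technically a column named notlocality would match and then it would error. I suggest you change to
"locality" %in% names(df1)
# [1] TRUE
2. You need to return something. Always. You shifted from if ...; if ...; to if ... else if ..., which is a good start, but if neither condition is met, nothing (NULL) is returned. I suggest one of the following: add one more } else x, or reassign as if (..) { x <- x %>% ...; } else if (..) { x <- x %>% ... ; } and then end the anonymous function with just x (to return it).
However, I think the ultimate problem is that you are looking for "temperature" or "salinity", which are in the names of the list elements, not in the frames themselves. For instance, your reference to names(x) returns c("locality", "Temp", "Sal"), the column names of the frame x itself.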
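To make the first and last points concrete, here is a minimal base-R sketch. The one-element list is a cut-down stand-in for df_list (an assumption for brevity, not the full data). It shows what the anonymous function actually sees in names(x), why the condition ends up with length 3, and how grepl and %in% differ on a name like notlocality. Note that since R 4.2.0 an if condition of length greater than one is an error rather than a warning.

```r
# Cut-down stand-in for the question's df_list (assumption: one element is enough)
df_list <- list(
  Summer_salinity = data.frame(locality = c(1, 3, 4),
                               Temp = c(14, NA, NA),
                               Sal = c(16, 8, 24))
)

# lapply hands each data frame to the function without its list name,
# so names(x) returns the column names
seen <- lapply(df_list, function(x) names(x))
seen$Summer_salinity
# [1] "locality" "Temp"     "Sal"

# One logical per column: not usable as an if condition
cond <- grepl("salinity", seen$Summer_salinity)
length(cond)
# [1] 3

# The keyword lives in the names of the list instead
grepl("salinity", names(df_list))
# [1] TRUE

# grepl matches substrings, %in% requires an exact match
grepl("locality", "notlocality")
# [1] TRUE
"locality" %in% "notlocality"
# [1] FALSE
```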
I think this is what you want?
Map(function(x, nm) {
  if (grepl("salinity", nm)) {
    x %>%
      group_by(locality) %>%
      summarize(Sal = mean(Sal, na.rm = TRUE))
  } else if (grepl("temperature", nm)) {
    x %>%
      group_by(locality) %>%
      summarize(Temp = mean(Temp, na.rm = TRUE))
  } else x
}, df_list, names(df_list))
# $Summer_temperature
# # A tibble: 5 x 2
#   locality  Temp
#      <dbl> <dbl>
# 1        1  14
# 2        2  15.5
# 3        5  18
# 4        7  19
# 5        9  21
# $Winter_temperature
# # A tibble: 5 x 2
#   locality  Temp
#      <dbl> <dbl>
# 1        1   1.5
# 2        3   4
# 3        6   5
# 4        8   0
# 5        9   0.5
# $Summer_salinity
# # A tibble: 5 x 2
#   locality   Sal
#      <dbl> <dbl>
# 1        1  16
# 2        3   8
# 3        4  23.5
# 4        5  11.5
# 5        9   9
# $Winter_salinity
# # A tibble: 5 x 2
#   locality   Sal
#      <dbl> <dbl>
# 1        1    16
# 2        4    16
# 3        7    20
# 4        8    36
# 5       10    30
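Since the question mentions that the real data will have more than two variables, one way to avoid a growing if ... else if chain is a lookup table from name keyword to column. The sketch below is a variant under stated assumptions, not part of the answer above: vars and summarise_by_name are hypothetical names, only two of the four dummy frames are reproduced, and base R's aggregate stands in for the dplyr pipeline so that the example has no package dependencies.

```r
# Two of the question's dummy frames, keyed by their list names
df_list <- list(
  Summer_temperature = data.frame(locality = c(1, 2, 2, 5, 7, 7, 9),
                                  Temp = c(14, 15, 16, 18, 20, 18, 21),
                                  Sal = c(16, NA, NA, 12, NA, NA, 9)),
  Summer_salinity = data.frame(locality = c(1, 3, 4, 4, 5, 5, 9),
                               Temp = c(14, NA, NA, NA, 17, 18, 21),
                               Sal = c(16, 8, 24, 23, 11, 12, 9))
)

# Hypothetical lookup table: keyword in the list name -> column to summarise.
# Adding a variable means adding one entry here, not another if branch.
vars <- c(salinity = "Sal", temperature = "Temp")

summarise_by_name <- function(x, nm) {
  # One TRUE/FALSE per keyword, testing the (length-1) list name
  hit <- vapply(names(vars), function(k) grepl(k, nm, fixed = TRUE), logical(1))
  # No keyword or several keywords matched: return the frame unchanged
  if (sum(hit) != 1) return(x)
  col <- unname(vars[hit])
  # Mean of the chosen column per locality, ignoring NA values
  aggregate(x[col], by = x["locality"], FUN = mean, na.rm = TRUE)
}

# Map passes each data frame together with its name
df_env <- Map(summarise_by_name, df_list, names(df_list))
df_env$Summer_salinity$Sal
# [1] 16.0  8.0 23.5 11.5  9.0
```

With dplyr loaded, the aggregate call could presumably be swapped for the group_by and summarise pipeline from the answer; the lookup-table idea stays the same.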