ifelse statement to assign values to a new column, working with lists of numeric values

Question

I have a data frame that looks like this:

# Minimal example dataframe

identifier <- c(
  "A",
  "B",
  "C",
  "D",
  "E",
  "F"
)

value_1 <- c(
  "1231811, 1231877",
  "1231911, 1233069, 1232767",
  "1231919",
  NA,
  "1232135, 1233145",
  NA
)

value_2 <- c(
  1231811,
  190477,
  922661,
  950711,
  992647,
  NA
  
)

value_3 <- c(
  1231877,
  1233069,
  9774041,
  9774041,
  1314063,
  1231379
  
)

test_df <- data.frame(identifier, value_1, value_2, value_3)

  identifier                   value_1 value_2 value_3
1          A          1231811, 1231877 1231811 1231877
2          B 1231911, 1233069, 1232767  190477 1233069
3          C                   1231919  922661 9774041
4          D                      <NA>  950711 9774041
5          E          1232135, 1233145  992647 1314063
6          F                      <NA>      NA 1231379

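I want to create a new column, final_value, and fill it with a single value taken from value_1, value_2, or value_3. The order of preference is: a value in value_1 that matches value_2, then a value in value_1 that matches value_3. If value_1 is not NA and none of its values match value_2 or value_3, I want to fill final_value with the first value in the comma-separated value_1 string. If value_1 is NA, fill final_value with value_2 or, if that is also NA, with value_3. The final data frame would look like this: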

  identifier                   value_1 value_2 value_3 final_value
1          A          1231811, 1231877 1231811 1231877 1231811 # 1231811 from value_1 matches value_2 (preferred match)
2          B 1231911, 1233069, 1232767  190477 1233069 1233069 # no values from value_1 match value_2; however, 1233069 from value_1 matches value_3
3          C                   1231919  922661 9774041 1231919 # no values from value_1 match other columns; just fill with value_1
4          D                      <NA>  950711 9774041 950711  # value_1 is NA, so fill in with value_2
5          E          1232135, 1233145  992647 1314063 1232135 # no values from value_1 match other columns, fill with first item from value_1 list
6          F                      <NA>      NA 1231379 1231379 # value_1 and value_2 are NA, so fill in with value_3

Here is my current approach:

library(purrr)
library(dplyr)
library(stringr) # provides str_split(), used below

# change value_1 column into a list of numeric values 
test_df <- test_df%>% mutate(value_1 = map(value_1,function(x) (as.numeric(unlist(str_split(x,","))))))

# create a new column to hold the final selected value
test_df$final_value <- NA

# ifelse statement
test_df$final_value <- 
  
  # if any of the elements in value_1 match the value_2 value, fill the new column with value_2
  ifelse(!is.na(test_df$value_1) & test_df$value_1 %in% test_df$value_2, test_df$value_2,
         
         # otherwise, if a value in value_1 matches value_3, fill in with value_3
         ifelse(!is.na(test_df$value_1) & test_df$value_1 %in% test_df$value_3, test_df$value_3,
                
                # if none of the values in value_1 match the other columns, fill in with the first value_1 list value
                ifelse(!is.na(test_df$value_1) & !(test_df$value_1 %in% test_df$value_2) & !(test_df$value_1 %in% test_df$value_3), test_df$value_1, #NOTE: have tried test_df$value_1[1] and test_df$value_1[[1]] without success to get the first list item returned
                       
                       # if value_1 is NA, fill in with value_2
                       ifelse(is.na(test_df$value_1) & !is.na(test_df$value_2), test_df$value_2,
                              
                              # if value_1 is NA and value_2 is NA, fill in with value_3
                              ifelse(is.na(test_df$value_1) & is.na(test_df$value_2) & !is.na(test_df$value_3), test_df$value_3, NA
         
         
  )))))

The result has some problems:

  identifier                   value_1 value_2 value_3               final_value
1          A          1231811, 1231877 1231811 1231877          1231811, 1231877
2          B 1231911, 1233069, 1232767  190477 1233069 1231911, 1233069, 1232767
3          C                   1231919  922661 9774041                   1231919
4          D                        NA  950711 9774041                    950711
5          E          1232135, 1233145  992647 1314063          1232135, 1233145
6          F                        NA      NA 1231379                   1231379

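The first three branches of the ifelse do not work as intended. It fails to return the matching value_2 or value_3 value in final_value, and I also cannot get it to return the first list item from value_1 when no value_2 or value_3 value matches. For the latter, I have tried specifying test_df$value_1[[1]][1] (and similar), but this just returns the first item of identifier A's value_1 list for every row that reaches that branch: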

  identifier                   value_1 value_2 value_3 final_value
1          A          1231811, 1231877 1231811 1231877     1231811
2          B 1231911, 1233069, 1232767  190477 1233069     1231811
3          C                   1231919  922661 9774041     1231811
4          D                        NA  950711 9774041      950711
5          E          1232135, 1233145  992647 1314063     1231811
6          F                        NA      NA 1231379     1231379

Any help would be greatly appreciated.

Answer 1

First, seeing ifelse nested more than two levels deep would normally lead me to suggest case_when. In this case, though, I think there is a better solution:

func <- function(A, ...) {
  if (length(A) == 1L && is.na(A)) {
    if (length(list(...))) na.omit(unlist(list(...)))[1] else NA
  } else {
    L <- lapply(list(...), intersect, x = A)
    L <- c(L[lengths(L) > 0], A)
    L[[1]][1]
  }
}

library(dplyr)
test_df %>%
  mutate(
    final_value = mapply(func, strsplit(value_1, "[, ]+"), value_2, value_3)
  )
#   identifier                   value_1 value_2 value_3 final_value
# 1          A          1231811, 1231877 1231811 1231877     1231811
# 2          B 1231911, 1233069, 1232767  190477 1233069     1233069
# 3          C                   1231919  922661 9774041     1231919
# 4          D                      <NA>  950711 9774041      950711
# 5          E          1232135, 1233145  992647 1314063     1232135
# 6          F                      <NA>      NA 1231379     1231379

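Because I used ... in func, it handles "zero or more" other value_* variables as needed; whether you have 3 of them or more than 30, the same logic applies. Also, the order within ... matters: the arguments listed earlier get higher priority in matching.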
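To make the point about argument count and order concrete, here is a small sketch that calls func directly. The vector A and the values 20, 30 and 7 are made up for illustration and do not come from the question's data; func is copied from the answer so the snippet runs on its own.

```r
# func as defined in the answer, repeated so this snippet is self-contained
func <- function(A, ...) {
  if (length(A) == 1L && is.na(A)) {
    if (length(list(...))) na.omit(unlist(list(...)))[1] else NA
  } else {
    L <- lapply(list(...), intersect, x = A)
    L <- c(L[lengths(L) > 0], A)
    L[[1]][1]
  }
}

# Hypothetical row whose value_1 items match both of the other columns
A <- c("10", "20", "30")

res_23   <- func(A, 20, 30)  # 20 is listed first, so its match is preferred
res_32   <- func(A, 30, 20)  # arguments swapped, so the match on 30 is preferred
res_none <- func(A)          # no other columns: falls back to the first item of A
res_na   <- func(NA_character_, NA, NA, 7)  # value_1 missing: first non-NA of the rest
```

Because A is a character vector here, the selected value may come back as a string rather than a number; wrapping the result in as.numeric() is one way to normalize it.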

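c(L[lengths(L) > 0], A) ensures that (1) we only consider the value_* columns that have a non-empty intersection with A (the first part), and (2) if all of those intersections are empty, we fall back to the values in A. (In the event that A is NA and all of the value_* are missing as well, then ... you get NA.)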
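The fallback logic can be traced by hand. The sketch below runs the same three steps outside of func for identifiers B and E from the question; the intermediate names (others_B, n_matches_B, candidates_B and so on) are mine and exist only for illustration.

```r
# Row B: value_1 after strsplit, plus value_2 and value_3 for that row
A_B      <- c("1231911", "1233069", "1232767")
others_B <- list(190477, 1233069)

# One intersection per other column; a zero-length result means "no match"
L_B         <- lapply(others_B, intersect, x = A_B)
n_matches_B <- lengths(L_B)

# Keep only the non-empty intersections, then append A as fallback candidates
candidates_B <- c(L_B[n_matches_B > 0], A_B)
picked_B     <- candidates_B[[1]][1]

# Row E: nothing matches, so the candidate list is just the items of A
A_E      <- c("1232135", "1233145")
others_E <- list(992647, 1314063)
L_E          <- lapply(others_E, intersect, x = A_E)
candidates_E <- c(L_E[lengths(L_E) > 0], A_E)
picked_E     <- candidates_E[[1]][1]
```

For row B only the second intersection is non-empty, so the value shared with value_3 ends up first in the candidate list. For row E both intersections are empty, so the first item of value_1 is picked.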

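FYI, one of the internal steps uses strsplit to split the comma-separated string of numbers into a list column. If you are going to do more operations like this that need to work on the individual components, you may prefer to use mutate(value_1 = strsplit(value_1, "[ ,]+")) (or similar) and keep value_1 as a list column.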
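As a rough sketch of what keeping the list column looks like, the snippet below uses a base R assignment in place of mutate so that it has no package dependencies. It rebuilds only the first four rows of the question's data, and the derived names n_items and first_item are hypothetical.

```r
# First four rows of the question's data, value_1 still a comma-separated string
test_df <- data.frame(
  identifier = c("A", "B", "C", "D"),
  value_1 = c("1231811, 1231877", "1231911, 1233069, 1232767", "1231919", NA),
  stringsAsFactors = FALSE
)

# Base R counterpart of mutate(value_1 = strsplit(value_1, "[ ,]+"))
test_df$value_1 <- strsplit(test_df$value_1, "[ ,]+")

# Per-row work on the individual components is now straightforward
n_items    <- lengths(test_df$value_1)  # an NA entry still counts as one item
first_item <- vapply(test_df$value_1, function(v) as.numeric(v[1]), numeric(1))
```

With value_1 stored this way, func can be given the column directly in the mapply call instead of splitting the string on every use.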

r

string-matching