Create yes/no column based on values in two other columns

Question

I have a dataset that looks like this:

df <- structure(list(ID = 1:10, Region1 = c("Europe", "NA", 
"Asia", "NA", "Europe", "NA", "Africa", "NA", "Europe", "North America"), Region2 = c("NA", "Europe", 
"NA", "NA", "NA", "Europe", 
"NA", "NA", "NA", "NA"
)), 
class = "data.frame", row.names = c(NA, -10L))

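I want to create a new column called EuropeYN that is "yes" or "no" depending on whether either of the region columns (Region1 or Region2) contains "Europe". The final data should look like this: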

df <- structure(list(ID = 1:10, Region1 = c("Europe", "NA", 
"Asia", "NA", "Europe", "NA", "Africa", "NA", "Europe", "North America"), Region2 = c("NA", "Europe", 
"NA", "NA", "NA", "Europe", 
"NA", "NA", "NA", "NA"
), EuropeYN = c("yes", "yes", "no", "no", "yes", "yes", "no", "no", "yes", "no")), 
class = "data.frame", row.names = c(NA, -10L))

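I know how to do this when checking whether "Europe" appears in a single column, but not when checking across multiple columns. With only one column, I would do this: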

df$EuropeYN <- ifelse(grepl("Europe", df$Region1), "yes", "no")

Any ideas on the best way to go about this?...

Answer 1

My approach is very similar to yours:

dplyr::mutate(df, EuropeYN = ifelse((Region1 == "Europe" | Region2 == "Europe"), "yes", "no"))

Answer 2

Two ways:

Literally check each of the two columns:

ifelse(df$Region1 == "Europe" | df$Region2 == "Europe", "yes", "no")
#  [1] "yes" "yes" "no"  "no"  "yes" "yes" "no"  "no"  "yes" "no"

This has the advantage of being easier to read (subjective) and very clear.

Select a range of columns and test for equality:

subset(df, select = Region1:Region2) == "Europe"
#    Region1 Region2
# 1     TRUE   FALSE
# 2    FALSE    TRUE
# 3    FALSE   FALSE
# 4    FALSE   FALSE
# 5     TRUE   FALSE
# 6    FALSE    TRUE
# 7    FALSE   FALSE
# 8    FALSE   FALSE
# 9     TRUE   FALSE
# 10   FALSE   FALSE

apply(subset(df, select = Region1:Region2) == "Europe", 1, any)
#     1     2     3     4     5     6     7     8     9    10 
#  TRUE  TRUE FALSE FALSE  TRUE  TRUE FALSE FALSE  TRUE FALSE

This lets us use one or more columns.

Either of these can be assigned back into the frame with df$EuropeYN <- ... .
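As a minimal sketch of that last step, the row-wise check can be wrapped in ifelse() to produce the "yes"/"no" strings the question asks for and then assigned back. The data frame is the one from the question; has_europe is just an illustrative name, and unname() is used only so the new column does not carry the row names that apply() returns.

```r
# Data from the question ("NA" here is the literal string, not a missing value)
df <- data.frame(
  ID = 1:10,
  Region1 = c("Europe", "NA", "Asia", "NA", "Europe",
              "NA", "Africa", "NA", "Europe", "North America"),
  Region2 = c("NA", "Europe", "NA", "NA", "NA",
              "Europe", "NA", "NA", "NA", "NA")
)

# TRUE for rows where any selected column equals "Europe"
has_europe <- unname(apply(subset(df, select = Region1:Region2) == "Europe", 1, any))

# Convert the logical vector to "yes"/"no" and store it as a new column
df$EuropeYN <- ifelse(has_europe, "yes", "no")
```

The same ifelse() wrapper works around the first variant too, since both produce one logical value per row.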

Answer 3

Here is a vectorized base R way.

i <- rowSums(df[grep("Region", names(df))] == "Europe") > 0
df$EuropeYN <- c("no", "yes")[i + 1L]
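The second line relies on logical-to-integer coercion: i + 1L turns FALSE into 1 and TRUE into 2, and those integers then index the two-element vector c("no", "yes"). Here is a small sketch of that step in isolation, using a made-up logical vector (flag, idx and labels are hypothetical names, not part of the answer):

```r
# Hypothetical logical vector standing in for `i`
flag <- c(TRUE, FALSE, TRUE)

# FALSE + 1L is 1, TRUE + 1L is 2
idx <- flag + 1L

# Position 1 picks "no", position 2 picks "yes"
labels <- c("no", "yes")[idx]
labels
# [1] "yes" "no"  "yes"
```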

Answer 4

A bit late, but perhaps still worth a look:

library(dplyr)
library(stringr)
df %>%
  rowwise() %>%
  mutate(YN = +any(str_detect(c_across(Region1:Region2), 'Europe')))
# A tibble: 10 x 4
# Rowwise: 
      ID Region1       Region2    YN
   <int> <chr>         <chr>   <int>
 1     1 Europe        NA          1
 2     2 NA            Europe      1
 3     3 Asia          NA          0
 4     4 NA            NA          0
 5     5 Europe        NA          1
 6     6 NA            Europe      1
 7     7 Africa        NA          0
 8     8 NA            NA          0
 9     9 Europe        NA          1
10    10 North America NA          0

Or, without the +:

df %>%
   rowwise() %>%
   mutate(YN = any(str_detect(c_across(Region1:Region2), 'Europe')))
# A tibble: 10 x 4
# Rowwise: 
      ID Region1       Region2 YN   
   <int> <chr>         <chr>   <lgl>
 1     1 Europe        NA      TRUE 
 2     2 NA            Europe  TRUE 
 3     3 Asia          NA      FALSE
 4     4 NA            NA      FALSE
 5     5 Europe        NA      TRUE 
 6     6 NA            Europe  TRUE 
 7     7 Africa        NA      FALSE
 8     8 NA            NA      FALSE
 9     9 Europe        NA      TRUE 
10    10 North America NA      FALSE

If you want the mutate to span several columns, you can use starts_with (or contains or ends_with) to select them:

df %>%
  rowwise() %>%
  mutate(YN = any(str_detect(c_across(starts_with('R')), 'Europe')))

Answer 5

We can use if_any here as a vectorized option in the tidyverse:

library(dplyr)
library(stringr)
df %>%
     mutate(YN = if_any(starts_with("Region"), str_detect, 'Europe'))
   ID       Region1 Region2    YN
1   1        Europe      NA  TRUE
2   2            NA  Europe  TRUE
3   3          Asia      NA FALSE
4   4            NA      NA FALSE
5   5        Europe      NA  TRUE
6   6            NA  Europe  TRUE
7   7        Africa      NA FALSE
8   8            NA      NA FALSE
9   9        Europe      NA  TRUE
10 10 North America      NA FALSE

Or in base R:

df$YN <-  Reduce(`|`, lapply(df[startsWith(names(df), 'Region')], 
        `%in%`, 'Europe'))

NOTE: A logical flag is easier to subset with than "Yes"/"No".
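To illustrate the note, here is a rough sketch comparing the two kinds of column when filtering rows. It rebuilds the question's data and reuses the base R approach from this answer; the names europe_rows and europe_rows_str are made up for the example.

```r
# Data from the question ("NA" here is the literal string, not a missing value)
df <- data.frame(
  ID = 1:10,
  Region1 = c("Europe", "NA", "Asia", "NA", "Europe",
              "NA", "Africa", "NA", "Europe", "North America"),
  Region2 = c("NA", "Europe", "NA", "NA", "NA",
              "Europe", "NA", "NA", "NA", "NA")
)

# Logical flag, computed as in the base R snippet above
df$YN <- Reduce(`|`, lapply(df[startsWith(names(df), "Region")], `%in%`, "Europe"))

# A logical column can be used directly as a row index
europe_rows <- df[df$YN, ]

# A "yes"/"no" column needs an extra comparison before it can index rows
df$EuropeYN <- ifelse(df$YN, "yes", "no")
europe_rows_str <- df[df$EuropeYN == "yes", ]
```

Both subsets contain the same rows; the logical version just skips the string comparison.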
