R function that detects whether a dataframe column contains string values from another dataframe column and adds a column containing the detected string

I have two data frames:

df1:

name
Apple page
Mango page
Lychee juice
Cranberry club

df2:

fruit
Apple
Grapes
Strawberry
Mango
lychee
cranberry

If df1$name contains a value from df2$fruit (ignoring case), I want to add a column to df1 holding the df2$fruit value that df1$name contains. df1 would then look like this:

name            category
Apple page      Apple
Mango page      Mango
Lychee juice    lychee
Cranberry club  cranberry

This should work:

library(stringr)
df1$category = str_extract(
  df1$name, 
  pattern = regex(paste(df2$fruit, collapse = "|"), ignore_case = TRUE)
)

df1
#             name  category
# 1     Apple page     Apple
# 2     Mango page     Mango
# 3   Lychee juice    Lychee
# 4 Cranberry club Cranberry

Using this data:

df1 = read.table(text = 'name
Apple page
Mango page
Lychee juice
Cranberry club', header = TRUE, sep = ";")

df2 = read.table(text = 'fruit
Apple
Grapes
Strawberry
Mango
lychee
cranberry', header = TRUE, sep = ";")
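
One caveat: str_extract() returns the text as it appears in df1$name, so the last two rows come back as Lychee and Cranberry rather than the lychee and cranberry spellings stored in df2$fruit, which is what the question asks for. A minimal sketch of one way to recover the df2 spelling, assuming each fruit appears only once in df2$fruit when case is ignored (the helper variable hit is just an illustrative name):

```r
library(stringr)

df1 <- data.frame(
  name = c("Apple page", "Mango page", "Lychee juice", "Cranberry club"),
  stringsAsFactors = FALSE
)
df2 <- data.frame(
  fruit = c("Apple", "Grapes", "Strawberry", "Mango", "lychee", "cranberry"),
  stringsAsFactors = FALSE
)

# Extract the first fruit found in each name, ignoring case.
# The result uses the spelling found in df1$name (e.g. "Lychee").
hit <- str_extract(
  df1$name,
  regex(paste(df2$fruit, collapse = "|"), ignore_case = TRUE)
)

# Look each match up in df2$fruit case-insensitively to get df2's own spelling.
# Names with no match stay NA.
df1$category <- df2$fruit[match(tolower(hit), tolower(df2$fruit))]

df1$category
# "Apple" "Mango" "lychee" "cranberry"
```

Like the original, this builds the pattern by pasting the fruit names together, so it assumes they contain no regex metacharacters.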

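First, you can create one placeholder column for each possible category, named after the category and filled only with NA. Then, for each of these columns, check whether the column name (i.e. the category) appears in name. Pivot the result into a long data frame and drop the FALSE rows, i.e. those where the category was not detected in name.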

library(tidyverse)

df1 <- tribble(
  ~name,
  "Apple page",
  "Mango page",
  "Lychee juice",
  "Cranberry club"
)
df2 <- tribble(
  ~fruit,
  "Apple",
  "Grapes",
  "Strawberry",
  "Mango",
  "lychee",
  "cranberry"
)

fruits <- df2$fruit %>%
  str_to_lower() %>% 
  set_names(rep(NA_character_, length(.)), .)

df1 %>% 
  add_column(!!!fruits) %>% 
  mutate(across(-name, ~str_detect(str_to_lower(name), cur_column()))) %>% 
  pivot_longer(-name, names_to = "category") %>% 
  filter(value) %>% 
  select(-value)

#> # A tibble: 4 × 2
#>   name           category 
#>   <chr>          <chr>    
#> 1 Apple page     apple    
#> 2 Mango page     mango    
#> 3 Lychee juice   lychee   
#> 4 Cranberry club cranberry
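
The same idea (test every name against every category, keep only the pairs that match) can also be sketched with a cross join instead of placeholder columns. This is an alternative sketch, not the approach above; it assumes dplyr 1.1.0 or later for cross_join(), and the name result is only illustrative. Unlike the lower-cased version, it keeps the spelling from df2$fruit.

```r
library(dplyr)
library(stringr)

df1 <- data.frame(
  name = c("Apple page", "Mango page", "Lychee juice", "Cranberry club"),
  stringsAsFactors = FALSE
)
df2 <- data.frame(
  fruit = c("Apple", "Grapes", "Strawberry", "Mango", "lychee", "cranberry"),
  stringsAsFactors = FALSE
)

result <- df1 %>%
  # Pair every name with every fruit (4 x 6 = 24 rows).
  cross_join(df2) %>%
  # Keep pairs where the lower-cased name contains the lower-cased fruit
  # as a literal (non-regex) substring.
  filter(str_detect(str_to_lower(name), fixed(str_to_lower(fruit)))) %>%
  # Rename fruit to category, keeping df2's original spelling.
  select(name, category = fruit)

result$category
# "Apple" "Mango" "lychee" "cranberry"
```

As with the pivot version, a name that matches several fruits yields one row per match, and a name that matches none is dropped.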