R function that detects if a dataframe column contains string values from another dataframe column and adds a column that contains the detected string
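I have two data frames:
df1:
name |
---|
Apple page |
Mango page |
Lychee juice |
Cranberry club |
df2:
fruit |
---|
Apple |
Grapes |
Strawberry |
Mango |
lychee |
cranberry |
If df1$name contains a value from df2$fruit (case-insensitive), I want to add a column to df1 that holds the df2$fruit value that df1$name contains. df1 would then look like this:
name | category |
---|---|
Apple page | Apple |
Mango page | Mango |
Lychee juice | lychee |
Cranberry club | cranberry |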
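This should work:
library(stringr)
# Join all fruits into one alternation pattern ("Apple|Grapes|...") and
# match it case-insensitively. str_extract() returns the text as it is
# written in df1$name, so the case follows name rather than df2$fruit.
df1$category = str_extract(
  df1$name,
  pattern = regex(paste(df2$fruit, collapse = "|"), ignore_case = TRUE)
)
df1
#             name  category
# 1     Apple page     Apple
# 2     Mango page     Mango
# 3   Lychee juice    Lychee
# 4 Cranberry club Cranberry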
Using this data:
# sep = ";" keeps each line as a single field, so "Apple page" is not split on the space
df1 = read.table(text = 'name
Apple page
Mango page
Lychee juice
Cranberry club', header = TRUE, sep = ";")
df2 = read.table(text = 'fruit
Apple
Grapes
Strawberry
Mango
lychee
cranberry', header = TRUE, sep = ";")
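Note that str_extract() returns the match exactly as it is written in df1$name, which is why the output above shows Lychee and Cranberry rather than the lychee and cranberry spellings stored in df2$fruit. If the spelling from df2 is what is needed, one possible follow-up (a sketch, not part of the answer above) is to look up the lowercased match in the lowercased fruit column with base R's match(). The variable name found is made up for this example.

```r
library(stringr)

df1 <- data.frame(
  name = c("Apple page", "Mango page", "Lychee juice", "Cranberry club"),
  stringsAsFactors = FALSE
)
df2 <- data.frame(
  fruit = c("Apple", "Grapes", "Strawberry", "Mango", "lychee", "cranberry"),
  stringsAsFactors = FALSE
)

# Case-insensitive extraction, as in the answer above
found <- str_extract(
  df1$name,
  regex(paste(df2$fruit, collapse = "|"), ignore_case = TRUE)
)

# Look each match up in df2$fruit, ignoring case, to recover df2's own spelling.
# Names without any match are NA in `found` and stay NA after the lookup.
df1$category <- df2$fruit[match(tolower(found), tolower(df2$fruit))]
df1$category
```

With the sample data this gives the category values the question asks for, including the lowercase lychee and cranberry.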
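First, you can add one column to the data frame for each possible category, named after the category and filled only with NA as a placeholder. Then, for each of those columns, check whether the column name (i.e. the category) occurs in name. Pivot the result into a long data frame and drop the FALSE rows, which are the ones where the category was not detected in the name.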
library(tidyverse)
df1 <- tribble(
  ~name,
  "Apple page",
  "Mango page",
  "Lychee juice",
  "Cranberry club"
)
df2 <- tribble(
  ~fruit,
  "Apple",
  "Grapes",
  "Strawberry",
  "Mango",
  "lychee",
  "cranberry"
)
# Named vector of NA placeholders, one element per lowercased fruit
fruits <- df2$fruit %>%
  str_to_lower() %>%
  set_names(rep(NA_character_, length(.)), .)
df1 %>%
  add_column(!!!fruits) %>%                       # one placeholder column per fruit
  mutate(across(-name, ~str_detect(str_to_lower(name), cur_column()))) %>%
  pivot_longer(-name, names_to = "category") %>%  # one row per name/fruit pair
  filter(value) %>%                               # keep only the detected pairs
  select(-value)
#> # A tibble: 4 × 2
#>   name           category
#>   <chr>          <chr>
#> 1 Apple page     apple
#> 2 Mango page     mango
#> 3 Lychee juice   lychee
#> 4 Cranberry club cranberry
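The same idea, one logical test per category followed by keeping only the TRUE combinations, can also be sketched without the placeholder columns. The version below is an illustration and not part of the answer above. It uses base R's sapply() and grepl() with fixed = TRUE, so fruit names are treated as literal text rather than regular expressions. The names hits, idx and result are made up for the example.

```r
df1 <- data.frame(
  name = c("Apple page", "Mango page", "Lychee juice", "Cranberry club"),
  stringsAsFactors = FALSE
)
df2 <- data.frame(
  fruit = c("Apple", "Grapes", "Strawberry", "Mango", "lychee", "cranberry"),
  stringsAsFactors = FALSE
)

fruits <- tolower(df2$fruit)

# Logical matrix: one row per name, one column per fruit
hits <- sapply(fruits, function(f) grepl(f, tolower(df1$name), fixed = TRUE))

# Row/column positions of every TRUE cell, i.e. every (name, fruit) match,
# reordered so the result follows the row order of df1
idx <- which(hits, arr.ind = TRUE)
idx <- idx[order(idx[, "row"]), , drop = FALSE]

result <- data.frame(
  name = df1$name[idx[, "row"]],
  category = fruits[idx[, "col"]],
  stringsAsFactors = FALSE
)
result
```

As with the pivot-based version, a name that contains more than one fruit would appear once per detected fruit, and a name with no fruit would be dropped.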