How do I perform a join using multiple columns as different string criteria?
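I want to perform a complex join that treats multiple columns as different kinds of criteria.
I want to assign a category to each fruit based on which strings it contains, which strings it may contain, and which strings it must not contain.
I have a vector of fruits: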
head(fruit)
[1] "apple" "apricot" "avocado" "banana" "bell pepper" "bilberry"
The assignment criteria for each fruit are detailed here:
fruitAssignment <- data.frame(assignment = c('Apple','Berry','Black','Melon','Melon','Melon','Currant'),
contains = c('apple','berry','black','honeydew','melon','cantaloupe','currant'),
mayContain = c(NA,'black',NA,NA,NA,NA,NA),
doesNotContain = c(NA,NA,'berry',NA,NA,NA,NA))
  assignment   contains mayContain doesNotContain
1      Apple      apple       <NA>           <NA>
2      Berry      berry      black           <NA>
3      Black      black       <NA>          berry
4      Melon   honeydew       <NA>           <NA>
5      Melon      melon       <NA>           <NA>
6      Melon cantaloupe       <NA>           <NA>
7    Currant    currant       <NA>           <NA>
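Exceptions:
- If no assignment meets the criteria, I would like to simply assign the fruit as 'Fruit'.
- If multiple assignments meet the criteria, I would also like to assign it as 'Fruit'.
- The criteria should not be case sensitive.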
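To make the rules concrete, here is a minimal sketch of how a single fruit could be tested against a single criteria row. The helper `row_matches` is hypothetical (it is not part of any package), and it assumes that `mayContain` is purely optional and therefore does not affect whether a row matches.

```r
# Hypothetical helper: does one fruit satisfy one criteria row?
# `contains` is required; `does_not_contain` is forbidden (NA means "no rule").
# Assumption: `mayContain` is optional, so it is not checked here.
row_matches <- function(fruit, contains, does_not_contain = NA) {
  has_required <- grepl(contains, fruit, ignore.case = TRUE)
  has_forbidden <- !is.na(does_not_contain) &&
    grepl(does_not_contain, fruit, ignore.case = TRUE)
  has_required && !has_forbidden
}

# The 'Black' row: contains 'black', doesNotContain 'berry'
row_matches("Blackcurrant", "black", "berry")  # TRUE
row_matches("blackberry", "black", "berry")    # FALSE
row_matches("coconut", "black", "berry")       # FALSE
```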
So an example of this join would look like this:
dplyr::sample_n(fruit, size=5)
         fruit assignment
1   redcurrant    Currant
2 blackcurrant      Fruit
3    pineapple      Apple
4   blackberry      Berry
5      coconut      Fruit
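Any package is fine.
I don't think a join is appropriate here; this is more of a classification task. Use regular expressions to look for matches between the search terms and the classification table: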
fruit <- c("redcurrant", "blackcurrant", "pineapple", "blackberry", "coconut")
fruitAssignment <- data.frame(assignment = c('Apple','Berry','Black','Melon','Melon','Melon','Currant'),
contains = c('apple','berry','black','honeydew','melon','cantaloupe','currant'),
mayContain = c(NA,'black',NA,NA,NA,NA,NA),
doesNotContain = c(NA,NA,'berry',NA,NA,NA,NA),
stringsAsFactors = FALSE)
library(dplyr)
library(tibble)
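# Classify a single fruit against every row of the criteria table
fun <- function(fruit, fruitAssignment) {
  # Replace each pattern in columns 2:4 with TRUE/FALSE/NA:
  # does the pattern occur in `fruit` (case-insensitive)?
  fruitAssignment[,2:4] <- apply(fruitAssignment[,2:4],
                                 2,
                                 function(x, fruit) sapply(x, grepl, fruit, ignore.case = TRUE),
                                 fruit = fruit)
  # An NA pattern yields NA; treat it as "no match"
  fruitAssignment[is.na(fruitAssignment)] <- FALSE
  # Keep rows where the forbidden string is absent and either the
  # required or the optional string is present
  x <- fruitAssignment %>%
    filter(!doesNotContain, contains | mayContain)
  # Exactly one surviving row gives its assignment; otherwise fall back to "Fruit"
  if (nrow(x) == 1)
    return(x$assignment)
  "Fruit"
}
# Apply to every fruit and collect the result as a two-column tibble
sapply(fruit, fun, fruitAssignment) %>%
  enframe() %>%
  setNames(c("fruit", "assignment"))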
# A tibble: 5 x 2
  fruit        assignment
  <chr>        <chr>
1 redcurrant   Currant
2 blackcurrant Fruit
3 pineapple    Apple
4 blackberry   Berry
5 coconut      Fruit
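For comparison, the same idea can be sketched in base R without dplyr. This is only an illustration under stated assumptions, not a replacement for the answer above: the names `match_matrix`, `classify_fruit` and `criteria` are made up for this example, `mayContain` is treated as optional and ignored, and a fruit that matches several rows with the same label (for example two 'Melon' rows) keeps that label instead of falling back to 'Fruit'.

```r
criteria <- data.frame(
  assignment     = c("Apple", "Berry", "Black", "Melon", "Melon", "Melon", "Currant"),
  contains       = c("apple", "berry", "black", "honeydew", "melon", "cantaloupe", "currant"),
  doesNotContain = c(NA, NA, "berry", NA, NA, NA, NA),
  stringsAsFactors = FALSE
)

# Logical matrix with one row per pattern and one column per fruit.
# NA patterns never match.
match_matrix <- function(patterns, fruit) {
  out <- vapply(
    patterns,
    function(p) {
      if (is.na(p)) rep(FALSE, length(fruit))
      else grepl(p, fruit, ignore.case = TRUE)
    },
    logical(length(fruit))
  )
  t(matrix(out, nrow = length(fruit)))
}

# Assumption: `mayContain` is optional, so only `contains` (required)
# and `doesNotContain` (forbidden) decide whether a row matches.
classify_fruit <- function(fruit, criteria, default = "Fruit") {
  required  <- match_matrix(criteria$contains, fruit)
  forbidden <- match_matrix(criteria$doesNotContain, fruit)
  ok <- required & !forbidden
  vapply(seq_along(fruit), function(j) {
    labels <- unique(as.character(criteria$assignment[ok[, j]]))
    # Exactly one distinct label wins; zero or several fall back to the default
    if (length(labels) == 1) labels else default
  }, character(1))
}

fruit <- c("redcurrant", "blackcurrant", "pineapple", "blackberry", "coconut")
classify_fruit(fruit, criteria)
# "Currant" "Fruit" "Apple" "Berry" "Fruit"
```

Note that the patterns are interpreted as regular expressions by `grepl`; if a search term could contain regex metacharacters, it would need escaping first.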