How do I perform a join using multiple columns as different string criteria?

Question

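I want to perform a complex join that treats several columns as different kinds of criteria.

I want to assign each fruit a category based on the strings it must contain, the strings it may contain, and the strings it must not contain.

I have a vector of fruits: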

head(fruit) 
[1] "apple"       "apricot"     "avocado"     "banana"      "bell pepper" "bilberry"

The assignment criteria for each fruit are detailed here:

 fruitAssignment <- data.frame(assignment = c('Apple','Berry','Black','Melon','Melon','Melon','Currant'),
       contains = c('apple','berry','black','honeydew','melon','cantaloupe','currant'),
       mayContain = c(NA,'black',NA,NA,NA,NA,NA),
       doesNotContain = c(NA,NA,'berry',NA,NA,NA,NA))

  assignment   contains mayContain doesNotContain
1      Apple      apple       <NA>           <NA>
2      Berry      berry      black           <NA>
3      Black      black       <NA>          berry
4      Melon   honeydew       <NA>           <NA>
5      Melon      melon       <NA>           <NA>
6      Melon cantaloupe       <NA>           <NA>
7    Currant    currant       <NA>           <NA>

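Exceptions:

If no assignment matches, I want the fruit to simply be assigned 'Fruit'.
If more than one assignment matches, I also want it assigned 'Fruit'.
The criteria should be case-insensitive.

So a sample of the joined result would look like this: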

 dplyr::sample_n(fruit, size=5)
         fruit assignment
1   redcurrant    Currant
2 blackcurrant      Fruit
3    pineapple      Apple
4   blackberry      Berry
5      coconut      Fruit

Any package is fine.

Answer 1

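I don't think a join is the right fit here; this is more of a classification task. Use regular expressions to find matches between the search terms and the classification table: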

fruit <- c("redcurrant", "blackcurrant", "pineapple", "blackberry", "coconut")

fruitAssignment <- data.frame(assignment = c('Apple','Berry','Black','Melon','Melon','Melon','Currant'),
                              contains = c('apple','berry','black','honeydew','melon','cantaloupe','currant'),
                              mayContain = c(NA,'black',NA,NA,NA,NA,NA),
                              doesNotContain = c(NA,NA,'berry',NA,NA,NA,NA),
                              stringsAsFactors = FALSE)

library(dplyr)
library(tibble)

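fun <- function(fruit, fruitAssignment) {

  # Replace each pattern in the three criteria columns with TRUE/FALSE
  # (or NA for an NA pattern) according to whether it occurs in `fruit`.
  fruitAssignment[,2:4] <- apply(fruitAssignment[,2:4],
                                 2,
                                 function(x, fruit) sapply(x, grepl, fruit, ignore.case = TRUE),
                                 fruit = fruit)
  # Missing criteria count as "no match".
  fruitAssignment[is.na(fruitAssignment)] <- FALSE

  # Keep rules whose exclusion pattern is absent and whose
  # `contains` or `mayContain` pattern is present.
  x <- fruitAssignment %>%
    filter(!doesNotContain, contains | mayContain)

  # Exactly one surviving rule: use it. None or several: fall back to "Fruit".
  if (nrow(x) == 1)
    return(x$assignment)
  "Fruit"

}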

sapply(fruit, fun, fruitAssignment) %>%
  enframe() %>%
  setNames(c("fruit", "assignment"))

# A tibble: 5 x 2
  fruit        assignment
  <chr>        <chr>     
1 redcurrant   Currant   
2 blackcurrant Fruit     
3 pineapple    Apple     
4 blackberry   Berry     
5 coconut      Fruit
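
Note that the function above treats `mayContain` as an alternative to `contains`, and counts two matching rows with the same label as "multiple assignments", so a name like "honeydew melon" would fall back to 'Fruit'. If that is not what you want, here is a minimal base R sketch of a variation. It rests on assumptions that are mine, not stated in the question: `contains` is required, `doesNotContain` excludes a rule, `mayContain` is optional and therefore ignored, and duplicate labels are collapsed before counting. The name `classify_fruit` is hypothetical.

```r
fruitAssignment <- data.frame(
  assignment     = c('Apple','Berry','Black','Melon','Melon','Melon','Currant'),
  contains       = c('apple','berry','black','honeydew','melon','cantaloupe','currant'),
  mayContain     = c(NA,'black',NA,NA,NA,NA,NA),
  doesNotContain = c(NA,NA,'berry',NA,NA,NA,NA),
  stringsAsFactors = FALSE)

# Hypothetical helper: classify a single fruit name against the rule table.
classify_fruit <- function(fruit, rules, default = "Fruit") {
  # TRUE where a pattern occurs in `fruit`; NA patterns never match.
  hit <- function(patterns) {
    vapply(patterns,
           function(p) !is.na(p) && grepl(p, fruit, ignore.case = TRUE),
           logical(1), USE.NAMES = FALSE)
  }
  # Assumption: `contains` must match and `doesNotContain` must not.
  keep <- hit(rules$contains) & !hit(rules$doesNotContain)
  # Collapse rows that share a label before counting the matches.
  matched <- unique(as.character(rules$assignment[keep]))
  if (length(matched) == 1) matched else default
}

fruit <- c("redcurrant", "blackcurrant", "pineapple",
           "blackberry", "coconut", "honeydew melon")

result <- data.frame(
  fruit = fruit,
  assignment = vapply(fruit, classify_fruit, character(1),
                      rules = fruitAssignment, USE.NAMES = FALSE),
  stringsAsFactors = FALSE)
print(result)
```

For the five fruits in the question this gives the same assignments as above; the only difference in this sample is that "honeydew melon" is labelled Melon rather than 'Fruit'.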

join

r

sqldf