How to compare a data frame to a list, and return values in the data frame matching the list?

Total newbie R question. I have a data frame df of IDs and notes:

ID    Notes
1     dogs are friendly
2     dogs and cats are pets
3     cows live on farms
4     cats and cows start with c

I also have another list of values, "animals":

cats
cows

I want to add another column, "Matches", to my data frame, containing every animal that appears in the notes, e.g.

ID    Notes                        Matches
1     dogs are friendly            
2     dogs and cats are pets       cats
3     cows live on farms           cows
4     cats and cows start with c   cats, cows

So far, the only luck I've had is using grepl to return whether there is any match at all:

grepl(paste(animals,collapse="|"),df$Notes,ignore.case = T)

How can I return the matched values instead?

Update
Some rows in my data frame contain multiple instances of the same animal in the notes, for example cats:

ID    Notes                             Matches
1     dogs are friendly            
2     dogs and cats are pets            cats
3     cows live on farms                cows
4     cats and cats cows start with c   cats, cows

I only want to return one instance of each match. @LachlanO's solution got me very close, but I get:

[1] "NA, NA"                      "cats, NA"                    "NA, cows"                    "c(\"cats\", \"cats\"), cows"

How can I return only the distinct matches?

Edit: added a unique operation to handle duplicate matches.

I can get you started and then point you in the right direction :)

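The code below uses stringr::str_extract_all to pull out the relevant pieces, but unfortunately it also leaves us with bits we don't need, most noticeably when there is no match at all. The unique function in the middle of our custom function just makes sure we keep only unique matches, element by element.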

ID = seq(1,4)
Notes <- c(
  "dogs are friendly",
  "dogs and cats are pets",
  "cows live on farms",
  "cats and cows start with c "
)
df <- data.frame(ID, Notes)

animals = c("cats", "cows")

matches <- as.data.frame(sapply(animals, function(x){sapply(stringr::str_extract_all(df$Notes, x), unique)}, simplify = TRUE))
matches[matches == "character(0)"] <- NA

apply(matches, 1, paste, collapse = ", ")
[1] "NA, NA"     "cats, NA"   "NA, cows"   "cats, cows"

You could set this as your extra column, but it isn't very nice because of those NAs. If there were a paste function that ignored NAs, we'd be set.

Luckily, another user has already solved this problem :) Check out this answer here.

Combining the above should give you a suitable solution!
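
As a minimal sketch of that combination: the helper below drops NAs before pasting. The name paste_no_na is hypothetical (it is not taken from the linked answer), and the small character matrix is made-up stand-in data with the same shape as the match table above.

```r
# Hypothetical helper: paste the non-NA elements of x into one string
paste_no_na <- function(x, collapse = ", ") {
  paste(x[!is.na(x)], collapse = collapse)
}

# Stand-in for the per-animal match table (one column per animal, NA = no match)
m <- matrix(c(NA, "cats", NA, "cats",
              NA, NA, "cows", "cows"),
            ncol = 2, dimnames = list(NULL, c("cats", "cows")))

# Collapse each row, skipping the NAs; rows with no match become ""
result <- apply(m, 1, paste_no_na)
result
```

Rows without any match end up as an empty string rather than "NA, NA", so the result can be assigned directly as the new column.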

I would do it like this:

animals = c("cats", "cows")
reg = paste(animals, collapse = "|")

library(stringr)
matches = str_extract_all(df$Notes, reg)
matches = lapply(matches, unique)
matches = sapply(matches, paste, collapse = ",")

df$matches = matches
df
#   ID                       Notes   matches
# 1  1           dogs are friendly          
# 2  2      dogs and cats are pets      cats
# 3  3          cows live on farms      cows
# 4  4 cats and cows start with c  cats,cows

If you want to get fancy, you can paste word boundaries onto the regex, e.g. reg = paste0("\\b", animals, "\\b", collapse = "|"), to avoid extracting the middle of words.
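
To see what the word boundaries change, here is a small sketch. It uses base R (regmatches with gregexpr) instead of stringr so that it runs without extra packages; the sample notes and the helper name extract_unique are made up for illustration.

```r
animals <- c("cats", "cows")
# Made-up notes: "scows" contains "cows" in the middle of a word
notes <- c("scows and cats", "cats and cows")

plain   <- paste(animals, collapse = "|")                  # cats|cows
bounded <- paste0("\\b", animals, "\\b", collapse = "|")   # \bcats\b|\bcows\b

# Hypothetical helper: extract all matches per element, keep distinct ones
extract_unique <- function(pattern, x) {
  found <- regmatches(x, gregexpr(pattern, x, perl = TRUE))
  sapply(found, function(m) paste(unique(m), collapse = ","))
}

extract_unique(plain, notes)    # picks up "cows" inside "scows"
extract_unique(bounded, notes)  # only whole-word matches
```

Note the doubled backslashes: in an R string "\b" is a backspace character, so the regex escape has to be written as "\\b".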


Using the data provided by LachlanO:

ID = seq(1,4)
Notes <- c(
  "dogs are friendly",
  "dogs and cats are pets",
  "cows live on farms",
  "cats and cows start with c "
)
df <- data.frame(ID, Notes)

You can use gsub to get all the animals at once:

gsub(".*?(cows|cats )|.*","\\1",do.call(paste,df),perl = T)
[1] ""          "cats "     "cows"      "cats cows"

So, written in one pass:

transform(df,matches=gsub(".*?(cows|cats )|.*","\\1",do.call(paste,df),perl = T))
  ID                       Notes   matches
1  1           dogs are friendly          
2  2      dogs and cats are pets     cats 
3  3          cows live on farms      cows
4  4 cats and cows start with c  cats cows
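
If the animals live in a vector rather than being hard-coded in the pattern, the same gsub idea can be sketched as follows. This is a variation under assumptions, not the code above: the pattern is built from animals, word boundaries are added, each match is followed by a space in the replacement, and trimws removes the leftover padding. The variable names are illustrative.

```r
animals <- c("cats", "cows")
notes <- c(
  "dogs are friendly",
  "dogs and cats are pets",
  "cows live on farms",
  "cats and cows start with c "
)

# Build ".*?\b(cats|cows)\b|.*" from the vector
alternation <- paste(animals, collapse = "|")
pattern <- paste0(".*?\\b(", alternation, ")\\b|.*")

# Keep each captured animal followed by a space; everything else becomes padding
raw <- gsub(pattern, "\\1 ", notes, perl = TRUE)

# Strip leading/trailing whitespace left by the padding
matches <- trimws(raw)
matches
```

Unlike the unique-based approaches, this does not remove repeated animals: a note that mentions cats twice would yield "cats cats".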