Delete rows with matching words in the same column, and matching values in multiple columns

Question

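I have a data frame (data3) with more than 20,000 rows, and one of its columns is named "collector". This column holds strings of words, for example "Ruiz Galvis Marta". I need to compare each row with every other row in the data frame and delete the rows where one or more words in df$collector match words in the same column of another row, and where the values in the "sample" and "number" columns also match. For example: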

INPUT:

Collector            Times   sample   number
Ruiz Galvis Marta    9       SP.1     one
Smith et al Marta    8       SP.2     two
Ruiz Andres Allan    4       SP.1     one


EXPECTED OUTPUT

Collector            Times   sample   number
Smith et al Marta    8       SP.2     two

Thanks for your help!

Answer 1

This may be very slow, but:

dd <- data.frame(Collector = c('Ruiz Galvis Marta', 'Smith et al Marta', 'Ruiz Andres Allan'),
                 stringsAsFactors = FALSE)

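## create a matrix with the words by column
## (the backslash must be doubled inside an R string: '\\s+')
tt <- strsplit(dd$Collector, '\\s+')
## pad every word vector with NA to the same length, then stack them as rows
mm <- do.call('rbind', lapply(tt, `length<-`, max(lengths(tt))))

## remove all duplicates: keep only rows in which no word position
## holds a value that also occurs at that position in another row
dd[rowSums(apply(mm, 2, function(x)
  duplicated(x) | duplicated(x, fromLast = TRUE))) == 0, ]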

# [1] "Smith et al Marta"
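
Note that the code above compares words position by position and does not look at the "sample" or "number" columns. The following is only a minimal sketch of one way to include them, not a tested solution for the full 20,000-row data. It assumes the column names shown in the question's example table (Collector, sample, number); the helper names words and remove_row are made up for illustration. Rows are grouped by (sample, number), and a row is dropped when any of its words also occurs in another row of the same group, regardless of word position.

```r
# Example data from the question (column names as in the example table)
data3 <- data.frame(
  Collector = c("Ruiz Galvis Marta", "Smith et al Marta", "Ruiz Andres Allan"),
  Times     = c(9, 8, 4),
  sample    = c("SP.1", "SP.2", "SP.1"),
  number    = c("one", "two", "one"),
  stringsAsFactors = FALSE
)

# Split each collector string into its unique words
words <- lapply(strsplit(data3$Collector, "\\s+"), unique)

# Row indices grouped by identical (sample, number) pairs
groups <- split(seq_len(nrow(data3)),
                list(data3$sample, data3$number), drop = TRUE)

remove_row <- logical(nrow(data3))
for (idx in groups) {
  if (length(idx) < 2) next  # a row alone in its group cannot match anything
  # Because words are unique per row, a count above 1 means
  # the word occurs in at least two different rows of the group
  counts <- table(unlist(words[idx]))
  shared <- names(counts)[counts > 1]
  remove_row[idx] <- vapply(words[idx],
                            function(w) any(w %in% shared), logical(1))
}

result <- data3[!remove_row, ]
result
```

For the example data this leaves only the "Smith et al Marta" row: it shares the word "Marta" with the first row, but its sample and number values differ, so it is kept.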

r

string-matching

delete-row