Flag for partial match between 2 columns in R

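I have a data frame and need to create a flag that marks the rows where there is a partial match between 2 columns. Here is the code and some dummy data: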

doc_id <- c("doc1","doc1","doc2","doc3","doc3","doc4","doc4")
word <- c("apple","apples","chicken","banana","bananas","veggie","veggies")
text <- c("yesterday I ate apples", "yesterday I ate apples", "yesterday I ate chicken", "yesterday I ate bananas", "yesterday I ate bananas", "yesterday I ate veggies", "yesterday I ate veggies") 
mydata <- data.frame(doc_id,word,text,stringsAsFactors = FALSE)

The expected result is the same data frame with an additional column showing whether the match between word and text is a partial match:

doc_id <- c("doc1","doc1","doc2","doc3","doc3","doc4","doc4")
word <- c("apple","apples","chicken","banana","bananas","veggie","soup")
text <- c("yesterday I ate apples", "yesterday I ate apples", "yesterday I ate chicken", "yesterday I ate bananas", "yesterday I ate bananas", "yesterday I ate veggies", "yesterday I ate soup") 
partial_match <- c("1","0","0","1","0","1","0")
mydata2 <- data.frame(doc_id,word,text,partial_match,stringsAsFactors = FALSE)

I have tried

str_detect(mydata$word, mydata$text)

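and similar things using functions such as charmatch, pmatch, grep and grepl, all without success.

The real data contains thousands of records, so the solution should scale.

Thanks.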

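After a long time trying, I learned a bit more about string manipulation and got it working. It may not be the most efficient approach, but it works.

OBS: I marked some comments with "¹", "²" and "³" so that I can explain them below.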

parcial.m = numeric() # Create an empty vector

for(i in seq_len(nrow(mydata2))){
  pattern = paste("([^\n]*)(",mydata2$word[i],")([^\n]*)",sep="")
  # ¹

  split = unlist(strsplit(mydata2$text[i], "[ [:punct:]]"))
  # Split the text by punctuation and spaces, i.e. by words

  word = grep(mydata2$word[i], split, value=TRUE)
  # Select only the 'original' word, i.e. the token that contains the searched word

  if(length(grep(mydata2$word[i], word))==0) {parcial.m[i]=0}
  # ²

  # The backreferences must be written "\\1" and "\\3" in an R string;
  # a single backslash ("\1") would be read as an octal escape instead
  else {parcial.m[i] = !((gsub(pattern, "\\1" , word)=="") & (gsub(pattern, "\\3" , word)==""))}}
  # ³

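¹: The pattern is: a group (marked by (...)) of 0 or more (hence the *) of any character except a newline (hence [^\n]: \n is a newline, and ^ inside the brackets means everything except it), then a group containing the searched word, then a third group identical to the first.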
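To make the three groups concrete, here is a small sketch that applies the same pattern to a single token. The values "apple" and "apples" come from the dummy data; the variable names searched, token and groups are only illustrative, and regmatches()/regexec() are used here just to display the captured groups (the loop itself does not need them).

```r
# Illustrative values taken from the dummy data
searched <- "apple"
token <- "apples"

# Same pattern construction as in the loop
pattern <- paste("([^\n]*)(", searched, ")([^\n]*)", sep = "")

# regexec() records the full match plus each capture group;
# regmatches() extracts them as a character vector
groups <- regmatches(token, regexec(pattern, token))[[1]]
print(groups)
# groups[1] is the full match, groups[2] to groups[4] are groups 1 to 3
```

Here group 1 is empty, group 2 is "apple" and group 3 is "s", which is what the later test on groups 1 and 3 relies on.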

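²: If there is no match at all, there is no partial match either, so we want the value to be 0. We pick out these cases using the fact that grep(mydata2$word[i], word) returns an integer vector of length 0 when nothing matches.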
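A quick sketch of that behaviour, assuming a hand-written token vector (tokens, no_hit and hit are illustrative names, not part of the code above):

```r
# Tokens from one of the dummy sentences
tokens <- c("yesterday", "I", "ate", "chicken")

# grep() returns the positions of the matching elements
no_hit <- grep("apple", tokens)    # nothing matches: zero-length integer vector
hit <- grep("chicken", tokens)     # position of the matching token

print(length(no_hit) == 0)
print(hit)
```

Testing length(...) == 0 is therefore a safe way to detect the no-match case before looking at the groups.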

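³: The backreferences \1 and \3 (written "\\1" and "\\3" inside an R string) select groups 1 and 3 of the pattern described above. If it is a perfect match, then after we "take away" the searched word (group 2) there is nothing left of word (what I called the 'original word'), so groups 1 and 3 are both empty (i.e. == ""). That line of code tests whether both groups are empty at the same time (a perfect match) and negates the result (hence the !). Since the if statement has already marked the no-match cases as 0, what remains are the partial matches.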
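Since the question asks for something that scales, here is a hedged alternative sketch of the same idea without the explicit loop or the regex groups. It is not the code above: flag_partial is a hypothetical helper name, and it assumes the searched word should be treated as literal text (fixed = TRUE) rather than as a regular expression. It has not been benchmarked on large data.

```r
# Dummy data from the question (expected flags: 1 0 0 1 0 1 0)
word <- c("apple","apples","chicken","banana","bananas","veggie","soup")
text <- c("yesterday I ate apples", "yesterday I ate apples",
          "yesterday I ate chicken", "yesterday I ate bananas",
          "yesterday I ate bananas", "yesterday I ate veggies",
          "yesterday I ate soup")
mydata2 <- data.frame(word, text, stringsAsFactors = FALSE)

# Hypothetical helper: 1 if some token contains the word but is not
# exactly equal to it, 0 otherwise (including the no-match case)
flag_partial <- function(word, text) {
  tokens <- unlist(strsplit(text, "[ [:punct:]]+"))
  contains <- grepl(word, tokens, fixed = TRUE)  # literal substring test
  as.integer(any(contains & tokens != word))
}

# Apply the helper row by row
mydata2$partial_match <- mapply(flag_partial, mydata2$word, mydata2$text,
                                USE.NAMES = FALSE)
print(mydata2$partial_match)
```

The trade-off is that a text containing both the exact word and a longer variant of it would be flagged as partial here, so the rule may need adjusting for real data.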