Flag for partial match between 2 columns in R
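I have a data frame and need to create a flag that marks the rows where there is a partial match between 2 columns. Here is the code with some dummy data: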
doc_id <- c("doc1","doc1","doc2","doc3","doc3","doc4","doc4")
word <- c("apple","apples","chicken","banana","bananas","veggie","veggies")
text <- c("yesterday I ate apples", "yesterday I ate apples", "yesterday I ate chicken", "yesterday I ate bananas", "yesterday I ate bananas", "yesterday I ate veggies", "yesterday I ate veggies")
mydata <- data.frame(doc_id,word,text,stringsAsFactors = FALSE)
The expected result is the same data frame with one additional column showing whether the match between word and text is a partial match:
doc_id <- c("doc1","doc1","doc2","doc3","doc3","doc4","doc4")
word <- c("apple","apples","chicken","banana","bananas","veggie","soup")
text <- c("yesterday I ate apples", "yesterday I ate apples", "yesterday I ate chicken", "yesterday I ate bananas", "yesterday I ate bananas", "yesterday I ate veggies", "yesterday I ate soup")
partial_match <- c("1","0","0","1","0","1","0")
mydata2 <- data.frame(doc_id,word,text,partial_match,stringsAsFactors = FALSE)
I have tried
str_detect(mydata$word, mydata$text)
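as well as similar attempts with functions such as charmatch, pmatch, grep and grepl, none of which worked.
The real data contains thousands of records, so the solution should scale.
Thanks.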
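After a lot of trial and error I learned more about string manipulation and got it working. It may not be the most efficient approach, but it works.
Note: I marked some comments with "¹", "²" and "³" so that I can explain them afterwards.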
parcial.m = numeric() # Create an empty vector
for (i in seq_len(nrow(mydata2))) {
  pattern = paste("([^\n]*)(", mydata2$word[i], ")([^\n]*)", sep = "")
  # ¹
  split = unlist(strsplit(mydata2$text[i], "[ [:punct:]]"))
  # Split the text by punctuation and spaces, i.e. by words
  word = grep(mydata2$word[i], split, value = TRUE)
  # Keep only the token that contains the search word (the 'original' word)
  if (length(grep(mydata2$word[i], word)) == 0) {
    parcial.m[i] = 0
    # ²
  } else {
    # Backreferences need a doubled backslash inside an R string
    parcial.m[i] = !((gsub(pattern, "\\1", word) == "") & (gsub(pattern, "\\3", word) == ""))
    # ³
  }
}
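¹: The pattern is: a group (marked by (...)) of 0 or more (hence the *) characters of any kind except a newline (hence [^\n]; \n is the newline and ^ means everything except it), then a group containing the search word, then a third group identical to the first.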
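²: If there is no match at all, there is no partial match either, so we want the value to be 0. We pick out these cases using the fact that grep(mydata2$word[i], word) returns an integer vector of length 0 when nothing matches.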
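³: "\\1" and "\\3" select groups 1 and 3 of the pattern described above. If it is a perfect match, then after we "take away" the search word (group 2), word (which I call the 'original word') has nothing left over, so groups 1 and 3 are both empty (i.e. == ""). This line tests whether both groups are empty at the same time (an exact match) and negates the result (hence the !). Since the if statement has already marked the no-matches as 0, whatever remains is a partial match.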
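Since the question asks for something that scales, the same idea can also be sketched without regex backreferences and without growing a vector inside a loop. This is only an illustration under stated assumptions, not the method used above: the helper name flag_partial is hypothetical, the search words are matched literally (fixed = TRUE, so regex metacharacters in a word are not interpreted), and a match counts as "partial" when the word occurs inside some token of the text but no token equals it exactly.

```r
# Hypothetical helper: 1 if `word` only occurs as part of a longer token in `text`, else 0.
flag_partial <- function(word, text) {
  # Split every text into tokens on spaces and punctuation, as in the loop above
  tokens <- strsplit(text, "[ [:punct:]]")
  # Pair each word with the tokens of its own row
  unname(mapply(function(w, tk) {
    contains <- grepl(w, tk, fixed = TRUE)  # token contains the word literally
    exact    <- tk == w                     # token is exactly the word
    as.integer(any(contains) && !any(exact))
  }, word, tokens))
}

word <- c("apple", "apples", "chicken", "banana", "bananas", "veggie", "soup")
text <- c("yesterday I ate apples", "yesterday I ate apples", "yesterday I ate chicken",
          "yesterday I ate bananas", "yesterday I ate bananas", "yesterday I ate veggies",
          "yesterday I ate soup")
mydata2 <- data.frame(word, text, stringsAsFactors = FALSE)
mydata2$partial_match <- flag_partial(mydata2$word, mydata2$text)
print(mydata2$partial_match)
# [1] 1 0 0 1 0 1 0
```

Note one deliberate difference in behaviour: when a text contains both the exact word and a longer form of it (for example "apple and apples" with the word "apple"), this sketch returns 0. Whether that is the right rule depends on the real data.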