How to extract all matching patterns (words in a string) in a dataframe column?
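I have two data frames. One (txt.df) has a column of text (text) from which I want to extract phrases. The other (wrd.df) has a column containing the phrases (phrase). Both are large data frames with complex text and strings, but let's say: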
txt.df <- data.frame(id = c(1, 2, 3, 4, 5),
                     text = c("they love cats and dogs", "he is drinking juice",
                              "the child is having a nap on the bed", "they jump on the bed and break it",
                              "the cat is sleeping on the bed"))
wrd.df <- data.frame(label = c('a', 'b', 'c', 'd', 'e', 'd'),
                     phrase = c("love cats", "love dogs", "juice drinking", "nap on the bed", "break the bed",
                                "sleeping on the bed"))
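What I ultimately need is txt.df with an additional column containing the labels of the phrases detected.
I tried creating a column in wrd.df in which I tokenized the phrases, like this: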
wrd.df$token <- sapply(wrd.df$phrase, function(x) unlist(strsplit(x, split = " ")))
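Then I tried to write a custom function that uses grepl/str_detect on the token column to get the names (labels) of the phrases whose tokens were all TRUE: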
Extract.Fun <- function(text, df, label, token){
  for (i in token) {
    truefalse[i] <- sapply(token[i], function (x) grepl(x, text))
    truenames[i] <- names(which(truefalse[i] == T))
    removedup[i] <- unique(truenames[i])
    return(removedup)
  }
}
Then I applied this custom function to txt.df$text to create a new column with the labels.
txt.df$extract <- sapply(txt.df$text, function (x) Extract.Fun(x, wrd.df, "label", "token"))
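But I'm not good with custom functions and am really stuck. I would appreciate any help.
P.S. It would be great if I could also get partial matches like "drink juice" and "broke the bed", but that is not a priority. Matching the original phrases is fine.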
If you need to match exact phrases, regex_join() from the fuzzyjoin package is what you need.
fuzzyjoin::regex_join( txt.df, wrd.df, by = c(text = "phrase"), mode = "left" )
  id                                 text label              phrase
1  1              they love cats and dogs     a           love cats
2  2                 he is drinking juice  <NA>                <NA>
3  3 the child is having a nap on the bed     d      nap on the bed
4  4    they jump on the bed and break it  <NA>                <NA>
5  5       the cat is sleeping on the bed     d sleeping on the bed
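The join above returns one row per match, whereas the question asks for a single extra column on txt.df. Below is a minimal base R sketch of that step, not part of the original answer. It assumes exact (fixed-string) phrase matching; `extract_labels` is a hypothetical helper name, and collapsing multiple labels with a comma is an assumption about the desired output.

```r
# Example data from the question (stringsAsFactors = FALSE keeps the text as
# character on R versions before 4.0)
txt.df <- data.frame(id = 1:5,
                     text = c("they love cats and dogs", "he is drinking juice",
                              "the child is having a nap on the bed",
                              "they jump on the bed and break it",
                              "the cat is sleeping on the bed"),
                     stringsAsFactors = FALSE)
wrd.df <- data.frame(label = c("a", "b", "c", "d", "e", "d"),
                     phrase = c("love cats", "love dogs", "juice drinking",
                                "nap on the bed", "break the bed",
                                "sleeping on the bed"),
                     stringsAsFactors = FALSE)

# extract_labels() is a hypothetical helper, not a fuzzyjoin function.
# It returns the unique labels of all phrases found verbatim in one text,
# collapsed into a single string, or NA when nothing matches.
extract_labels <- function(text, phrases, labels) {
  hit <- vapply(phrases, function(p) grepl(p, text, fixed = TRUE), logical(1))
  found <- unique(labels[hit])
  if (length(found) == 0) NA_character_ else paste(found, collapse = ",")
}

txt.df$extract <- vapply(txt.df$text, extract_labels, character(1),
                         phrases = wrd.df$phrase, labels = wrd.df$label,
                         USE.NAMES = FALSE)
txt.df
```

With the example data, the new column holds the same labels as the join output above, with NA for ids 2 and 4.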
If you want to match all the words, I guess you can build a regex from the phrases that covers that behaviour...
Update
#build regex for phrases
#done by splitting the phrases to individual words, and then paste the regex together
wrd.df$regex <- unlist( lapply( lapply( strsplit( wrd.df$phrase, " "),
                                        function(x) paste0( "(?=.*", x, ")", collapse = "" ) ),
                                function(x) paste0( "^", x, ".*$") ) )
fuzzyjoin::regex_join( txt.df, wrd.df, by = c(text = "regex"), mode = "left" )
  id                                 text label              phrase                                        regex
1  1              they love cats and dogs     a           love cats                     ^(?=.*love)(?=.*cats).*$
2  1              they love cats and dogs     b           love dogs                     ^(?=.*love)(?=.*dogs).*$
3  2                 he is drinking juice     c      juice drinking                ^(?=.*juice)(?=.*drinking).*$
4  3 the child is having a nap on the bed     d      nap on the bed      ^(?=.*nap)(?=.*on)(?=.*the)(?=.*bed).*$
5  4    they jump on the bed and break it     e       break the bed            ^(?=.*break)(?=.*the)(?=.*bed).*$
6  5       the cat is sleeping on the bed     d sleeping on the bed ^(?=.*sleeping)(?=.*on)(?=.*the)(?=.*bed).*$
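Each `(?=.*word)` lookahead only asserts that the word occurs somewhere in the text, so word order does not matter. The same idea can be tried without fuzzyjoin. The following is a rough base R sketch rather than the method used above: `to_regex` and `labels_for` are hypothetical helper names, the `\\b` word boundaries are an added assumption (the regex above matches substrings), and the phrase words are assumed to contain no regex metacharacters. Note that `grepl()` needs `perl = TRUE` for lookaheads.

```r
txt.df <- data.frame(id = 1:5,
                     text = c("they love cats and dogs", "he is drinking juice",
                              "the child is having a nap on the bed",
                              "they jump on the bed and break it",
                              "the cat is sleeping on the bed"),
                     stringsAsFactors = FALSE)
wrd.df <- data.frame(label = c("a", "b", "c", "d", "e", "d"),
                     phrase = c("love cats", "love dogs", "juice drinking",
                                "nap on the bed", "break the bed",
                                "sleeping on the bed"),
                     stringsAsFactors = FALSE)

# Hypothetical helper: one lookahead per word, so all words must occur, in any order.
# The \\b boundaries are an assumption added here to avoid hits inside longer words.
to_regex <- function(phrase) {
  words <- strsplit(phrase, " ", fixed = TRUE)[[1]]
  paste0("^", paste0("(?=.*\\b", words, "\\b)", collapse = ""), ".*$")
}
wrd.df$regex <- vapply(wrd.df$phrase, to_regex, character(1), USE.NAMES = FALSE)

# Hypothetical helper: labels of every phrase whose words all occur in the text
labels_for <- function(text, patterns, labels) {
  hit <- vapply(patterns, function(p) grepl(p, text, perl = TRUE), logical(1))
  found <- unique(labels[hit])
  if (length(found) == 0) NA_character_ else paste(found, collapse = ",")
}

txt.df$extract_all <- vapply(txt.df$text, labels_for, character(1),
                             patterns = wrd.df$regex, labels = wrd.df$label,
                             USE.NAMES = FALSE)
txt.df
```

On the example data this gives one row per text, with "a,b" for id 1 and a single label for each of the other ids, which agrees with the labels in the join output above.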