Count+Identify common words in two string vectors [R]
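How can I write an R function that takes two string vectors and returns the number of words they have in common, and which words those are, comparing element 1 of strvec1 with element 1 of strvec2, element 2 of strvec1 with element 2 of strvec2, and so on?
Suppose I have this data: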
#string vector 1
strvec1 <- c("Griffin Rahea Petersen Deana Franks Morgan","Story Keisha","Douglas Landon Lark","Kinsman Megan Thrall Michael Michels Breann","Gutierrez Mccoy Tyler Westbrook Grayson Swank Shirley Didas Moriah")
#string vector 2
strvec2 <- c("Griffin Morgan Rose Manuel","Van De Grift Sarah Sell William","Mark Landon Lark","Beerman Carlee Megan Thrall Michels","Mcmillan Tyler Jonathan Westbrook Grayson Didas Lloyd Connor")
Ideally, I would have functions that return the number of common words and what those common words are:
#Non working sample of how functions would ideally work
desiredfunction_numwords(strvec1,strvec2)
[1] 2 0 2 3 4
desiredfunction_matchwords(strvec1,strvec2)
[1] "Griffin Morgan" "" "Landon Lark" "Megan Thrall Michels" "Tyler Westbrook Grayson Didas"
You can split each string into words and then operate on each pair of word vectors.
In base R:
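numwords <- function(str1, str2) {
  # Split each string on spaces, then count the words shared by each pair
  mapply(function(x, y) length(intersect(x, y)),
         strsplit(str1, ' '), strsplit(str2, ' '))
}
matchwords <- function(str1, str2) {
  # Same pairing, but paste the shared words back into a single string
  mapply(function(x, y) paste0(intersect(x, y), collapse = " "),
         strsplit(str1, ' '), strsplit(str2, ' '))
}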
numwords(strvec1, strvec2)
#[1] 2 0 2 3 4
matchwords(strvec1, strvec2)
#[1] "Griffin Morgan" "" "Landon Lark"
#[4] "Megan Thrall Michels" "Tyler Westbrook Grayson Didas"
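If both results are needed together, the two functions above can be folded into one so that the strings are split and intersected only once. This is a minimal sketch, not part of the original answer: the name `common_words` and the column names `n` and `words` are made up for illustration, and it assumes words are separated by single spaces.

```r
# Hypothetical helper: returns the count and the shared words side by side
common_words <- function(str1, str2) {
  stopifnot(length(str1) == length(str2))
  # Pair up the word vectors and keep the words present in both
  shared <- Map(intersect,
                strsplit(str1, " ", fixed = TRUE),
                strsplit(str2, " ", fixed = TRUE))
  data.frame(n = lengths(shared),
             words = vapply(shared, paste, character(1), collapse = " "),
             stringsAsFactors = FALSE)
}

res <- common_words(c("Griffin Rahea Morgan", "Story Keisha"),
                    c("Griffin Morgan Rose", "Van De Grift"))
res$n      # 2 0
res$words  # "Griffin Morgan" ""
```

Because `intersect` drops duplicates, a word repeated inside one string is counted once, and the shared words come back in the order they appear in the first vector.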
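You can use strvec1 as a regular expression pattern by splitting it into separate words with strsplit and then pasting those words together with the alternation marker |: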
pattern <- paste0(unlist(strsplit(strvec1, " ")), collapse = "|")
You can use this pattern with str_count and str_extract_all:
library(stringr)
# counts:
str_count(strvec2, pattern)
[1] 2 0 2 3 4
# matches:
str_extract_all(strvec2, pattern)
[[1]]
[1] "Griffin" "Morgan"
[[2]]
character(0)
[[3]]
[1] "Landon" "Lark"
[[4]]
[1] "Megan" "Thrall" "Michels"
[[5]]
[1] "Tyler" "Westbrook" "Grayson" "Didas"
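One thing to keep in mind with the single pooled pattern is that it contains the words from every element of strvec1, and it matches substrings as well as whole words. A possible variation, sketched here in base R rather than stringr, builds one word-bounded pattern per element so each string is compared only with its counterpart. The helper name `make_patterns` and the small vectors `a` and `b` are hypothetical, and the sketch assumes the words contain no regex metacharacters.

```r
a <- c("Story Keisha", "Douglas Landon Lark")
b <- c("Landon Keisha", "Larkin Landon")

# Hypothetical helper: one alternation pattern per element, wrapped in \b
# so that "Lark" does not match inside "Larkin"
make_patterns <- function(x) {
  vapply(strsplit(x, " ", fixed = TRUE),
         function(words) paste0("\\b(", paste(words, collapse = "|"), ")\\b"),
         character(1))
}

patterns <- make_patterns(a)

# Match each element of b only against the pattern built from the same position
matches <- unname(Map(function(s, p) regmatches(s, gregexpr(p, s, perl = TRUE))[[1]],
                      b, patterns))
counts <- lengths(matches)

counts   # 1 1
matches  # "Keisha" for the first pair, "Landon" for the second
```

With the pooled pattern, "Landon" would also be reported for the first pair and "Lark" for the second, which is why the per-element, word-bounded form may be closer to what the question asks for.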