Fuzzy string matching of a list of character vectors to a character vector
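I have a list of character vectors and a character vector. I want to fuzzy-match each element of the list (a character vector) against each element of the character vector (a string) and return the maximum similarity score for each combination. Here is a toy example: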
a <- c("brown fox", "lazy dog", "white cat", "I don't know", "sunset", "never mind", "excuse me")
b <- c("very late", "do not cross", "sunrise", "long vacation")
c <- c("toy example", "green apple", "tall building", "good rating", "accommodating")
mylist <- list(a,b,c)
charvec <- c("brown dog", "lazy cat", "white dress", "I know that", "excuse me please", "tall person", "new building", "good example", "green with envy", "zebra crossing")
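Now I want to fuzzy-match each string in the first element of mylist against the first string in charvec and return the maximum of the 7 similarity scores. Likewise, I want the score for every combination of mylist and charvec.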
My attempt so far:
Convert the strings in charvec into the column names of an empty data frame:
df <- setNames(data.frame(matrix(ncol = 10, nrow = 3)), c(charvec))
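Compute the maximum similarity score for each combination using the jarowinkler distance from the RecordLinkage package (or a better distance metric for matching phrases, if there is one!):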
for (j in seq_along(mylist)) {
for (i in length(ncol(df))) {
df[[i,j]] <- max(jarowinkler(names(df)[i], mylist[[j]]))
}
}
Unfortunately, I only get 3 scores in the first row; the remaining values are NA.
Any help would be appreciated.
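A likely cause of the NA values in the attempt above: `length(ncol(df))` evaluates to 1, so the inner loop only ever runs with `i = 1`, and `df[[i, j]]` addresses row `i`, column `j`, so only row 1, columns 1 to 3 are filled. The following is a minimal sketch of a corrected loop, not a definitive fix. To keep it runnable without extra packages it uses a hypothetical stand-in similarity, `sim()`, built on base R's `adist()` (normalised Levenshtein similarity); that function is an assumption of this sketch, and `RecordLinkage::jarowinkler` can be swapped in where `sim` is called.

```r
# Toy data from the question.
a <- c("brown fox", "lazy dog", "white cat", "I don't know", "sunset", "never mind", "excuse me")
b <- c("very late", "do not cross", "sunrise", "long vacation")
c <- c("toy example", "green apple", "tall building", "good rating", "accommodating")
mylist <- list(a = a, b = b, c = c)
charvec <- c("brown dog", "lazy cat", "white dress", "I know that", "excuse me please",
             "tall person", "new building", "good example", "green with envy", "zebra crossing")

# Stand-in similarity (assumption, not from the original post):
# 1 - Levenshtein distance / length of the longer string, so values lie in [0, 1].
# Replace with RecordLinkage::jarowinkler(x, y) if that package is installed.
sim <- function(x, y) {
  d <- drop(utils::adist(x, y))
  1 - d / pmax(nchar(x), nchar(y))
}

# One row per list element, one column per string in charvec.
scores <- matrix(NA_real_, nrow = length(mylist), ncol = length(charvec),
                 dimnames = list(names(mylist), charvec))

for (j in seq_along(mylist)) {       # rows: elements of mylist
  for (i in seq_along(charvec)) {    # columns: strings in charvec
    # Best score of charvec[i] against all strings in mylist[[j]].
    scores[j, i] <- max(sim(charvec[i], mylist[[j]]))
  }
}

scores
```

The key changes are iterating with `seq_along(charvec)` instead of `length(ncol(df))` and indexing as `[row, column]`, with rows for list elements and columns for strings.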
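First, a helper function that returns the best match for a given word from a character vector of candidates. I use the purrr package for its mapping functions because I prefer them to loops.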
library(purrr)
library(magrittr)
library(RecordLinkage)
a <- c("brown fox", "lazy dog", "white cat", "I don't know", "sunset", "never mind", "excuse me")
charvec <- c("brown dog", "lazy cat", "white dress", "I know that", "excuse me please", "tall person", "new building", "good example", "green with envy", "zebra crossing")
getBestMatch <- function(word, vector) {
  # Score `word` against every candidate in `vector`, then return the top-scoring candidate.
  purrr::map_dbl(vector, ~ RecordLinkage::jarowinkler(word, .x)) %>%
    magrittr::set_names(vector) %>%
    which.max %>%
    names
}
Running the function gives the following output:
> getBestMatch("brown fox", charvec)
[1] "brown dog"
Now that we have the helper function, simply call it on each element of the vector.
> map_chr(a, ~ getBestMatch(.x, charvec))
[1] "brown dog" "lazy cat" "white dress" "I know that"
[5] "I know that" "new building" "excuse me please"
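The helper above returns only the best-matching string, while the question also asks for the score. The following is a small sketch of a variant that returns both. The names `best_match` and `sim` are hypothetical and not part of the original answer, and `sim()` is a base R stand-in (normalised Levenshtein similarity via `adist()`) used so the block runs without extra packages; passing `RecordLinkage::jarowinkler` as `sim_fun` should work the same way.

```r
a <- c("brown fox", "lazy dog", "white cat", "I don't know", "sunset", "never mind", "excuse me")
charvec <- c("brown dog", "lazy cat", "white dress", "I know that", "excuse me please",
             "tall person", "new building", "good example", "green with envy", "zebra crossing")

# Stand-in similarity (assumption): 1 - Levenshtein distance / longer string length.
sim <- function(x, y) {
  d <- drop(utils::adist(x, y))
  1 - d / pmax(nchar(x), nchar(y))
}

# Return the best-matching candidate for one word together with its score.
best_match <- function(word, candidates, sim_fun) {
  s <- sim_fun(word, candidates)
  i <- which.max(s)
  data.frame(word = word, match = candidates[i], score = s[i],
             stringsAsFactors = FALSE)
}

# One row per string in `a`.
result <- do.call(rbind, lapply(a, best_match, candidates = charvec, sim_fun = sim))
result
```

Because the similarity function is passed in as an argument, the same helper can be reused to compare different metrics on the same data.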
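library(stringdist)
# Assumed setup, not shown in the original answer: one row per string in mylist.
df <- data.frame(text = unlist(mylist), stringsAsFactors = FALSE)
dist <- stringdistmatrix(df$text, charvec, method = "lcs")
rownames(dist) <- df$text
colnames(dist) <- charvec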
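I used lcs, the Longest Common Substring distance, in this example.
I encourage you to look at the other methods; see ?"stringdist-metrics".
The smaller the distance, the better the match...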
> dist
# brown dog lazy cat white dress I know that excuse me please tall person new building good example green with envy zebra crossing
# brown fox 4 15 16 14 23 14 17 15 18 15
# lazy dog 9 6 15 15 20 13 14 18 21 14
# white cat 14 9 8 12 19 16 17 17 16 17
# I don't know 13 16 19 11 24 17 18 20 19 20
# sunset 13 12 13 13 16 13 14 16 17 16
# never mind 13 16 15 17 18 15 12 18 15 14
# excuse me 16 15 14 18 7 16 17 13 16 17
# very late 14 9 14 14 15 16 15 15 16 17
# do not cross 13 16 13 15 22 15 20 18 21 14
# sunrise 14 15 14 16 17 14 15 17 16 17
# long vacation 14 11 22 16 25 16 17 19 20 19
# toy example 16 13 16 16 15 14 19 5 20 21
# green apple 14 15 16 16 15 16 17 11 12 21
# tall building 16 17 18 20 25 12 7 21 22 17
# good rating 14 13 18 14 23 16 15 11 18 15
# accommodating 16 13 22 18 23 18 17 17 24 15
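The matrix above has one row per string, whereas the question asks for one best value per element of mylist. A minimal sketch of collapsing such a distance matrix by list element follows. It assumes base R's `adist()` (generalised Levenshtein distance) as a stand-in for `stringdistmatrix()`, so the numbers differ from the lcs table above; the grouping step is the same for any distance matrix. The names `group` and `best` are illustrative only.

```r
a <- c("brown fox", "lazy dog", "white cat", "I don't know", "sunset", "never mind", "excuse me")
b <- c("very late", "do not cross", "sunrise", "long vacation")
c <- c("toy example", "green apple", "tall building", "good rating", "accommodating")
mylist <- list(a = a, b = b, c = c)
charvec <- c("brown dog", "lazy cat", "white dress", "I know that", "excuse me please",
             "tall person", "new building", "good example", "green with envy", "zebra crossing")

# Stand-in distance matrix (assumption): Levenshtein via adist(), 16 rows x 10 columns.
strings <- unlist(mylist, use.names = FALSE)
dist <- utils::adist(strings, charvec)
dimnames(dist) <- list(strings, charvec)

# Label each row with the list element it came from: "a" x 7, "b" x 4, "c" x 5.
group <- rep(names(mylist), lengths(mylist))

# For every list element, take the column-wise minimum distance (smaller is better).
best <- t(sapply(split(seq_along(group), group), function(idx) {
  apply(dist[idx, , drop = FALSE], 2, min)
}))

best  # 3 rows (a, b, c) x 10 columns (strings in charvec)
```

For a similarity score rather than a distance, the same pattern applies with `max` in place of `min`.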
Using the purrr package:
mylist <- setNames(mylist, c('a', 'b', 'c'))
library(purrr)
library(RecordLinkage)  # provides jarowinkler()
map_dfr(charvec,
function(wrd, vec_list){
setNames(map_df(vec_list, ~max(jarowinkler(wrd, .x))),
names(vec_list)
)
},
mylist)
# A tibble: 10 x 3
a b c
<dbl> <dbl> <dbl>
1 0.911 0.580 0.603
2 0.85 0.713 0.603
3 0.842 0.557 0.515
4 0.657 0.490 0.409
5 0.912 0.489 0.659
6 0.538 0.546 0.801
7 0.716 0.547 0.740
8 0.591 0.524 0.856
9 0.675 0.509 0.821
10 0.619 0.587 0.630
If you prefer the wide format:
map_dfc(charvec,
function(wrd, vec_list) {
set_names(list(map_dbl(vec_list, ~max(jarowinkler(wrd, .x)))),
wrd)
},
mylist
)
# A tibble: 3 x 10
`brown dog` `lazy cat` `white dress` `I know that` `excuse me plea~ `tall person` `new building` `good example`
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 0.911 0.85 0.842 0.657 0.912 0.538 0.716 0.591
2 0.580 0.713 0.557 0.490 0.489 0.546 0.547 0.524
3 0.603 0.603 0.515 0.409 0.659 0.801 0.740 0.856
# ... with 2 more variables: `green with envy` <dbl>, `zebra crossing` <dbl>