Fuzzy-match and extract words repeated across turns in conversation
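I am working on speech in conversational turns and want to extract words that are repeated across turns. The task I am struggling with is extracting words that are repeated inexactly.
Data: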
X <- data.frame(
  speaker = c("A","B","A","B"),
  speech = c("i'm gonna take a look you okay with that",
             "sure looks good we can take a look you go first",
             "okay last time I looked was different i think that is it yeah",
             "yes you're right i think that's it"),
  stringsAsFactors = FALSE
)
I have a for loop that successfully extracts the exact repetitions:
# initialize vectors:
pattern1 <- c()
extracted1 <- c()
# run `for` loop:
library(stringr)
for(i in 2:nrow(X)){
  # define each `speech` element as a pattern for the next `speech` element
  # (note the doubled backslashes: "\\b" is a regex word boundary):
  pattern1[i-1] <- paste0("\\b(", paste0(unlist(str_split(X$speech[i-1], " ")), collapse = "|"), ")\\b")
  # extract all matched words:
  extracted1[i] <- str_extract_all(X$speech[i], pattern1[i-1])
}
# result:
extracted1
[[1]]
NULL
[[2]]
[1] "take" "a" "look" "you"
[[3]]
character(0)
[[4]]
[1] "i" "think" "that" "it"
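However, I would also like to extract inexact repetitions. For example, "looks" in row 2 is an inexact repetition of "look" in row 1, "looked" in row 3 fuzzily repeats "looks" in row 2, and "yes" in row 4 is an approximate match of "yeah" in row 3.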
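I recently came across agrep, which is used for approximate matching, but I do not know how to use it here or whether it is the right approach. Any help is much appreciated.
Note that the actual data contain thousands of turns with highly unpredictable content, so it is not possible to define a list of all possible variants in advance.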
I think this can be done nicely with a tidy approach. The problem you have already solved can be done (probably faster) with tidytext:
library(tidytext)
library(tidyverse)
# transform text into a tidy format
x_tidy <- X %>%
mutate(id = row_number()) %>%
unnest_tokens(output = "word", input = "speech")
# join data.frame with itself just moved by one id
x_tidy %>%
mutate(id_last = id - 1) %>%
semi_join(x_tidy, by = c("id_last" = "id", "word" = "word"))
#> speaker id word id_last
#> 2.5 B 2 take 1
#> 2.6 B 2 a 1
#> 2.7 B 2 look 1
#> 2.8 B 2 you 1
#> 4.3 B 4 i 3
#> 4.4 B 4 think 3
#> 4.6 B 4 it 3
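Of course, what you want to do is a bit more complicated. The example words you highlight are not identical, but they have a Levenshtein distance of at most 2: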
adist(c("look", "looks", "looked"))
#> [,1] [,2] [,3]
#> [1,] 0 1 2
#> [2,] 1 0 2
#> [3,] 2 2 0
adist(c("yes", "yeah"))
#> [,1] [,2]
#> [1,] 0 2
#> [2,] 2 0
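To make the distance threshold concrete, here is a minimal base-R sketch of how such a matrix can be turned into a match decision between two consecutive turns. The helpers tokenize() and fuzzy_repeats() are names made up for this illustration, and the whitespace split is a simplifying assumption (it is not the tidytext tokenizer used above):

```r
# Split a turn into lower-case words (simple whitespace split; hypothetical helper).
tokenize <- function(text) {
  words <- unlist(strsplit(tolower(text), "\\s+"))
  words[nzchar(words)]
}

# Return the words of `current` that are within `max_dist` edits
# (Levenshtein distance, via base R's adist) of any word in `previous`.
fuzzy_repeats <- function(current, previous, max_dist = 2) {
  cur <- tokenize(current)
  prev <- tokenize(previous)
  if (length(cur) == 0 || length(prev) == 0) return(character(0))
  d <- adist(cur, prev)  # rows: current words, columns: previous words
  cur[apply(d, 1, min) <= max_dist]
}

speech <- c("i'm gonna take a look you okay with that",
            "sure looks good we can take a look you go first")
fuzzy_repeats(speech[2], speech[1], max_dist = 1)
```

With max_dist = 1 this keeps "looks" next to the exact repeats. Raising the threshold to 2 also lets very short words through, because a two-letter word such as "we" is only two edits away from "a".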
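Following the same tidyverse logic, there is a great package for this, fuzzyjoin. Unfortunately, the by argument of the relevant function does not seem to handle two columns (or it applies the fuzzy logic to both columns, so that 0 and 2 are treated as the same?), so this does not work: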
x_tidy %>%
mutate(id_last = id - 1) %>%
fuzzyjoin::stringdist_semi_join(x_tidy, by = c("word" = "word", "id_last" = "id"), max_dist = 2)
However, with a loop we can implement the missing functionality anyway:
library(fuzzyjoin)
map_df(unique(x_tidy$id), function(i) {
current <- x_tidy %>%
filter(id == i)
last <- x_tidy %>%
filter(id == i - 1)
current %>%
fuzzyjoin::stringdist_semi_join(last, by = "word", max_dist = 2)
})
#> speaker id word
#> 2.1 B 2 looks
#> 2.2 B 2 good
#> 2.3 B 2 we
#> 2.4 B 2 can
#> 2.5 B 2 take
#> 2.6 B 2 a
#> 2.7 B 2 look
#> 2.8 B 2 you
#> 2.9 B 2 go
#> 3.2 A 3 time
#> 3.3 A 3 i
#> 3.4 A 3 looked
#> 3.5 A 3 was
#> 3.7 A 3 i
#> 3.10 A 3 is
#> 3.11 A 3 it
#> 4 B 4 yes
#> 4.3 B 4 i
#> 4.4 B 4 think
#> 4.5 B 4 that's
#> 4.6 B 4 it
Created on 2021-04-22 by the reprex package (v2.0.0)
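I am not sure how well this distance works for you, or whether you consider the results correct. Alternatively, you could try stemming or lemmatization before matching, which may work better. I have also written a new function for the package that implements a stringsim_join version, which takes the length of the words being matched into account. But the PR has not been merged yet.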
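Until something like that is available, the idea of taking word length into account can be sketched in base R by dividing the edit distance by the length of the longer word. This is only an illustration under my own assumptions: norm_sim() and similar_repeats() are hypothetical helper names, not the function from the pending PR, and the 0.6 cutoff is an arbitrary choice:

```r
# Length-normalised similarity: 1 means identical, 0 means nothing in common.
# Hypothetical helper, not the function from the pending pull request.
norm_sim <- function(a, b) {
  d <- adist(a, b)                            # Levenshtein distances
  longest <- outer(nchar(a), nchar(b), pmax)  # length of the longer word per pair
  1 - d / longest
}

# Keep the words in `cur` whose best similarity to any word in `prev`
# reaches `min_sim` (the cutoff is an assumption to be tuned on real data).
similar_repeats <- function(cur, prev, min_sim = 0.6) {
  if (length(cur) == 0 || length(prev) == 0) return(character(0))
  s <- norm_sim(cur, prev)
  cur[apply(s, 1, max) >= min_sim]
}

prev <- c("i'm", "gonna", "take", "a", "look", "you", "okay", "with", "that")
cur  <- c("sure", "looks", "good", "we", "can", "take", "a", "look", "you", "go", "first")
similar_repeats(cur, prev, min_sim = 0.6)
```

Short words such as "we" or "go" no longer match "a", because two edits on a two- or three-letter word give a low similarity. The cutoff still needs care: "yes" and "yeah" only reach 0.5 on this scale, so a threshold of 0.6 would drop that pair.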