Fuzzy-match and extract words repeated across turns in conversation

Question

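I am working on speech in conversational turns and want to extract words that are repeated across turns. The task I am struggling with is extracting words that are repeated inexactly.

Data: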

X <- data.frame(
  speaker = c("A","B","A","B"),
  speech = c("i'm gonna take a look you okay with that", 
             "sure looks good we can take a look you go first",
             "okay last time I looked was different i think that is it yeah",
             "yes you're right i think that's it"), stringsAsFactors = F
)

I have a for loop that successfully extracts the exact repetitions:

# initialize vectors:
pattern1 <- c()
extracted1 <- c()

# run `for` loop:
library(stringr)
for(i in 2:nrow(X)){
  # define each `speech` element as a pattern for the next `speech` element:
  pattern1[i-1] <- paste0("\\b(", paste0(unlist(str_split(X$speech[i-1], " ")), collapse = "|"), ")\\b")
  # extract all matched words:
  extracted1[i] <- str_extract_all(X$speech[i], pattern1[i-1])
}
# result:
extracted1
[[1]]
NULL

[[2]]
[1] "take" "a"    "look" "you" 

[[3]]
character(0)

[[4]]
[1] "i"     "think" "that"  "it"

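However, I would also like to extract inexact repetitions. For example, looks in row 2 is an inexact repetition of look in row 1, looked in row 3 fuzzily repeats looks in row 2, and yes in row 4 is an approximate match for yeah in row 3. I recently came across agrep, which is used for approximate matching, but I don't know how to use it here or whether it is the right approach at all. Any help is much appreciated.

Note that the actual data contain thousands of turns with highly unpredictable content, so it is not possible to define a list of all possible variants beforehand.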

Answer 1

I think this can be done nicely with a tidy approach. The problem you have already solved can be done (and probably faster) using tidytext:

library(tidytext)
library(tidyverse)
# transform text into a tidy format
x_tidy <- X %>% 
  mutate(id = row_number()) %>% 
  unnest_tokens(output = "word", input = "speech")

# join data.frame with itself just moved by one id
x_tidy %>% 
  mutate(id_last = id - 1) %>% 
  semi_join(x_tidy, by = c("id_last" = "id", "word" = "word"))
#>     speaker id  word id_last
#> 2.5       B  2  take       1
#> 2.6       B  2     a       1
#> 2.7       B  2  look       1
#> 2.8       B  2   you       1
#> 4.3       B  4     i       3
#> 4.4       B  4 think       3
#> 4.6       B  4    it       3

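Of course, what you want to do is a bit more complicated. The example words you highlighted are not exactly the same, but they have a Levenshtein distance of at most 2: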

adist(c("look", "looks", "looked"))
#>      [,1] [,2] [,3]
#> [1,]    0    1    2
#> [2,]    1    0    2
#> [3,]    2    2    0
adist(c("yes", "yeah"))
#>      [,1] [,2]
#> [1,]    0    2
#> [2,]    2    0
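The matrices above compare the words within a single vector. For turn-to-turn matching, adist() also accepts two vectors and returns the distance from every word in the first to every word in the second. Here is a minimal base-R sketch for turns 1 and 2 of the example data. The variable names and the stricter threshold of 1 are my own choices for illustration, not part of the question:

```r
# Words of turn 1 (previous) and turn 2 (current) from the example data
previous <- strsplit("i'm gonna take a look you okay with that", " ")[[1]]
current  <- strsplit("sure looks good we can take a look you go first", " ")[[1]]

# Rows are current words, columns are previous words
d <- adist(current, previous)
dimnames(d) <- list(current, previous)

# Smallest distance from each current word to any word in the previous turn
nearest <- apply(d, 1, min)

# Keep current words within one edit of some previous word (threshold is an assumption)
close_words <- names(nearest)[nearest <= 1]
close_words
#> [1] "looks" "take"  "a"     "look"  "you"
```

Raising the threshold to 2 would also catch pairs such as yes and yeah, at the cost of more spurious matches among short words.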

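Following the same tidyverse logic, there is a great package for this, fuzzyjoin. Unfortunately, the by argument of the relevant function does not seem to handle two columns (or it applies the fuzzy logic to both columns, so that 0 and 2 are treated as the same?), so this does not work: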

x_tidy %>% 
  mutate(id_last = id - 1) %>% 
  fuzzyjoin::stringdist_semi_join(x_tidy, by = c("word" = "word", "id_last" = "id"), max_dist = 2)

However, using a loop we can implement the missing functionality anyway:

library(fuzzyjoin)
map_df(unique(x_tidy$id), function(i) {
  current <- x_tidy %>% 
    filter(id == i)
  last <- x_tidy %>% 
    filter(id == i - 1)
  
  current %>%
    fuzzyjoin::stringdist_semi_join(last, by = "word", max_dist = 2)
})
#>      speaker id   word
#> 2.1        B  2  looks
#> 2.2        B  2   good
#> 2.3        B  2     we
#> 2.4        B  2    can
#> 2.5        B  2   take
#> 2.6        B  2      a
#> 2.7        B  2   look
#> 2.8        B  2    you
#> 2.9        B  2     go
#> 3.2        A  3   time
#> 3.3        A  3      i
#> 3.4        A  3 looked
#> 3.5        A  3    was
#> 3.7        A  3      i
#> 3.10       A  3     is
#> 3.11       A  3     it
#> 4          B  4    yes
#> 4.3        B  4      i
#> 4.4        B  4  think
#> 4.5        B  4 that's
#> 4.6        B  4     it

Created on 2021-04-22 by the reprex package (v2.0.0)
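The output above also contains short words such as we and go. With a fixed maximum distance of 2, any two-letter word is within reach of some unrelated short word in the previous turn. One way to reduce this is to scale the distance by word length. The sketch below uses a hypothetical helper, lev_sim(), written in base R. It is not a fuzzyjoin function, and the cutoff of 0.5 mentioned afterwards is only an assumption:

```r
# Hypothetical helper (not part of fuzzyjoin): Levenshtein similarity in [0, 1],
# i.e. 1 minus the edit distance divided by the length of the longer word
lev_sim <- function(a, b) {
  1 - adist(a, b) / outer(nchar(a), nchar(b), pmax)
}

lev_sim("looks", "look")    # 1 - 1/5 = 0.8
lev_sim("looked", "looks")  # 1 - 2/6, about 0.67
lev_sim("yes", "yeah")      # 1 - 2/4 = 0.5
lev_sim("we", "a")          # 1 - 2/2 = 0
```

With a cutoff of, say, 0.5, the pairs from the question would be kept while we against a would be dropped. The right cutoff would have to be tuned on the real data.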

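I am not sure how well this distance suits your needs, or whether you would consider the results correct. Alternatively, you could try stemming or lemmatization before matching, which might work better. I have also written a new function for the package that implements a stringsim_join version, which takes into account the length of the words you are trying to match. But the PR has not been approved yet.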
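As a rough illustration of the stemming idea, here is a sketch that assumes the SnowballC package is installed (it is not used anywhere else in this answer). Both turns are reduced to stems and the stems are then compared exactly. The word vectors are shortened excerpts of turns 2 and 3:

```r
library(SnowballC)  # assumption: SnowballC is available; it provides wordStem()

# Excerpts of turn 2 and turn 3 from the example data
turn2 <- c("sure", "looks", "good")
turn3 <- c("okay", "last", "time", "i", "looked")

# Reduce every word to its stem, e.g. "looks" and "looked" both become "look"
stems2 <- wordStem(turn2, language = "english")
stems3 <- wordStem(turn3, language = "english")

# Words in turn 3 whose stem also occurs in turn 2
repeated <- turn3[stems3 %in% stems2]
repeated
#> [1] "looked"
```

Stemming handles inflected forms like look, looks and looked without any distance threshold, but a pair such as yes and yeah would probably still need a fuzzy comparison.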

regex

r

fuzzy-comparison

approximate