Extracting collocates from texts/sentences

Question

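I have a large number of sentences, each of which contains 'well' at least once. I would like to obtain a list of the two words occurring immediately to the left of 'well' and the two words occurring immediately to its right. For example, in the sentence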

"very well they all three get on well together"

the result for the left side should be: "NA" "very" "get" "on"

and for the right side: "they" "all" "together" "NA"

I suspect that sub() and regular expressions would be useful, but I don't know (exactly) how to assemble the query. How can this be done?

Answer 1

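A combination of quanteda and tidyr will get you there. I have left out the library calls for these two packages and used the package:: prefix instead, so that you can see which function comes from which package.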

text <- "very well they all three get on well together"

library(magrittr)

text %>% 
  quanteda::tokens() %>%  # recent quanteda versions expect a tokens object here
  quanteda::kwic("well", window = 2) %>% 
  data.frame() %>% 
  tidyr::separate(pre, into = c("pre1", "pre2"), fill = "left") %>% 
  tidyr::separate(post, into = c("post1", "post2"), fill = "right")

  docname from to pre1 pre2 keyword    post1 post2
1   text1    2  2 <NA> very    well     they   all
2   text1    8  8  get   on    well together  <NA>
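For comparison, the same lookup can be sketched in base R without any extra packages. This is only a minimal illustration, not part of the answer above: it assumes words are separated by whitespace and that 'well' appears as a standalone token (no attached punctuation). The helper names `word_at` and `collocates` are hypothetical and were made up for this example.

```r
# Hypothetical helper: return the words at positions `idx`,
# with NA for any position that falls outside the sentence.
word_at <- function(words, idx) {
  out <- rep(NA_character_, length(idx))
  ok <- idx >= 1 & idx <= length(words)
  out[ok] <- words[idx[ok]]
  out
}

# Hypothetical helper: two words to the left and two to the right
# of every occurrence of `keyword` in a single sentence.
collocates <- function(sentence, keyword = "well") {
  # Assumption: tokens are separated by whitespace only
  words <- strsplit(trimws(sentence), "[[:space:]]+")[[1]]
  hits <- which(words == keyword)
  data.frame(
    pre1    = word_at(words, hits - 2),
    pre2    = word_at(words, hits - 1),
    keyword = words[hits],
    post1   = word_at(words, hits + 1),
    post2   = word_at(words, hits + 2),
    stringsAsFactors = FALSE
  )
}

text <- "very well they all three get on well together"
res <- collocates(text)

# Flatten row by row to get the vectors asked for in the question
left  <- as.vector(rbind(res$pre1, res$pre2))
right <- as.vector(rbind(res$post1, res$post2))
```

Here `left` holds NA, "very", "get", "on" and `right` holds "they", "all", "together", NA, matching the layout requested in the question. For many sentences, something like `lapply(sentences, collocates)` followed by `do.call(rbind, ...)` should work, though real text with punctuation would need a proper tokenizer such as the one quanteda provides.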

r

collocation