Extracting collocates from texts/sentences

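I have many sentences, each containing at least one occurrence of 'well'. I would like to get a list of the two words immediately to the left of 'well' and a list of the two words immediately to the right of 'well'. For example, in the sentence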

"very well they all three get on well together"

the result for the left should be: "NA" "very" "get" "on"

and for the right: "they" "all" "together" "NA"

I suspect that sub() and regular expressions would be useful, but I don't know (exactly) how to assemble the query. How can this be done?

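A combination of quanteda and tidyr will get you there. I left out the library calls for those two packages so that you can see which function comes from which package.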

text <- "very well they all three get on well together"

library(magrittr)

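# Assumption: recent quanteda releases expect a tokens object in kwic(),
# so the text is tokenized first; the result may then also show a "pattern" column.
text %>%
  quanteda::tokens() %>%
  quanteda::kwic("well", window = 2) %>%
  data.frame() %>%
  tidyr::separate(pre, into = c("pre1", "pre2"), fill = "left") %>%
  tidyr::separate(post, into = c("post1", "post2"), fill = "right")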

  docname from to pre1 pre2 keyword    post1 post2
1   text1    2  2 <NA> very    well     they   all
2   text1    8  8  get   on    well together  <NA>
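
If you would rather avoid extra packages, the same lookup can be sketched in base R. This is only a rough alternative, assuming the sentences can be split on whitespace and that 'well' should match as a whole word; `neighbours()` is a hypothetical helper name, not a function from quanteda or tidyr.

```r
# Hypothetical helper: two words to the left and right of every "well".
neighbours <- function(sentence, keyword = "well") {
  offsets <- c(-2, -1, 1, 2)
  words <- strsplit(sentence, "\\s+")[[1]]   # naive whitespace tokenization
  hits  <- which(words == keyword)           # positions of the keyword

  # One row per hit, one column per offset.
  idx <- outer(hits, offsets, "+")
  idx[idx < 1] <- NA                         # before the sentence start -> NA

  # Indexing past the end of 'words' yields NA automatically.
  matrix(words[as.vector(idx)],
         nrow = length(hits), ncol = length(offsets),
         dimnames = list(NULL, c("pre1", "pre2", "post1", "post2")))
}

text <- "very well they all three get on well together"
res  <- neighbours(text)

# Flatten row by row to get the vectors asked for in the question.
left  <- as.vector(t(res[, c("pre1", "pre2"), drop = FALSE]))
right <- as.vector(t(res[, c("post1", "post2"), drop = FALSE]))
```

Here `left` and `right` hold the words in sentence order, with NA wherever the window runs past either end of the sentence. Punctuation attached to words is not handled, which is one reason a proper tokenizer such as the one in quanteda is usually the safer choice.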