Extracting collocates from texts/sentences
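I have a large number of sentences, each of which contains the word 'well' at least once. I would like to obtain lists of the two words immediately to the left of 'well' and the two words immediately to its right. For example, in the sentence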
"very well they all three get on well together"
the result for the left side should be:
"NA" "very"
"get" "on"
and for the right side:
"they" "all"
"together" "NA"
I suspect that sub() and regular expressions will be useful, but I don't know exactly how to assemble the query. How can this be done?
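A combination of quanteda and tidyr will get you there. I left out the library calls for those two packages and used the pkg:: prefix instead, so you can see which function comes from which package.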
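text <- "very well they all three get on well together"
library(magrittr)  # provides the %>% pipe
text %>%
  # tokenize first: recent quanteda releases expect a tokens object in kwic()
  quanteda::tokens() %>%
  # keyword-in-context: two tokens on each side of "well"
  quanteda::kwic("well", window = 2) %>%
  data.frame() %>%
  # split the context strings into one column per word, padding with NA
  tidyr::separate(pre, into = c("pre1", "pre2"), fill = "left") %>%
  tidyr::separate(post, into = c("post1", "post2"), fill = "right")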
  docname from to pre1 pre2 keyword    post1 post2
1   text1    2  2 <NA> very    well     they   all
2   text1    8  8  get   on    well together  <NA>
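If you would rather avoid extra packages, the same idea can be sketched in base R. This is only a minimal illustration for a single sentence, assuming that words are separated by whitespace and that 'well' appears as a standalone lowercase token. The names `tokens`, `padded`, `hits` and `collocates` are made up for this example.

```r
text <- "very well they all three get on well together"

# Split the sentence into words on runs of whitespace
tokens <- strsplit(text, "\\s+")[[1]]

# Pad both ends with two NAs so that neighbours that fall outside
# the sentence come back as NA instead of causing indexing problems
padded <- c(NA, NA, tokens, NA, NA)

# Positions of the keyword in the padded vector (which() skips the NAs)
hits <- which(padded == "well")

# One row per occurrence of "well", two words of context on each side
collocates <- data.frame(
  pre1    = padded[hits - 2],
  pre2    = padded[hits - 1],
  keyword = padded[hits],
  post1   = padded[hits + 1],
  post2   = padded[hits + 2],
  stringsAsFactors = FALSE
)
print(collocates)
```

For many sentences, the same steps could be wrapped in a function and applied with lapply(). Punctuation and capitalised forms such as "Well" would need extra handling, which quanteda's tokenizer already takes care of.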