Keyword in context (kwic) for skipgrams?

Question

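I use quanteda to run keyword-in-context analysis on ngrams and tokens, and it works well. I now want to do the same for skipgrams, capturing the context of "barriers to entry" but also of "barriers [...] [and] entry".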

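The code below returns an empty kwic object, but I don't know what I'm doing wrong. doc.corpus refers to the text documents. I also tried the tokenized version, but nothing changed.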

The result is:

"kwic object with 0 rows"

x <- tokens("barriers entry")
ntoken_test <- tokens_ngrams(x, n = 2, skip = 0:4, concatenator = " ")
twic_skipgram <-  kwic(doc.corpus, pattern = list(ntoken_test), window=20, valuetype= "glob")

twic_skipgram

Answer 1

Probably the easiest way is to use wildcards to represent the "skips".

library("quanteda")
## Package version: 2.1.1

txt <- c(
  "There are barriers to entry.",
  "Also barriers against entry.",
  "Just barriers entry."
)

# for skip of 1
kwic(txt, phrase("barriers * entry"))
##                                                     
##  [text1, 3:5] There are |   barriers to entry    | .
##  [text2, 2:4]      Also | barriers against entry | .

# for skip of 0 and 1
kwic(txt, phrase(c("barriers * entry", "barriers entry")))
##                                                     
##  [text1, 3:5] There are |   barriers to entry    | .
##  [text2, 2:4]      Also | barriers against entry | .
##  [text3, 2:3]      Just |     barriers entry     | .
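
To cover a range of skips such as the 0:4 in the question, the wildcard patterns can be generated rather than typed out. The following is only a sketch: `skip_patterns()` is a hypothetical helper, not a quanteda function, and it assumes the phrase has a single first and last word. The text is tokenized before calling `kwic()`, which should work in both older and newer quanteda releases. Note that `*` matches any single token, punctuation included.

```r
# Hypothetical helper (not part of quanteda): build one phrase pattern per
# skip distance, e.g. "barriers entry", "barriers * entry", "barriers * * entry".
skip_patterns <- function(first, last, max_skip = 4) {
  vapply(
    0:max_skip,
    function(k) paste(c(first, rep("*", k), last), collapse = " "),
    character(1)
  )
}

pats <- skip_patterns("barriers", "entry", max_skip = 2)
print(pats)

# Only run the kwic step if quanteda is installed
if (requireNamespace("quanteda", quietly = TRUE)) {
  library("quanteda")
  txt <- c(
    "There are barriers to entry.",
    "Also barriers against entry.",
    "Just barriers entry."
  )
  # Tokenize first, then match every skip pattern as a multi-token phrase
  toks <- tokens(txt)
  print(kwic(toks, pattern = phrase(pats), window = 20))
}
```

For the question's case, `skip_patterns("barriers", "entry", max_skip = 4)` would yield five patterns, one for each skip from 0 to 4.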

