Quanteda: Create ngrams and skipgrams from tokens in R

I have been exploring the quanteda package in R, but I can't quite work out what tokens_skipgrams does. Below is the example from the package manual, which I'm not sure I understand:

tokens_skipgrams(toks, n = 3, skip = 0:2, concatenator = " ")   
tokens from 1 document.
text1 :
[1] "insurgents killed in"        "insurgents killed ongoing"  
[3] "insurgents killed fighting"  "insurgents in ongoing"      
[5] "insurgents in fighting"      "insurgents ongoing fighting"
[7] "killed in ongoing"           "killed in fighting"         
[9] "killed ongoing fighting"     "in ongoing fighting"        

I expected the output to contain only the following:

 "insurgents killed in"    "killed in ongoing"    "in ongoing fighting" 
 "insurgents in fighting"

Why does the result also include:

  "insurgents killed ongoing"  
  "insurgents killed fighting"  
  "insurgents in ongoing"      
  "insurgents ongoing fighting"
  "killed in fighting"         
  "killed ongoing fighting" 

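In the example above, skip = 0:2, i.e. skip takes the values 0, 1 and 2. I therefore assumed the command could safely be split into three parts, one per skip value, and that combining their outputs would give the result above. As noted, I cannot reproduce that result this way.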

tokens_skipgrams(toks, n = 3, skip = 0, concatenator = " ")   
tokens from 1 document.
text1 :
[1] "insurgents killed in" "killed in ongoing"    "in ongoing fighting" 

tokens_skipgrams(toks, n = 3, skip = 1, concatenator = " ")   
tokens from 1 document.
text1 :
[1] "insurgents in fighting"


tokens_skipgrams(toks, n = 3, skip = 2, concatenator = " ")   
tokens from 1 document.
text1 :
character(0)

The combination of these three results is exactly what I expected, but it is not the result shown above.

Can anyone help me understand this?

The behaviour you are observing implements the definition of skip-grams from Guthrie et al. (2006): "A k skip-gram is an ngram which is a superset of all ngrams and each (k-i) skipgram until (k-i)==0 (which includes 0 skip-grams)." (This is quoted in the quanteda manual page at ?tokens_skipgrams. The original source is Guthrie, D., B. Allison, W. Liu, and L. Guthrie. 2006. "A Closer Look at Skip-Gram Modelling.") The s02 example below is taken directly from that paper, which calls these "2-skip-tri-grams".

For scalar values of skip, however, this recursive inclusion of smaller skips is not applied, which gives the user maximum control.

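This explains the difference between supplying the skip values as separate scalars, as you did above, and supplying them as the sequence 0:2. For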

toks <- tokens("insurgents killed in ongoing fighting")
toks
# tokens from 1 document.
# text1 :
# [1] "insurgents" "killed"     "in"         "ongoing"    "fighting" 

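we observe combinations such as "insurgents killed fighting" with skip = 0:2 because a single skipgram may combine a skip of 0 (between "insurgents" and "killed") with a skip of 2 (between "killed" and "fighting"). For the phrase here, this means that going from skip = 0:1 to skip = 0:2 adds only two skipgrams: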
(s01 <- tokens_skipgrams(toks, n = 3, skip = 0:1, concatenator = " "))
# tokens from 1 document.
# text1 :
# [1] "insurgents killed in"      "insurgents killed ongoing" "insurgents in ongoing"    
# [4] "insurgents in fighting"    "killed in ongoing"         "killed in fighting"       
# [7] "killed ongoing fighting"   "in ongoing fighting"      

(s02 <- tokens_skipgrams(toks, n = 3, skip = 0:2, concatenator = " "))
# tokens from 1 document.
# text1 :
# [1] "insurgents killed in"        "insurgents killed ongoing"   "insurgents killed fighting" 
# [4] "insurgents in ongoing"       "insurgents in fighting"      "insurgents ongoing fighting"
# [7] "killed in ongoing"           "killed in fighting"          "killed ongoing fighting"    
# [10] "in ongoing fighting"        

setdiff(as.character(s02), as.character(s01))
# [1] "insurgents killed fighting"  "insurgents ongoing fighting"
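
To make the rule concrete, here is a minimal base-R sketch of the behaviour described above. It is not quanteda's implementation: skipgrams_sketch is a hypothetical helper written for illustration, and it assumes that each gap between consecutive tokens of a skipgram may independently skip any number of tokens listed in skip. Under that assumption, a vector such as 0:2 allows mixed gaps, while a scalar forces every gap to be the same.

```r
# Hypothetical helper (not part of quanteda): enumerate skipgrams of length n
# in which each gap between consecutive chosen tokens skips a number of
# tokens taken from `skip`.
skipgrams_sketch <- function(words, n, skip) {
  out <- character(0)
  # Recursively extend a partial skipgram, given the token positions so far
  extend <- function(idx) {
    if (length(idx) == n) {
      out <<- c(out, paste(words[idx], collapse = " "))
      return(invisible(NULL))
    }
    for (k in skip) {
      nxt <- idx[length(idx)] + k + 1L  # position reached after skipping k tokens
      if (nxt <= length(words)) extend(c(idx, nxt))
    }
  }
  for (start in seq_along(words)) extend(start)
  out
}

words <- c("insurgents", "killed", "in", "ongoing", "fighting")

# Vector skip: gaps may be mixed within one skipgram (same set as s02)
mixed <- skipgrams_sketch(words, n = 3, skip = 0:2)

# Scalar skips combined: every gap inside a skipgram is identical,
# which yields the four skipgrams expected in the question
uniform <- unlist(lapply(0:2, function(k) skipgrams_sketch(words, n = 3, skip = k)))

setdiff(mixed, uniform)  # the mixed-gap skipgrams the question asked about
```

If only the uniform-gap skipgrams are wanted from quanteda itself, the same idea applies: call tokens_skipgrams once per scalar skip value and combine the results, as the question already did.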