Quanteda: Create ngrams and skipgrams from tokens in R
I have been exploring the quanteda package in R, but I can't quite work out what tokens_skipgrams does. Below is the example from the manual of this package, and I'm not sure I have understood it:
tokens_skipgrams(toks, n = 3, skip = 0:2, concatenator = " ")
tokens from 1 document.
text1 :
[1] "insurgents killed in" "insurgents killed ongoing"
[3] "insurgents killed fighting" "insurgents in ongoing"
[5] "insurgents in fighting" "insurgents ongoing fighting"
[7] "killed in ongoing" "killed in fighting"
[9] "killed ongoing fighting" "in ongoing fighting"
I expected the output to contain the following:
"insurgents killed in" "killed in ongoing" "in ongoing fighting"
"insurgents in fighting"
Why does the result also include:
"insurgents killed ongoing"
"insurgents killed fighting"
"insurgents in ongoing"
"insurgents ongoing fighting"
"killed in fighting"
"killed ongoing fighting"
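In the example above, skip = 0:2, i.e. skip is 0, 1 and 2. So I assumed the command above could safely be split into three parts, and that combining their results would give me the output above. As I noted, that is not what I get.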
tokens_skipgrams(toks, n = 3, skip = 0, concatenator = " ")
tokens from 1 document.
text1 :
[1] "insurgents killed in" "killed in ongoing" "in ongoing fighting"
tokens_skipgrams(toks, n = 3, skip = 1, concatenator = " ")
tokens from 1 document.
text1 :
[1] "insurgents in fighting"
tokens_skipgrams(toks, n = 3, skip = 2, concatenator = " ")
tokens from 1 document.
text1 :
character(0)
The combination of these three results is exactly what I expected, but it is not the result given above.
Can anyone help me figure this out?
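The behaviour you observe is an implementation of the definition of skip-grams from Guthrie et al. (2006): "A k skip-gram is an ngram which is a superset of all ngrams and each (k-i) skipgram until (k-i)==0 (which includes 0 skip-grams)." (This is cited in the quanteda manual page for ?tokens_skipgrams. The original source is Guthrie, D., B. Allison, W. Liu, and L. Guthrie. 2006. "A Closer Look at Skip-Gram Modelling.") The s02 example below is taken directly from that paper, which calls these "2-skip-tri-grams".
However, for a scalar value of skip, this recursive inclusion of the smaller skips is not applied, in order to give the user maximum control.
This explains the difference between supplying the skip values as separate scalars, as you did above, and supplying them as the sequence 0:2. For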
toks <- tokens("insurgents killed in ongoing fighting")
toks
# tokens from 1 document.
# text1 :
# [1] "insurgents" "killed" "in" "ongoing" "fighting"
we observe combinations such as "insurgents killed fighting" when skip = 0:2, because this includes skips of 0 (between "insurgents" and "killed") and 2 (between "killed" and "fighting"). For the phrase here, going from skip = 0:1 to skip = 0:2 yields just two additional skipgrams:
(s01 <- tokens_skipgrams(toks, n = 3, skip = 0:1, concatenator = " "))
# tokens from 1 document.
# text1 :
# [1] "insurgents killed in" "insurgents killed ongoing" "insurgents in ongoing"
# [4] "insurgents in fighting" "killed in ongoing" "killed in fighting"
# [7] "killed ongoing fighting" "in ongoing fighting"
(s02 <- tokens_skipgrams(toks, n = 3, skip = 0:2, concatenator = " "))
# tokens from 1 document.
# text1 :
# [1] "insurgents killed in" "insurgents killed ongoing" "insurgents killed fighting"
# [4] "insurgents in ongoing" "insurgents in fighting" "insurgents ongoing fighting"
# [7] "killed in ongoing" "killed in fighting" "killed ongoing fighting"
# [10] "in ongoing fighting"
setdiff(as.character(s02), as.character(s01))
# [1] "insurgents killed fighting" "insurgents ongoing fighting"
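To make the difference concrete, here is a minimal base-R sketch of the combination logic as I read it from the outputs above: each gap between adjacent selected tokens independently takes any value in skip. The function skipgrams_sketch and the vector toks_chr are hypothetical names used for illustration only. They are not part of quanteda, and this is not how quanteda implements it internally.

```r
# Hypothetical helper, NOT part of quanteda: enumerate skip-grams in which each
# of the n - 1 gaps independently takes a value from `skip`. Assumes n >= 2.
skipgrams_sketch <- function(tokens, n = 3, skip = 0:2, concatenator = " ") {
  # every combination of n - 1 gap sizes drawn from `skip`
  gaps <- as.matrix(expand.grid(rep(list(skip), n - 1)))
  out <- character(0)
  for (start in seq_along(tokens)) {
    for (g in seq_len(nrow(gaps))) {
      # positions of the selected tokens: each step advances by 1 + gap
      idx <- start + c(0, cumsum(gaps[g, ] + 1))
      if (max(idx) <= length(tokens)) {
        out <- c(out, paste(tokens[idx], collapse = concatenator))
      }
    }
  }
  unique(out)
}

toks_chr <- c("insurgents", "killed", "in", "ongoing", "fighting")

# vector skip: gaps may be mixed within one skip-gram (e.g. 0, then 2)
s02_sketch <- skipgrams_sketch(toks_chr, n = 3, skip = 0:2)
s01_sketch <- skipgrams_sketch(toks_chr, n = 3, skip = 0:1)

# scalar skips: every gap in a skip-gram has the same size
union_scalars <- unique(unlist(lapply(0:2, function(k) {
  skipgrams_sketch(toks_chr, n = 3, skip = k)
})))

length(s02_sketch)     # 10
length(s01_sketch)     # 8
length(union_scalars)  # 4
setdiff(s02_sketch, s01_sketch)
```

The union of the three scalar calls gives only the four skip-grams listed in the question, because a scalar skip forces every gap to the same size. The vector form also allows mixed gaps, which is where the other six come from. The order of the sketch's results may differ from quanteda's.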