How to create interactions with quanteda?
Consider the following example:
library(quanteda)
library(tidyverse)
tibble(text = c('the dog is growing tall',
'the grass is growing as well')) %>%
corpus() %>% dfm()
Document-feature matrix of: 2 documents, 8 features (31.2% sparse).
features
docs the dog is growing tall grass as well
text1 1 1 1 1 1 0 0 0
text2 1 0 1 1 0 1 1 1
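I would like to create interactions between dog and the other tokens in each sentence. That is, create the features the-dog, is-dog, growing-dog, and tall-dog and add them to the dfm (on top of the features we already have).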
For example, the-dog equals 1 if both the and dog occur in the sentence (and zero otherwise). So the-dog is one for the first sentence and zero for the second.
Note that I only create an interaction term when dog occurs in the sentence, so dog-grass is not needed here.
How can I do this efficiently in quanteda?
library("quanteda")
## Package version: 2.1.2
toks <- tokens(c(
"the dog is growing tall",
"the grass is growing as well"
))
# now keep just tokens co-occurring with "dog"
toks_dog <- tokens_select(toks, "dog", window = 1e5)
# create the dfm and label other terms as interactions with dog
dfmat_dog <- dfm(toks_dog) %>%
dfm_remove("dog")
colnames(dfmat_dog) <- paste(featnames(dfmat_dog), "dog", sep = "-")
dfmat_dog
## Document-feature matrix of: 2 documents, 4 features (50.00% sparse) and 0 docvars.
## features
## docs the-dog is-dog growing-dog tall-dog
## text1 1 1 1 1
## text2 0 0 0 0
# combine with other features
print(cbind(dfm(toks), dfmat_dog), max_nfeat = -1)
## Document-feature matrix of: 2 documents, 12 features (37.50% sparse) and 0 docvars.
## features
## docs the dog is growing tall grass as well the-dog is-dog growing-dog
## text1 1 1 1 1 1 0 0 0 1 1 1
## text2 1 0 1 1 0 1 1 1 0 0 0
## features
## docs tall-dog
## text1 1
## text2 0
Created on 2021-03-18 by the reprex package (v1.0.0)
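As a cross-check on the definition in the question, here is a minimal base-R sketch that builds the same interaction columns directly from the indicator matrix. It does not use quanteda. The matrix m is copied by hand from the dfm output in the question, and the names target, has_target, partners, and inter are made up for illustration.

```r
# Indicator matrix copied by hand from the dfm output in the question
m <- matrix(
  c(1, 1, 1, 1, 1, 0, 0, 0,
    1, 0, 1, 1, 0, 1, 1, 1),
  nrow = 2, byrow = TRUE,
  dimnames = list(
    docs = c("text1", "text2"),
    features = c("the", "dog", "is", "growing", "tall", "grass", "as", "well")
  )
)

target <- "dog"

# TRUE for documents that contain the target word
has_target <- m[, target] > 0

# partner features: those occurring in at least one document that contains the target
partners <- setdiff(
  colnames(m)[colSums(m[has_target, , drop = FALSE]) > 0],
  target
)

# interaction is 1 only when both the partner and the target occur in the document;
# has_target is recycled down the rows, so each row is multiplied by its own flag
inter <- (m[, partners, drop = FALSE] > 0) * has_target
colnames(inter) <- paste(partners, target, sep = "-")

# original features plus the interaction columns
cbind(m, inter)
```

On this toy input the interaction columns should agree with dfmat_dog above. For a real corpus the tokens_select() approach is the one to use, since a dense matrix like m does not scale.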