How do I add or subtract document-term matrices in quanteda?

Question

Consider this simple example:

library(dplyr)     # provides tibble() and %>%
library(quanteda)

dfm1 <- tibble(text = c('hello world',
                        'hello quanteda')) %>% 
  corpus() %>% tokens() %>% dfm()
> dfm1
Document-feature matrix of: 2 documents, 3 features (33.3% sparse).
2 x 3 sparse Matrix of class "dfm"
       features
docs    hello world quanteda
  text1     1     1        0
  text2     1     0        1

and

dfm2 <- tibble(text = c('hello world',
                        'good night quanteda')) %>% 
  corpus() %>% tokens() %>% dfm()
> dfm2
Document-feature matrix of: 2 documents, 5 features (50.0% sparse).
2 x 5 sparse Matrix of class "dfm"
       features
docs    hello world good night quanteda
  text1     1     1    0     0        0
  text2     0     0    1     1        1

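As you can see, the two dfms share the same document identifiers: text1 and text2.

I would like to "subtract" dfm2 from dfm1, so that each entry in dfm1 has its (possibly) matching entry in dfm2 subtracted from it (same document, same word).

For example, in text1, hello appears once in dfm1 and once in dfm2, so the output for that entry should be 0 (that is, 1 - 1). Of course, entries that do not appear in both dfms should remain unchanged.

How can I do this in quanteda?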

Answer 1

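You can use dfm_match() to match the feature set of one dfm to that of another. I have also tidied up your code, since for this short example some of your pipeline can be simplified.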

library("quanteda")
## Package version: 1.4.3
## Parallel computing: 2 of 12 threads used.
## See https://quanteda.io for tutorials and examples.

dfm1 <- dfm(c("hello world", "hello quanteda"))
dfm2 <- dfm(c("hello world", "good night quanteda"))

as.dfm(dfm1 - dfm_match(dfm2, features = featnames(dfm1)))
## Document-feature matrix of: 2 documents, 3 features (33.3% sparse).
## 2 x 3 sparse Matrix of class "dfm"
##        features
## docs    hello world quanteda
##   text1     0     0        0
##   text2     1     0        0
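
To make the alignment step easier to see, here is a minimal sketch that looks at the result of dfm_match() on its own before any arithmetic. It assumes a quanteda version in which tokens() is called explicitly before dfm() (this form should also work in older releases); the name dfm2_aligned is only an illustrative variable, not part of the original answer.

```r
library("quanteda")

# Same toy documents as above, tokenised explicitly before building the dfm
dfm1 <- dfm(tokens(c("hello world", "hello quanteda")))
dfm2 <- dfm(tokens(c("hello world", "good night quanteda")))

# Align dfm2 to the feature set (and feature order) of dfm1:
# features of dfm2 that are not in dfm1 are dropped, and features of dfm1
# that are missing from dfm2 would be added as all-zero columns
dfm2_aligned <- dfm_match(dfm2, features = featnames(dfm1))

featnames(dfm2)          # the five features of dfm2
featnames(dfm2_aligned)  # only the features of dfm1, in dfm1's order

# Both operands now have the same shape, so element-wise subtraction lines up
identical(dim(dfm1), dim(dfm2_aligned))
```

Note that with this approach the features that occur only in dfm2 (good, night) do not appear in the result at all, because the result is restricted to the features of dfm1.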

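The as.dfm() call is needed because arithmetic operators such as + and - are defined for the parent sparse Matrix class rather than specifically for the quanteda dfm, so the operation drops the dfm class and returns a dgCMatrix. Coercing the result back with as.dfm() solves that, but it also drops the original attributes of the dfm object, such as docvars.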
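
If the document variables matter, one possible workaround is to copy them back by hand after the coercion. The following is only a sketch: the docvar name "source" and its values are hypothetical and exist purely to illustrate the point, and the class reported for the intermediate result may differ between quanteda versions.

```r
library("quanteda")

dfm1 <- dfm(tokens(c("hello world", "hello quanteda")))
dfm2 <- dfm(tokens(c("hello world", "good night quanteda")))

# Hypothetical document variable, added only to illustrate the workaround
docvars(dfm1, "source") <- c("a", "b")

# Element-wise subtraction after aligning the features of dfm2 to dfm1
raw_diff <- dfm1 - dfm_match(dfm2, features = featnames(dfm1))
class(raw_diff)  # inspect what the arithmetic returned

# Coerce back to a dfm, then copy the document variable across manually
diff_dfm <- as.dfm(raw_diff)
docvars(diff_dfm, "source") <- docvars(dfm1, "source")

is.dfm(diff_dfm)
docvars(diff_dfm, "source")
```

This relies on the documents being in the same order in both objects, which holds here because the subtraction does not reorder rows.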
