How to sum the columns of a weighted dfm in quanteda?

Question

Consider this example:

library(tibble)
library(dplyr)     # for %>%
library(quanteda)
mytib <- tibble(text = c('i can see clearly now',
                         'the rain is gone'),
                myweight = c(1.7, 0.005))
# A tibble: 2 x 2
  text                  myweight
  <chr>                    <dbl>
1 i can see clearly now    1.7  
2 the rain is gone         0.005

I know how to create a dfm weighted by the docvar myweight. I proceed as follows:

dftest <- mytib %>% 
  corpus() %>% 
  tokens() %>% 
  dfm()

dftest * mytib$myweight 

Document-feature matrix of: 2 documents, 9 features (50.0% sparse).
2 x 9 sparse Matrix of class "dfm"
       features
docs      i can see clearly now   the  rain    is  gone
  text1 1.7 1.7 1.7     1.7 1.7 0     0     0     0    
  text2 0   0   0       0   0   0.005 0.005 0.005 0.005

The problem is that I can use neither topfeatures nor colSums on the result.

So how can I sum the values in each column?

> dftest*mytib$myweight %>% Matrix::colSums(.)
Error in base::colSums(x, na.rm = na.rm, dims = dims, ...) : 
  'x' must be an array of at least two dimensions

Thanks!

Answer 1

Sometimes the %>% operator does more harm than good. This works:

colSums(dftest * mytib$myweight)
##      i     can     see clearly     now     the    rain      is    gone 
##  1.700   1.700   1.700   1.700   1.700   0.005   0.005   0.005   0.005

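If you have a vector of weights for each feature, also consider dfm_weight(x, weights = ...). The operation above recycles your weights so that it works the way you want, but you should understand why it does (in R, because of recycling and its column-major order).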
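To make the recycling point concrete, here is a minimal base R sketch. It uses a plain matrix as a stand-in for the dfm (an assumption for illustration; the names m, w and col_w are hypothetical), so it shows R's general recycling behaviour rather than anything specific to quanteda.

```r
# A plain 2 x 3 matrix standing in for a dfm (rows = documents, columns = features)
m <- matrix(1, nrow = 2, ncol = 3,
            dimnames = list(c("text1", "text2"), c("a", "b", "c")))
w <- c(1.7, 0.005)  # one weight per document (row)

# The matrix is stored column by column, so w is recycled down each column:
# element [1, j] is multiplied by w[1], element [2, j] by w[2]
weighted <- m * w

# The same row scaling written explicitly, without relying on recycling
explicit <- sweep(m, 1, w, `*`)

# With one weight per column, plain `*` recycles in the same column-major
# order and therefore does NOT scale column j by col_w[j]
col_w <- c(10, 20, 30)
recycled <- m * col_w                 # weights end up scattered across cells
by_col <- sweep(m, 2, col_w, `*`)     # scales each column as intended
```

Here weighted and explicit agree because the weight vector has one entry per row. For per-feature weights, recycled and by_col differ, which is why a dedicated function is the safer choice in that case.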

Answer 2

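It is because of operator precedence. If we check ?Syntax, the special operators (%any%, which includes %>%) have higher precedence than multiplication (*):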

...
%any%   special operators (including %% and %/%)  ###
* / multiply, divide   ###
...

Wrap the expression in parentheses and it should work:

(dftest*mytib$myweight) %>% 
       colSums
#     i     can     see clearly     now     the    rain      is    gone 
#   1.700   1.700   1.700   1.700   1.700   0.005   0.005   0.005   0.005
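
The precedence rule can be checked without loading any packages. The sketch below inspects how R parses an expression shaped like the failing one (w is a hypothetical stand-in for mytib$myweight), and then repeats the effect with a made-up %plus% operator defined only for this illustration.

```r
# quote() captures the expression without evaluating it, so neither
# quanteda nor magrittr has to be loaded here
expr <- quote(dftest * w %>% colSums)
top_call <- as.character(expr[[1]])   # the outermost function in the parse tree
right_arg <- deparse(expr[[3]])       # what that function receives on its right

# A hypothetical %plus% operator showing the same grouping with numbers
`%plus%` <- function(a, b) a + b
unparenthesized <- 2 * 3 %plus% 4     # parsed as 2 * (3 %plus% 4)
parenthesized <- (2 * 3) %plus% 4     # the product is formed first
```

The outermost call is `*`, and its right-hand argument is the whole piped expression. In the question, colSums is therefore applied to the numeric weight vector alone, which explains the "must be an array of at least two dimensions" error.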

r

sparse-matrix

quanteda