Quanteda group documents by multiple variables

Question

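I'd like to group the documents in my dfm by two variables, speaker and week_start. Previously I could use dfm(corpus, groups = c("speaker", "week_start")). This worked well and grouped the documents by speaker-week.

However, since the recent update to the quanteda package, I have been running into problems. So now I create the dfm first and then try to group it. Here is the code: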

dfm <- dfm(corpus)
dfm <- dfm_group(dfm, groups = c(speaker, week_start))

However, when I do this I get an error:

Error: groups must have length ndoc(x)

I also tried putting the docvars in quotes, but that produces the same error.

Answer 1

We changed how the groups argument works in v3, to make it more standard.

From news(Version >= "3.0", package = "quanteda"):

We have added non-standard evaluation for by and groups arguments to access object docvars:

The *_sample() functions' argument by, and groups in the *_group() functions, now take unquoted document variable (docvar) names directly, similar to the way the subset argument works in the *_subset() functions.

Quoted docvar names no longer work, as these will be evaluated literally.

The by = "document" formerly sampled from docid(x), but this functionality is now removed. Instead, use by = docid(x) to replicate this functionality.

For groups, the default is now docid(x), which is now documented more completely. See ?groups and ?docid.

So, to get the previous behaviour, you need to use:

groups = interaction(speaker, week_start)
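As a rough illustration of why c(speaker, week_start) fails the length check, here is a base R sketch. The speaker and week_start vectors below are made-up stand-ins for the docvars in the question, not the asker's data. c() concatenates the two vectors end to end, so the result is twice as long as the number of documents, whereas interaction() returns one combined label per document.

```r
# Hypothetical docvars for four documents (not from the original question)
speaker    <- c("A", "A", "B", "B")
week_start <- c("2021-01-04", "2021-01-11", "2021-01-04", "2021-01-04")

# c() appends one vector to the other: 4 + 4 = 8 values,
# which no longer matches the number of documents
combined <- c(speaker, week_start)
length(combined)

# interaction() pairs the values element by element:
# one speaker-week label per document, so the length stays 4
groups <- interaction(speaker, week_start)
length(groups)
as.character(groups)
```

With four documents, a groups vector of length 8 is what produces "groups must have length ndoc(x)".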

Here is an example:

library("quanteda")
## Package version: 3.0
## Unicode version: 13.0
## ICU version: 69.1
## Parallel computing: 12 of 12 threads used.
## See https://quanteda.io for tutorials and examples.

corp <- corpus(c(
  "a b c",
  "a c d",
  "c d d",
  "d d e"
),
docvars = data.frame(
  var1 = c("a", "a", "b", "b"),
  var2 = c(1, 2, 1, 1)
)
)
corp %>%
  tokens() %>%
  dfm() %>%
  dfm_group(groups = interaction(var1, var2))
## Document-feature matrix of: 3 documents, 5 features (40.00% sparse) and 2 docvars.
##      features
## docs  a b c d e
##   a.1 1 1 1 0 0
##   b.1 0 0 1 4 1
##   a.2 1 0 1 1 0
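
If you prefer different group labels, any vector with one value per document should work as groups. The following is only a sketch under that assumption: it reuses the toy corpus above, builds the labels with paste() from docvars() outside of dfm_group(), and checks the length before grouping. The names dfmat, group_labels and dfmat_grouped are chosen here for illustration.

```r
library("quanteda")

corp <- corpus(
  c("a b c", "a c d", "c d d", "d d e"),
  docvars = data.frame(
    var1 = c("a", "a", "b", "b"),
    var2 = c(1, 2, 1, 1)
  )
)
dfmat <- dfm(tokens(corp))

# Build one label per document from the two docvars
group_labels <- paste(docvars(dfmat, "var1"), docvars(dfmat, "var2"), sep = "_")

# The check that the failing call did not pass
stopifnot(length(group_labels) == ndoc(dfmat))

# Group on the precomputed labels
dfmat_grouped <- dfm_group(dfmat, groups = group_labels)
docnames(dfmat_grouped)
```

This gives the same three groups as interaction(var1, var2), only with "_" instead of "." in the document names.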

nlp

r

quanteda