Quanteda: group documents by multiple variables
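I want to group the documents in my dfm by two variables, speaker and week_start. I used to be able to do this with
dfm(corpus, groups = c("speaker", "week_start"))
which worked well and grouped the documents by speaker-week.
Since the recent update to the quanteda package, however, I have been running into problems. I now create the dfm first and then try to group it. Here is the code: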
dfm <- dfm(corpus)
dfm <- dfm_group(dfm, groups = c(speaker, week_start))
However, when I do this I get an error:
Error: groups must have length ndoc(x)
I also tried putting the docvar names in quotes, but that produces the same error.
We changed how the groups argument works in v3 to make it more standard.
From news(Version >= "3.0", package = "quanteda"):
We have added non-standard evaluation for by and groups arguments to access object docvars:
- The *_sample() functions' argument by, and groups in the *_group() functions, now take unquoted document variable (docvar) names directly, similar to the way the subset argument works in the *_subset() functions.
- Quoted docvar names no longer work, as these will be evaluated literally.
- The by = "document" formerly sampled from docid(x), but this functionality is now removed. Instead, use by = docid(x) to replicate this functionality.
- For groups, the default is now docid(x), which is now documented more completely. See ?groups and ?docid.
So, to get the previous behavior, you need to use:
groups = interaction(speaker, week_start)
Here is an example:
library("quanteda")
## Package version: 3.0
## Unicode version: 13.0
## ICU version: 69.1
## Parallel computing: 12 of 12 threads used.
## See https://quanteda.io for tutorials and examples.
corp <- corpus(
  c(
    "a b c",
    "a c d",
    "c d d",
    "d d e"
  ),
  docvars = data.frame(
    var1 = c("a", "a", "b", "b"),
    var2 = c(1, 2, 1, 1)
  )
)
corp %>%
  tokens() %>%
  dfm() %>%
  dfm_group(groups = interaction(var1, var2))
## Document-feature matrix of: 3 documents, 5 features (40.00% sparse) and 2 docvars.
##      features
## docs  a b c d e
##   a.1 1 1 1 0 0
##   b.1 0 0 1 4 1
##   a.2 1 0 1 1 0
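To see why c(speaker, week_start) triggers the length error while interaction() does not, it may help to look at the two calls in plain base R, outside quanteda. The sketch below uses ordinary vectors as stand-ins for the docvars (the names speaker and week_start come from the question, the values are the toy var1/var2 from the example above); it only illustrates the lengths involved and is not a description of what dfm_group() does internally.

```r
# Stand-in vectors for the two docvars (assumed values, taken from the
# toy corpus above: 4 documents).
speaker    <- c("a", "a", "b", "b")
week_start <- c(1, 2, 1, 1)

# c() concatenates the two vectors end to end, so 4 documents yield
# 8 values, which cannot be matched one-to-one against the documents.
concatenated <- c(speaker, week_start)
length(concatenated)   # 8

# interaction() crosses the vectors element-wise and returns a factor
# with exactly one combined label per document.
grp <- interaction(speaker, week_start)
length(grp)            # 4
as.character(grp)      # "a.1" "a.2" "b.1" "b.1"

# Combinations that never occur (here "b.2") are kept as factor levels
# unless drop = TRUE is passed.
levels(grp)
levels(interaction(speaker, week_start, drop = TRUE))
```

The level order a.1, b.1, a.2 matches the row order of the grouped dfm printed above, and the never-observed b.2 combination does not show up there as a document.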