Quanteda textplot_xray grouped by non-unique docvar as document

Question

I have a Quanteda corpus of 10 documents, several of which are by the same author. I store the author in a separate docvar column, myCorpus$documents[,"author"]:

> docvars(myCorpus)

          author   
206035    author1   
269823    author2   
304225    author1   
422364    author2
<...snip..>

I am drawing a lexical dispersion plot with textplot_xray:

textplot_xray(
            kwic(myCorpus, "image"),
            kwic(myCorpus, "one"),
            kwic(myCorpus, "like"),
            kwic(myCorpus, "time"),
            kwic(myCorpus, "just"),
            scale = "absolute"
          )

How can I use myCorpus$documents[,"author"] as the document identifier in the plot instead of the document ID?

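I am not trying to group the documents; I just want to identify each document by its author. I realise that document IDs need to be unique, so I cannot simply rename the documents with docnames(myCorpus)<-.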

Answer 1

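The document names shown in the textplot are taken from the docnames of the corpus. In this case, you want to create new documents grouped by the author document variable. This can be done with the texts() extractor function and its groups argument.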

To create a reproducible example, I will use the built-in data object data_char_sampletext, split it into sentences to form new documents, and then simulate an author docvar.

library("quanteda")
# quanteda version 1.0.0

myCorpus <- corpus(data_char_sampletext) %>% 
    corpus_reshape(to = "sentences")
# make some duplicated author docvar values
set.seed(1)
docvars(myCorpus, "author") <- 
    sample(c("author1", "author2", "author3"), 
           size = ndoc(myCorpus), replace = TRUE)

This yields:

summary(myCorpus)
# Corpus consisting of 15 documents:
#     
#     Text Types Tokens Sentences  author
#  text1.1    23     23         1 author1
#  text1.2    40     53         1 author2
#  text1.3    48     63         1 author2
#  text1.4    30     39         1 author3
#  text1.5    20     25         1 author1
#  text1.6    43     57         1 author3
#  text1.7    13     15         1 author3
#  text1.8    25     26         1 author2
#  text1.9     9      9         1 author2
# text1.10    37     53         1 author1
# text1.11    32     41         1 author1
# text1.12    30     30         1 author1
# text1.13    28     35         1 author3
# text1.14    16     18         1 author2
# text1.15    32     42         1 author3
# 
# Source:  /Users/kbenoit/tmp/* on x86_64 by kbenoit
# Created: Fri Feb 16 18:03:13 2018
# Notes:   corpus_reshape.corpus(., to = "sentences")

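Now we extract the texts as a character vector, grouping them by the author document variable. This produces a named character vector of length 3, where the names are the (unique) author identifiers.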

groupedtexts <- texts(myCorpus, groups = "author")
length(groupedtexts)
# [1] 3
names(groupedtexts)
# [1] "author1" "author2" "author3"
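
Conceptually, the grouping step above concatenates all texts that share the same author value into a single string named after that author. The following is a minimal base-R sketch of that idea, not quanteda's actual implementation; the toy vectors txt and author are hypothetical, and the single-space separator is an assumption.

```r
# Hypothetical toy data: four short "documents" and a non-unique author label
txt <- c(d1 = "one image", d2 = "just like that",
         d3 = "time and again", d4 = "for now")
author <- c("author1", "author2", "author1", "author2")

# Split the texts by author, then paste each group into one string.
# The " " separator is an assumption made for this sketch.
grouped <- vapply(split(txt, author), paste, character(1), collapse = " ")

length(grouped)  # one element per unique author
names(grouped)   # the unique author labels, usable as document names
```

Because the result has one element per unique author, its names can serve as unique document names even though the original docvar values repeat.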

Then the plot:

textplot_xray(
    kwic(groupedtexts, "and"),
    kwic(groupedtexts, "for")
)
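
The code above was written against quanteda 1.0.0. As a hedged sketch for more recent releases (assuming quanteda 3.0 or later with the separate quanteda.textplots package installed), the same grouping can be expressed with corpus_group(), with kwic() applied to a tokens object. The object names grouped and toks are illustrative only.

```r
library("quanteda")
library("quanteda.textplots")  # assumption: textplot_xray() is loaded from this package

# Rebuild the example corpus: one document per sentence, simulated author docvar
myCorpus <- corpus_reshape(corpus(data_char_sampletext), to = "sentences")
set.seed(1)
docvars(myCorpus, "author") <-
    sample(c("author1", "author2", "author3"),
           size = ndoc(myCorpus), replace = TRUE)

# Combine documents sharing an author; the docnames become the author labels
grouped <- corpus_group(myCorpus, groups = author)
toks <- tokens(grouped)

textplot_xray(
    kwic(toks, pattern = "and"),
    kwic(toks, pattern = "for")
)
```

The random draw of authors can differ between R versions even with the same seed, so the per-author counts may not match the summary shown earlier.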
