Quanteda: How do I create a corpus and plot dispersion of words?
I have some data that looks like this:
date signs horoscope newspaper
<chr> <chr> <chr> <chr>
1 06-06-20~ ARIES Your week falls neatly into distinct phases. The completion o~ Indian Expr~
2 06-06-20~ TAURUS You're coming to the end of an emotional period, when you've ~ Indian Expr~
3 06-06-20~ GEMINI Passions are still running high, and the degree of emotional ~ Times of In~
4 06-06-20~ CANCER First things first - don't rush it! There is still a great de~ Indian Expr~
5 06-06-20~ LEO The greatest pressures are coming from all directions at once~ Indian Expr~
From this data, I want to create a corpus in which all the horoscope texts are grouped into documents by newspaper and signs. For example, all the ARIES entries from the Times of India should form one document, but in chronological order (their token indices should follow date order).
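One way to get the chronological ordering is to sort the rows before building the corpus, so that each grouped document's tokens follow date order. A minimal sketch, assuming the truncated `date` strings are day-month-year (the sample output cuts them off at `06-06-20~`, so the exact format is a guess):

```r
library(dplyr)
library(lubridate)

# Assumption: date holds day-month-year strings such as "06-06-2021";
# parse it, then sort so each (newspaper, sign) group is in date order
horoscopes <- horoscopes %>%
  mutate(date = dmy(date)) %>%
  arrange(newspaper, signs, date)
```

After this, documents grouped by sign and newspaper will concatenate texts in chronological order.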
Since I don't know how to group the texts by newspaper and signs, I tried creating two separate corpora, one per newspaper. Here is what I tried:
# Create a dataframe of only Times of India text
h_toi <- horoscopes %>%
filter(newspaper == "Times of India") %>%
select(-c("newspaper"))
# Create a corpus out of this
horo_corp_toi <- corpus(h_toi, text_field = "horoscope")
# Create docids
docids <- paste(h_toi$signs)
# Use this as docnames
docnames(horo_corp_toi) <- docids
head(docnames(horo_corp_toi), 5)
# [1] "ARIES.1" "TAURUS.1" "GEMINI.1" "CANCER.1" "LEO.1"
But as you can see, the docnames of the corpus are "ARIES.1", "TAURUS.1", and so on. This is a problem because when I try to plot it with quanteda's textplot_xray(), thousands of documents are plotted, rather than just 12 documents, one per sign:
# Plot lexical dispersion of love in all signs
kwic(tokens(horo_corp_toi), pattern = "love") %>%
textplot_xray()
Instead, I would like to be able to do something like this:
I can't get this kind of visualization because I don't know how to manipulate the data and create the corpus in the first place. How do I do this, and what am I doing wrong?
A sample dput is here
Since the question asks how to group by both sign and newspaper, I'll answer that first.
library("quanteda")
## Package version: 3.1.0
## Unicode version: 13.0
## ICU version: 69.1
## Parallel computing: 12 of 12 threads used.
## See https://quanteda.io for tutorials and examples.
library("quanteda.textplots")
## horoscopes <- [per linked dput in OP]
corp <- corpus(horoscopes, text_field = "horoscope")
toks <- tokens(corp)
# grouped by sign and newspaper
tokens_group(toks, groups = interaction(signs, newspaper)) %>%
kwic(pattern = "love") %>%
textplot_xray()
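Here interaction() builds a single grouping factor out of the two docvars, joining the levels with a dot by default, which is where combined document names such as "ARIES.Times of India" come from. A quick illustration in plain R:

```r
# interaction() combines factors element-by-element, separated by "."
# e.g. this yields the values "ARIES.Times of India", "TAURUS.Indian Express"
interaction(c("ARIES", "TAURUS"),
            c("Times of India", "Indian Express"))
```

If you prefer a different separator in the document names, interaction() takes a sep argument.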
To get to your desired output (only the last plot is shown here), you can loop over the newspapers and group by signs only. Note that the number of signs is limited here, because the provided sample data does not cover the full range of signs.
# separate kwic for each newspaper
for (i in unique(toks$newspaper)) {
thiskwic <- toks %>%
tokens_subset(newspaper == i) %>%
tokens_group(signs) %>%
kwic(pattern = "love")
  # inside a for loop, ggplot objects must be printed explicitly to display
  print(
    textplot_xray(thiskwic) +
      ggplot2::ggtitle(paste("Lexical dispersion plot -", toupper(i)))
  )
}
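If you want to keep each newspaper's plot as a file rather than (or in addition to) displaying it, the loop body can save the ggplot object instead. A sketch, where the file-name pattern is my own choice, not anything prescribed by quanteda:

```r
for (i in unique(toks$newspaper)) {
  p <- toks %>%
    tokens_subset(newspaper == i) %>%
    tokens_group(signs) %>%
    kwic(pattern = "love") %>%
    textplot_xray() +
    ggplot2::ggtitle(paste("Lexical dispersion plot -", toupper(i)))
  # gsub() replaces spaces so the newspaper name is file-system safe
  ggplot2::ggsave(paste0("xray_", gsub("\\s+", "_", tolower(i)), ".png"),
                  plot = p, width = 8, height = 6)
}
```

This writes one PNG per newspaper (e.g. xray_times_of_india.png) into the working directory.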