Quanteda: How do I create a corpus and plot dispersion of words?

I have some data that looks like this:

  date      signs  horoscope                                                      newspaper   
  <chr>     <chr>  <chr>                                                          <chr>       
1 06-06-20~ ARIES  Your week falls neatly into distinct phases. The completion o~ Indian Expr~
2 06-06-20~ TAURUS You're coming to the end of an emotional period, when you've ~ Indian Expr~
3 06-06-20~ GEMINI Passions are still running high, and the degree of emotional ~ Times of In~
4 06-06-20~ CANCER First things first - don't rush it! There is still a great de~ Indian Expr~
5 06-06-20~ LEO    The greatest pressures are coming from all directions at once~ Indian Expr~

I want to create a corpus from this data in which the `horoscope` texts are grouped by `newspaper` and `signs` to form the documents.

For example, all the ARIES entries from the Times of India should form a single document, ordered chronologically by date (their index should be sorted by date).

Since I didn't know how to group the texts by both `newspaper` and `signs`, I tried creating two separate corpora, one per newspaper. Here is what I tried:


# Create a dataframe of only Times of India text
h_toi <- horoscopes %>%
  filter(newspaper == "Times of India") %>%
  select(-c("newspaper"))
  
# Create a corpus out of this
horo_corp_toi <- corpus(h_toi, text_field = "horoscope")

# Create docids
docids <- paste(h_toi$signs)

# Use this as docnames
docnames(horo_corp_toi) <- docids

head(docnames(horo_corp_toi), 5)
# [1] "ARIES.1"  "TAURUS.1" "GEMINI.1" "CANCER.1" "LEO.1" 
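One idea I had was to make the docids unique by also including the date (an untested sketch; it assumes `h_toi$date` holds the dates shown above):

```r
# hypothetical: combine sign and date so each docid is unique
docids <- paste(h_toi$signs, h_toi$date, sep = ".")
docnames(horo_corp_toi) <- docids
```

but that still leaves one document per row rather than one document per sign.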

But as you can see, the docnames of the corpus are `"ARIES.1"`, `"TAURUS.1"`, and so on. This is a problem, because when I try to plot it with quanteda's `textplot_xray()`, thousands of documents are plotted instead of just 12, one per sign:

# Plot lexical dispersion of love in all signs 
kwic(tokens(horo_corp_toi), pattern = "love") %>%
    textplot_xray()

Instead, I would like to be able to do something like this:

I can't get that visualization, because I don't know how to manipulate the data and create the corpus in the first place. How do I do this, and what am I doing wrong?

A sample `dput` is here.

Since the question asks how to group by both sign and newspaper, I'll answer that part first.

library("quanteda")
## Package version: 3.1.0
## Unicode version: 13.0
## ICU version: 69.1
## Parallel computing: 12 of 12 threads used.
## See https://quanteda.io for tutorials and examples.
library("quanteda.textplots")

## horoscopes <- [per linked dput in OP]

corp <- corpus(horoscopes, text_field = "horoscope")
toks <- tokens(corp)

# grouped by sign and newspaper
tokens_group(toks, groups = interaction(signs, newspaper)) %>%
  kwic(pattern = "love") %>%
  textplot_xray()
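The grouping works because `interaction()` builds one combined factor level per sign-newspaper pair, and `tokens_group()` then concatenates all tokens sharing a level into a single document. A minimal base-R illustration (hypothetical values, not from the dataset):

```r
# interaction() combines two vectors into a single factor,
# with one level per (sign, newspaper) combination
signs  <- c("ARIES", "ARIES", "TAURUS")
papers <- c("Times of India", "Indian Express", "Times of India")
interaction(signs, papers)
# values like ARIES.Times of India, ARIES.Indian Express, TAURUS.Times of India
```

so each resulting document name carries both grouping variables.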

To get the result shown above (only the last plot is reproduced here), loop over the newspapers and group only by `signs`. Note that only a few signs appear, because the sample data provided does not cover the full range of signs.

# separate kwic for each newspaper
for (i in unique(toks$newspaper)) {
  thiskwic <- toks %>%
    tokens_subset(newspaper == i) %>%
    tokens_group(signs) %>%
    kwic(pattern = "love")
  # ggplot objects are not printed automatically inside a loop,
  # so wrap the plot in print()
  print(
    textplot_xray(thiskwic) +
      ggplot2::ggtitle(paste("Lexical dispersion plot -", toupper(i)))
  )
}