Quanteda: How do I create a corpus and plot dispersion of words?
I have some data that looks like this:
date signs horoscope newspaper
<chr> <chr> <chr> <chr>
1 06-06-20~ ARIES Your week falls neatly into distinct phases. The completion o~ Indian Expr~
2 06-06-20~ TAURUS You're coming to the end of an emotional period, when you've ~ Indian Expr~
3 06-06-20~ GEMINI Passions are still running high, and the degree of emotional ~ Times of In~
4 06-06-20~ CANCER First things first - don't rush it! There is still a great de~ Indian Expr~
5 06-06-20~ LEO The greatest pressures are coming from all directions at once~ Indian Expr~
From this data, I want to create a corpus in which all the horoscope texts are grouped into documents by newspaper and signs. For example, all the ARIES entries from the Times of India should form one document, but in chronological order (their token indices should follow date order).
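One way to get the chronological ordering is to sort the rows before building the corpus, so that each grouped document's tokens follow date order. A minimal sketch, assuming the truncated `date` strings are day-month-year (the sample output cuts them off at `06-06-20~`, so the exact format is a guess):

```r
library(dplyr)
library(lubridate)

# Assumption: date holds day-month-year strings such as "06-06-2021";
# parse it, then sort so each (newspaper, sign) group is in date order
horoscopes <- horoscopes %>%
  mutate(date = dmy(date)) %>%
  arrange(newspaper, signs, date)
```

After this, documents grouped by sign and newspaper will concatenate texts in chronological order.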
Since I don't know how to group the texts by newspaper and signs, I tried creating two separate corpora, one per newspaper. Here is what I tried:
# Create a dataframe of only Times of India text
h_toi <- horoscopes %>%
filter(newspaper == "Times of India") %>%
select(-c("newspaper"))
# Create a corpus out of this
horo_corp_toi <- corpus(h_toi, text_field = "horoscope")
# Create docids
docids <- paste(h_toi$signs)
# Use this as docnames
docnames(horo_corp_toi) <- docids
head(docnames(horo_corp_toi), 5)
# [1] "ARIES.1" "TAURUS.1" "GEMINI.1" "CANCER.1" "LEO.1"
But as you can see, the docnames of the corpus are "ARIES.1", "TAURUS.1", and so on. This is a problem because when I try to plot it with quanteda's textplot_xray(), thousands of documents are plotted, rather than just 12 documents, one per sign:
# Plot lexical dispersion of love in all signs
kwic(tokens(horo_corp_toi), pattern = "love") %>%
textplot_xray()
Instead, I would like to be able to do something like this:
I can't get this kind of visualization because I don't know how to manipulate the data and create the corpus in the first place. How do I do this, and what am I doing wrong?
A sample dput is here
Since the question asks how to group by both sign and newspaper, I'll answer that first.
library("quanteda")
## Package version: 3.1.0
## Unicode version: 13.0
## ICU version: 69.1
## Parallel computing: 12 of 12 threads used.
## See https://quanteda.io for tutorials and examples.
library("quanteda.textplots")
## horoscopes <- [per linked dput in OP]
corp <- corpus(horoscopes, text_field = "horoscope")
toks <- tokens(corp)
# grouped by sign and newspaper
tokens_group(toks, groups = interaction(signs, newspaper)) %>%
kwic(pattern = "love") %>%
textplot_xray()
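Here interaction() builds a single grouping factor out of the two docvars, joining the levels with a dot by default, which is where combined document names such as "ARIES.Times of India" come from. A quick illustration in plain R:

```r
# interaction() combines factors element-by-element, separated by "."
# e.g. this yields the values "ARIES.Times of India", "TAURUS.Indian Express"
interaction(c("ARIES", "TAURUS"),
            c("Times of India", "Indian Express"))
```

If you prefer a different separator in the document names, interaction() takes a sep argument.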
To get to your desired output (only the last plot is shown here), you can loop over the newspapers and group by signs only. Note that the number of signs is limited here, because the provided sample data does not cover the full range of signs.
# separate kwic for each newspaper
for (i in unique(toks$newspaper)) {
thiskwic <- toks %>%
tokens_subset(newspaper == i) %>%
tokens_group(signs) %>%
kwic(pattern = "love")
  # inside a for loop, ggplot objects must be printed explicitly to display
  print(
    textplot_xray(thiskwic) +
      ggplot2::ggtitle(paste("Lexical dispersion plot -", toupper(i)))
  )
}
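If you want to keep each newspaper's plot as a file rather than (or in addition to) displaying it, the loop body can save the ggplot object instead. A sketch, where the file-name pattern is my own choice, not anything prescribed by quanteda:

```r
for (i in unique(toks$newspaper)) {
  p <- toks %>%
    tokens_subset(newspaper == i) %>%
    tokens_group(signs) %>%
    kwic(pattern = "love") %>%
    textplot_xray() +
    ggplot2::ggtitle(paste("Lexical dispersion plot -", toupper(i)))
  # gsub() replaces spaces so the newspaper name is file-system safe
  ggplot2::ggsave(paste0("xray_", gsub("\\s+", "_", tolower(i)), ".png"),
                  plot = p, width = 8, height = 6)
}
```

This writes one PNG per newspaper (e.g. xray_times_of_india.png) into the working directory.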