How to find and plot frequency of n-grams in R?

Question

What I'm trying to do is find the frequency of multiple words/phrases and plot them in a graph by year.

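I have been able to do this for single words, such as "american", but I can't get it to work for multi-word expressions, such as "united states".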

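My df has a column for the actual text, and then additional columns for metadata such as author, year, and organization.

This is the code I used for single words like "american":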

a_corpus <- corpus(df, text = "text")

freq_grouped_year <- textstat_frequency(dfm(tokens(a_corpus)), 
                               groups = a_corpus$Year)


# COLLECTION NAME - Filter the term "american", use lower case words 
freq_word_year <- subset(freq_grouped_year, freq_grouped_year$feature 
%in% "american")  


ggplot(freq_word_year, aes(x = group, y = frequency)) +
    geom_point() + 
    scale_y_continuous(limits = c(0, 300), breaks = c(seq(0, 300, 
    30))) +
    xlab(NULL) + 
    ylab("Frequency") +
    theme(axis.text.x = element_text(angle = 90, hjust = 1))

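When I try to use a bigram like "united states", nothing shows up. As I understand it, dfm creates a list of individual words, so their order is not preserved, and looking for bigrams or longer n-grams therefore doesn't work.

Is there a way to find the frequency of bigrams, trigrams, or longer n-grams?

Thanks!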

Answer 1

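To identify compound tokens, or in quanteda terminology, phrases, you need to compound the tokens using a fixed list of compounds. (There are other ways, such as filtering the output of textstat_collocations(), but since you have a fixed list to choose from here, this is the simplest.)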

library("quanteda")
## Package version: 3.0
## Unicode version: 10.0
## ICU version: 61.1
## Parallel computing: 12 of 12 threads used.
## See https://quanteda.io for tutorials and examples.
library("quanteda.textstats")

a_corpus <- head(data_corpus_inaugural)

toks <- tokens(a_corpus)
toks <- tokens_compound(toks, phrase("United States"), concatenator = " ")

freq_grouped_year <- textstat_frequency(dfm(toks, tolower = FALSE), groups = Year)
freq_word_year <- subset(freq_grouped_year, freq_grouped_year$feature %in% "United States")

library("ggplot2")
ggplot(freq_word_year, aes(x = group, y = frequency)) +
  geom_point() +
  # scale_y_continuous(limits = c(0, 300), breaks = c(seq(0, 300, 30))) +
  xlab(NULL) +
  ylab("Frequency") +
  theme(axis.text.x = element_text(angle = 90, hjust = 1))
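
Since the question asks about several words/phrases at once, the same approach can be sketched for a list of phrases. This is only a minimal sketch under assumptions: the vector target_phrases and its contents are hypothetical stand-ins for your own list, and the features are lower-cased (the dfm() default) so that matching is case-insensitive.

```r
library("quanteda")
library("quanteda.textstats")
library("ggplot2")

# Hypothetical list of phrases of interest; replace with your own.
target_phrases <- c("United States", "fellow citizens")

# Same sample corpus as above; for your data this would be corpus(df, ...).
toks <- tokens(head(data_corpus_inaugural), remove_punct = TRUE)

# Join every occurrence of each phrase into one token, keeping the space.
toks <- tokens_compound(toks, phrase(target_phrases), concatenator = " ")

# dfm() lower-cases features by default, so compare against lower-cased phrases.
freq_by_year <- textstat_frequency(dfm(toks), groups = Year)
freq_phrases <- subset(freq_by_year, feature %in% tolower(target_phrases))

# One colour per phrase, one point per year in which the phrase occurs.
p <- ggplot(freq_phrases, aes(x = group, y = frequency, colour = feature)) +
  geom_point() +
  xlab(NULL) +
  ylab("Frequency") +
  theme(axis.text.x = element_text(angle = 90, hjust = 1))
print(p)
```

A phrase that never occurs in a given year simply has no row for that year, so it produces no point rather than a zero. If the phrases are not known in advance, tokens_ngrams(toks, n = 2) would generate every bigram instead, at the cost of a much larger dfm.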
