Words that always appear together in R

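I am using R, and my dataset has a text column. Is there a way to find which words always appear together, for example the most frequent two-word sequences, three-word sequences, and so on?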

For example:

Happy birthday to you 
Happy weekend 
Have a nice day
Be close 
Be smart 
Happy birthday 
It was a nice day
Happy birthday mama

So the result should look like this:

Happy birthday  - freq 3 
Nice day - freq 2

It looks like what you need is to create bigrams and count the features. Here is one way to do it using quanteda:

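library(quanteda)
library(magrittr)  # provides %>%, in case your quanteda version does not export it
text <- c("Happy birthday to you ", "Happy weekend ", "Have a nice day",
          "Be close ", "Be smart ", "Happy birthday ", "It was a nice day",
          "Happy birthday mama")

text %>%
  tokens() %>%                                  # split each string into word tokens
  tokens_ngrams(n = 2, concatenator = " ") %>%  # join adjacent tokens into bigrams
  dfm() %>%                                     # document-feature matrix (lowercases by default)
  topfeatures()                                 # the 10 most frequent features by default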

## happy birthday         a nice       nice day    birthday to         to you       be smart 
##              3              2              2              1              1              1 
##  happy weekend         it was          was a         have a 
##              1              1              1              1 
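
The question also mentions three-word sequences and only lists sequences that repeat. A minimal sketch of one way to cover both, assuming the same `text` vector: `tokens_ngrams()` accepts a vector for `n`, and `featfreq()` returns the total count of each feature. The `min_freq` threshold is a name introduced here for illustration, not something from the question.

```r
library(quanteda)

text <- c("Happy birthday to you ", "Happy weekend ", "Have a nice day",
          "Be close ", "Be smart ", "Happy birthday ", "It was a nice day",
          "Happy birthday mama")

# Tokenize, then build 2-grams and 3-grams in a single call
toks <- tokens(text)
ngrams <- tokens_ngrams(toks, n = 2:3, concatenator = " ")

# Document-feature matrix; dfm() lowercases features by default
ngram_dfm <- dfm(ngrams)

# Total count of each n-gram across all documents
counts <- featfreq(ngram_dfm)

# Keep only sequences seen at least min_freq times
# (min_freq is an illustrative threshold, adjust as needed)
min_freq <- 2
repeated <- sort(counts[counts >= min_freq], decreasing = TRUE)
repeated
```

On this sample, "happy birthday" is kept with a count of 3, and "a nice", "nice day" and "a nice day" are kept with a count of 2 each. Note that "a nice" survives the filter even though the desired output in the question lists only "Nice day".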

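Here is what the pipeline does:

  1. Tokenizes the text
  2. Creates bigrams (joined with a single space)
  3. Creates the document-feature matrix (required by topfeatures)
  4. Computes the most frequent features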
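
To make the four steps concrete, here is a rough base R sketch of the same idea without quanteda. It is a simplification, not what quanteda does internally: it splits on whitespace only, and `table()` stands in for the document-feature matrix. The helper `make_bigrams` is a name made up for this example.

```r
text <- c("Happy birthday to you ", "Happy weekend ", "Have a nice day",
          "Be close ", "Be smart ", "Happy birthday ", "It was a nice day",
          "Happy birthday mama")

# 1. Tokenize: lowercase, trim, and split on runs of whitespace
#    (a simplification; quanteda's tokens() also handles punctuation and more)
toks <- strsplit(trimws(tolower(text)), "\\s+")

# 2. Build bigrams: pair each token with its successor, joined by one space
make_bigrams <- function(words) {
  if (length(words) < 2) return(character(0))
  paste(head(words, -1), tail(words, -1))
}
bigrams <- lapply(toks, make_bigrams)

# 3. and 4. Count every bigram across all documents and sort by frequency
#    (table() plays the role of the document-feature matrix here)
bigram_counts <- sort(table(unlist(bigrams)), decreasing = TRUE)
head(bigram_counts, 10)
```

For real text, the quanteda version is the safer choice, since whitespace splitting leaves punctuation attached to words.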