Words that always occur together in R
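I am using R and my dataset has a text column. Is there a way to find out which words always appear together, for example the most frequent combinations of two words, three words, and so on?
For example: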
Happy birthday to you
Happy weekend
Have a nice day
Be close
Be smart
Happy birthday
It was a nice day
Happy birthday mama
So the result should look like this:
Happy birthday - freq 3
Nice day - freq 2
It looks like what you need is to create bigrams and count the features. Here is one way to do it using quanteda.
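library(quanteda)
library(magrittr)  # provides %>% in case your quanteda version does not re-export it

text <- c("Happy birthday to you ", "Happy weekend ", "Have a nice day",
          "Be close ", "Be smart ", "Happy birthday ", "It was a nice day",
          "Happy birthday mama")

text %>%
  tokens() %>%                                  # split each string into word tokens
  tokens_ngrams(n = 2, concatenator = " ") %>%  # adjacent word pairs, joined by a space
  dfm() %>%                                     # document-feature matrix (lowercases by default)
  topfeatures()                                 # the 10 most frequent features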
## happy birthday         a nice       nice day    birthday to         to you       be smart
##              3              2              2              1              1              1
##  happy weekend         it was          was a         have a
##              1              1              1              1
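What this does:
- Tokenizes the text
- Creates bigrams (joined with a single white space)
- Creates a document-feature matrix (required by topfeatures)
- Computes the most frequent features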
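
The question also asks about three-word combinations. As a rough illustration of what the steps above amount to, here is a minimal base R sketch. It is not the quanteda approach from the answer: `count_ngrams` is a hypothetical helper written only for this example, and it splits on whitespace, so punctuation would stay attached to words, unlike with quanteda's tokenizer.

```r
# Hypothetical helper (not part of quanteda): count word n-grams in base R.
count_ngrams <- function(text, n = 2) {
  # Lowercase, trim, and split each document on runs of whitespace
  words_by_doc <- strsplit(tolower(trimws(text)), "\\s+")
  grams <- unlist(lapply(words_by_doc, function(w) {
    # Documents shorter than n words contribute no n-grams
    if (length(w) < n) return(character(0))
    # Join every run of n adjacent words within the same document
    starts <- seq_len(length(w) - n + 1)
    vapply(starts,
           function(i) paste(w[i:(i + n - 1)], collapse = " "),
           character(1))
  }))
  # Frequency table, most frequent n-grams first
  sort(table(grams), decreasing = TRUE)
}

text <- c("Happy birthday to you ", "Happy weekend ", "Have a nice day",
          "Be close ", "Be smart ", "Happy birthday ", "It was a nice day",
          "Happy birthday mama")

bigrams  <- count_ngrams(text, n = 2)
trigrams <- count_ngrams(text, n = 3)

head(bigrams, 3)   # the three most frequent word pairs
head(trigrams, 3)  # the three most frequent word triples
```

On the sample data this counts "happy birthday" three times and "nice day" twice, in line with the result requested in the question. With quanteda itself, passing `n = 2:3` to `tokens_ngrams()` should produce bigrams and trigrams in a single pass.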