Text Mining in R: Counting 2-3 word phrases

I found a very useful piece of code on Stack Overflow, in Finding 2 & 3 word Phrases Using R TM Package (credit @patrick perry), that shows the frequency of 2- and 3-word phrases in a corpus:

library(corpus)
corpus <- gutenberg_corpus(55) # Project Gutenberg #55, _The Wizard of Oz_
text_filter(corpus)$drop_punct <- TRUE # ignore punctuation
term_stats(corpus, ngrams = 2:3)
##    term             count support
## 1  of the             336       1
## 2  the scarecrow      208       1
## 3  to the             185       1
## 4  and the            166       1
## 5  said the           152       1
## 6  in the             147       1
## 7  the lion           141       1
## 8  the tin            123       1
## 9  the tin woodman    114       1
## 10 tin woodman        114       1
## 11 i am                84       1
## 12 it was              69       1
## 13 in a                64       1
## 14 the great           63       1
## 15 the wicked          61       1
## 16 wicked witch        60       1
## 17 at the              59       1
## 18 the little          59       1
## 19 the wicked witch    58       1
## 20 back to             57       1
## ⋮  (52511 rows total)

How can I make sure that the frequency count for a phrase like "the tin" does not also include the occurrences already counted under "the tin woodman" or "tin woodman"?

Thanks

Removing stop words can strip out the noise in the data that is causing issues like the one above:

library(tm)
library(corpus)
library(dplyr)
library(stringr) # provides str_extract()
corpus <- Corpus(VectorSource(gutenberg_corpus(55)))
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeWords, stopwords("english"))
term_stats(corpus, ngrams = 2:3) %>%
  arrange(desc(count)) %>%
  # group each term with the other terms that share its first two words
  group_by(grp = str_extract(as.character(term), "\\w+\\s+\\w+")) %>%
  # if the counts in a group differ, keep the largest minus the smallest
  mutate(count_unique = ifelse(length(unique(count)) > 1, max(count) - min(count), count)) %>%
  ungroup() %>%
  select(-grp)
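
To make the subtraction idea easier to check by hand, here is a minimal base R sketch on a toy table. The six counts are copied from the term_stats() output in the question. The names stats, prefix_total, count_in_trigrams and count_standalone are made up for this illustration and are not part of the corpus or tm packages. It assumes the trigram rows have already been narrowed down to the phrases you care about: if every trigram in the corpus is kept, the trigram counts for a given prefix add up to nearly the whole bigram count and almost nothing is left over. Only prefix overlaps are handled, in line with the grouping on the first two words above, so "tin woodman" inside "the tin woodman" is left alone.

```r
# Toy counts copied from the term_stats() output in the question.
stats <- data.frame(
  term  = c("the tin", "the tin woodman", "tin woodman",
            "the wicked", "wicked witch", "the wicked witch"),
  count = c(123, 114, 114, 61, 60, 58),
  stringsAsFactors = FALSE
)

# Number of words in each term: 2 for bigrams, 3 for trigrams.
n_words  <- lengths(strsplit(stats$term, " ", fixed = TRUE))
bigrams  <- stats[n_words == 2, ]
trigrams <- stats[n_words == 3, ]

# Hypothetical helper: total count of the trigrams that start with a bigram.
prefix_total <- function(bigram) {
  sum(trigrams$count[startsWith(trigrams$term, paste0(bigram, " "))])
}

bigrams$count_in_trigrams <- vapply(bigrams$term, prefix_total, numeric(1),
                                    USE.NAMES = FALSE)

# Occurrences of the bigram that are not the start of a listed trigram.
bigrams$count_standalone <- bigrams$count - bigrams$count_in_trigrams
print(bigrams)
```

With these numbers "the tin" drops from 123 to 9 and "the wicked" from 61 to 3, while "tin woodman" and "wicked witch" keep their counts because no listed trigram starts with them.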