R 中的文本挖掘:计算 2-3 个单词短语
Text Mining in R: Counting 2-3 word phrases
我在 Whosebug 中发现了一段非常有用的代码 - Finding 2 & 3 word Phrases Using R TM Package
(信用@patrick perry)显示语料库中 2 和 3 个词短语的频率:
library(corpus)
corpus <- gutenberg_corpus(55) # Project Gutenberg #55, _The Wizard of Oz_
text_filter(corpus)$drop_punct <- TRUE # ignore punctuation
term_stats(corpus, ngrams = 2:3)
## term count support
## 1 of the 336 1
## 2 the scarecrow 208 1
## 3 to the 185 1
## 4 and the 166 1
## 5 said the 152 1
## 6 in the 147 1
## 7 the lion 141 1
## 8 the tin 123 1
## 9 the tin woodman 114 1
## 10 tin woodman 114 1
## 11 i am 84 1
## 12 it was 69 1
## 13 in a 64 1
## 14 the great 63 1
## 15 the wicked 61 1
## 16 wicked witch 60 1
## 17 at the 59 1
## 18 the little 59 1
## 19 the wicked witch 58 1
## 20 back to 57 1
## ⋮ (52511 rows total)
如何确保 "the tin" 等短语的频率计数不包含在 "the tin woodman" 或 "tin woodman" 的频率计数中?
谢谢
删除停用词可以消除数据中的噪音,从而导致出现上述问题:
library(tm)
library(corpus)
library(dplyr)
corpus <- Corpus(VectorSource(gutenberg_corpus(55)))
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeWords, stopwords("english"))
term_stats(corpus, ngrams = 2:3) %>%
arrange(desc(count)) %>%
group_by(grp = str_extract(as.character(term), "\w+\s+\w+")) %>%
mutate(count_unique = ifelse(length(unique(count)) > 1, max(count) - min(count), count)) %>%
ungroup() %>%
select(-grp)
我在 Whosebug 中发现了一段非常有用的代码 - Finding 2 & 3 word Phrases Using R TM Package (信用@patrick perry)显示语料库中 2 和 3 个词短语的频率:
library(corpus)
corpus <- gutenberg_corpus(55) # Project Gutenberg #55, _The Wizard of Oz_
text_filter(corpus)$drop_punct <- TRUE # ignore punctuation
term_stats(corpus, ngrams = 2:3)
## term count support
## 1 of the 336 1
## 2 the scarecrow 208 1
## 3 to the 185 1
## 4 and the 166 1
## 5 said the 152 1
## 6 in the 147 1
## 7 the lion 141 1
## 8 the tin 123 1
## 9 the tin woodman 114 1
## 10 tin woodman 114 1
## 11 i am 84 1
## 12 it was 69 1
## 13 in a 64 1
## 14 the great 63 1
## 15 the wicked 61 1
## 16 wicked witch 60 1
## 17 at the 59 1
## 18 the little 59 1
## 19 the wicked witch 58 1
## 20 back to 57 1
## ⋮ (52511 rows total)
如何确保 "the tin" 等短语的频率计数不包含在 "the tin woodman" 或 "tin woodman" 的频率计数中?
谢谢
删除停用词可以消除数据中的噪音,从而导致出现上述问题:
library(tm)
library(corpus)
library(dplyr)
corpus <- Corpus(VectorSource(gutenberg_corpus(55)))
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeWords, stopwords("english"))
term_stats(corpus, ngrams = 2:3) %>%
arrange(desc(count)) %>%
group_by(grp = str_extract(as.character(term), "\w+\s+\w+")) %>%
mutate(count_unique = ifelse(length(unique(count)) > 1, max(count) - min(count), count)) %>%
ungroup() %>%
select(-grp)