tm_map and stopwords fail to remove unwanted words from a corpus created in R
I have a result data frame with the following data:
            word        freq
credit      credit       790
account     account      451
xxxxxxxx    xxxxxxxx     430
report      report       405
information information  368
reporting   reporting    345
consumer    consumer     331
accounts    accounts     300
debt        debt         170
company     company      152
xxxxxx      xxxxxx       147
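I want to do the following:
- Remove all words made up of repeated x's, such as xx, xxx, xxxx and so on (two or more x's). These words can appear in lower or upper case, so the text has to be converted to lower case before they are removed.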
I am using tm_map to remove stopwords, but it does not seem to work: the unwanted words shown above still end up in the data frame.
myCorpus <- Corpus(VectorSource(df$txt))
myStopwords <- c(stopwords('english'), "xxx", "xxxx", "xxxxx",
                 "XXX", "XXXX", "XXXXX", "xxxx", "xxx", "xx", "xxxxxxxx",
                 "xxxxxxxx", "XXXXXX", "xxxxxx", "XXXXXXX", "xxxxxxx", "XXXXXXXX", "xxxxxxxx")
myCorpus <- tm_map(myCorpus, tolower)
myCorpus <- tm_map(myCorpus, removePunctuation)
myCorpus <- tm_map(myCorpus, removeNumbers)
myCorpus <- tm_map(myCorpus, removeWords, myStopwords)
myTdm <- as.matrix(TermDocumentMatrix(myCorpus))
v <- sort(rowSums(myTdm), decreasing = TRUE)
FreqMat <- data.frame(word = names(v), freq = v, row.names = F)
head(FreqMat, 10)
The code above does not remove the unwanted words from the corpus for me.
Is there another way to solve this?
One possibility involving dplyr and stringr:
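library(dplyr)
library(stringr)

# Lower-case the words, then keep only those containing at most one "x"
df %>%
  mutate(word = tolower(word)) %>%
  filter(str_count(word, fixed("x")) <= 1)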
word freq
1 credit 790
2 account 451
3 report 405
4 information 368
5 reporting 345
6 consumer 331
7 accounts 300
8 debt 170
9 company 152
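Note that counting every x also drops ordinary words that happen to contain two or more x's. A stricter variant, given here only as a sketch, drops a row only when the whole word consists of x's. The sample data frame is a subset rebuilt from the question with one masked token upper-cased for illustration, and the anchored pattern `^x{2,}$` is my assumption about what counts as a placeholder.

```r
library(dplyr)
library(stringr)

# Subset of the question's data; "XXXXXX" is upper-cased here on purpose
# to show that the lower-casing step matters (illustrative change).
df <- data.frame(
  word = c("credit", "account", "xxxxxxxx", "report", "XXXXXX", "debt"),
  freq = c(790, 451, 430, 405, 147, 170),
  stringsAsFactors = FALSE
)

cleaned <- df %>%
  mutate(word = tolower(word)) %>%
  # drop only the words that are made up entirely of two or more x's
  filter(!str_detect(word, "^x{2,}$"))

cleaned
```

With this pattern a word that merely contains several x's is kept, since only tokens that are nothing but x's match.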
Or a base R possibility using similar logic:
df[sapply(df[, 1],
          function(x) length(grepRaw("x", tolower(x), all = TRUE, fixed = TRUE)) <= 1,
          USE.NAMES = FALSE), ]
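Since the original goal was to keep the placeholders out of the corpus itself, another option might be to strip them from the raw text in base R before calling Corpus(). The sketch below assumes the masked tokens are whole words made only of x's. The vector `txt` is hypothetical sample text standing in for df$txt, and `clean_txt` is a name introduced for illustration.

```r
# Hypothetical sample text standing in for df$txt
txt <- c("My credit report shows XXXX account opened on XX/XX/XXXX",
         "xxxxxxxx reported the debt to the company")

# Lower-case first, then delete whole tokens made of two or more x's
clean_txt <- gsub("\\bx{2,}\\b", "", tolower(txt), perl = TRUE)

# Collapse the whitespace left behind by the removed tokens
clean_txt <- trimws(gsub("\\s+", " ", clean_txt))

clean_txt
```

The cleaned vector could then be passed to VectorSource() in place of df$txt. Leftover punctuation such as the slashes from masked dates would still need a step like removePunctuation.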