R tm package: select a huge number of words to keep in a text corpus
I have about 70,000 frequent_words that I want to keep in the text corpus, in the same order in which they appear (the order matters). I got them like this:
dtm <- DocumentTermMatrix(txt_corpus, control = list(wordLengths=c(1, Inf)))
frequent_words <- findFreqTerms(dtm, lowfreq=50)
Doing:
dtm <- DocumentTermMatrix(txt_corpus, control = list(wordLengths=c(1, Inf)))
dtm <- removeSparseTerms(dtm, 0.8)
does not work, because I need the same filtered text_corpus twice:
dtm <- DocumentTermMatrix(txt_corpus, control = list(wordLengths=c(1, Inf)))
BigramTokenizer <- function(x) unlist(lapply(ngrams(words(x), 2), paste, collapse = " "), use.names = FALSE)
bidtm <- DocumentTermMatrix(txt_corpus, control = list(tokenize = BigramTokenizer))
I tried the following code:
keepWords <- content_transformer(function(x, words) {
  regmatches(x,
             gregexpr(paste0("(\\b", paste(words, collapse = "\\b|\\b"), "\\b)"),
                      x, perl = T, ignore.case = T, useBytes = T),
             invert = T) <- " "
  return(x)
})
txt_corpus <- tm_map(txt_corpus, keepWords, frequent_words)
When I run it, I get this error:
Error in gregexpr(paste0("(\\b", paste(words, collapse = "\\b|\\b"), "\\b)"), :
assertion 'tree->num_tags == num_tags' failed in executing regexp: file 'tre-compile.c', line 634
Calls: preprocess ... tm_parLapply -> lapply -> FUN -> FUN -> regmatches<- -> gregexpr
Execution halted
This is caused by the regular expression being too long. Removing the non-frequent words instead is not an option, since length(less_frequent_words) > 1,000,000 and it takes far too long:
chunk <- 500
n <- length(less_frequent_words)
r <- rep(1:ceiling(n/chunk),each=chunk)[1:n]
d <- split(less_frequent_words, r)
for (i in 1:length(d)) {
txt_corpus <- tm_map(txt_corpus, removeWords, c(paste(d[[i]])))
}
I also tried something with joining, but it gives me a distinct text corpus in each iteration:
chunk <- 500
n <- length(frequent_words)
r <- rep(1:ceiling(n/chunk),each=chunk)[1:n]
d <- split(frequent_words, r)
joined_txt_corpus <- VCorpus(VectorSource(list()))
for (i in 1:length(d)) {
new_corpus <- tm_map(txt_corpus, keepWords, c(paste(d[[i]])))
joined_txt_corpus <- c(joined_txt_corpus, new_corpus)
txt_corpus <- tm_map(txt_corpus, removeWords, c(paste(d[[i]])))
}
txt_corpus <- joined_txt_corpus
Is there an efficient way to select a large number of words, the way text_corpus <- tm_map(txt_corpus, keepWords, frequent_words) would? Any help and hints are appreciated! Thanks!
Reproducible example:
library(tm)
data(crude)
txt_corpus <- crude
txt_corpus <- tm_map(txt_corpus, content_transformer(tolower))
txt_corpus <- tm_map(txt_corpus, removePunctuation)
txt_corpus <- tm_map(txt_corpus, stripWhitespace)
article_words <- c("a", "an", "the")
txt_corpus <- tm_map(txt_corpus, removeWords, article_words)
txt_corpus <- tm_map(txt_corpus, removeNumbers)
dtm <- DocumentTermMatrix(txt_corpus, control = list(wordLengths=c(1, Inf)))
frequent_words <- findFreqTerms(dtm, lowfreq=80)
dtm <- DocumentTermMatrix(txt_corpus, control = list(wordLengths=c(1, Inf), dictionary=frequent_words))
# Use many words just using frequent_words once works
# frequent_words <- c(frequent_words, frequent_words, frequent_words, frequent_words)
# keepWords function
keepWords <- content_transformer(function(x, words) {
  regmatches(x,
             gregexpr(paste0("(\\b", paste(words, collapse = "\\b|\\b"), "\\b)"),
                      x, perl = T, ignore.case = T),
             invert = T) <- " "
  return(x)
})
txt_corpus <- tm_map(txt_corpus, keepWords, frequent_words)
# Get bigram from text_corpus
BigramTokenizer <- function(x) unlist(lapply(ngrams(words(x), 2), paste, collapse = " "), use.names = FALSE)
bidtm <- DocumentTermMatrix(txt_corpus, control = list(tokenize = BigramTokenizer))
bidtmm <- slam::col_sums(bidtm)  # col_sums comes from the slam package
bidtmm <- as.matrix(bidtmm)
print(bidtmm)
Output:
[,1]
in in 14
in of 21
in oil 19
in to 28
of in 21
of of 20
of oil 20
of to 29
oil in 18
oil of 18
oil oil 13
oil to 33
to in 32
to of 35
to oil 21
to to 41
I looked at your requirements, and perhaps a combination of tm and quanteda could help. See below.
Once you have the list of frequent words, you can use quanteda in parallel to get the bigrams.
library(quanteda)
# set number of threads
quanteda_options(threads = 4)
my_corp <- corpus(crude) # corpus from tm can be used here (txt_corpus)
my_toks <- tokens(my_corp, remove_punct = TRUE) # add extra removal if needed
# Use list of frequent words from tm.
# speed gain should occur here
my_toks <- tokens_keep(my_toks, frequent_words)
# ngrams, concatenator is _ by default
bitoks <- tokens_ngrams(my_toks)
textstat_frequency(dfm(bitoks)) # ordered from high to low
feature frequency rank docfreq group
1 to_to 41 1 12 all
2 to_of 35 2 15 all
3 oil_to 33 3 17 all
4 to_in 32 4 12 all
5 of_to 29 5 14 all
6 in_to 28 6 11 all
7 in_of 21 7 8 all
8 to_oil 21 7 13 all
9 of_in 21 7 10 all
10 of_oil 20 10 14 all
11 of_of 20 10 8 all
12 in_oil 19 12 10 all
13 oil_in 18 13 11 all
14 oil_of 18 13 11 all
15 in_in 14 15 9 all
16 oil_oil 13 16 10 all
quanteda does have a topfeatures function, but it does not work like findFreqTerms. Otherwise you could do everything in quanteda.
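To give an idea of what a quanteda-only route could look like (just an untested sketch: dfm_trim() with min_termfreq is used here as a stand-in for findFreqTerms(lowfreq = ...), and the names my_dfm, frequent_words_q, freq_toks and bitoks_q are purely illustrative):
my_dfm <- dfm(my_toks)                                # unigram dfm from the tokens above
my_dfm <- dfm_trim(my_dfm, min_termfreq = 80)         # keep terms occurring at least 80 times
frequent_words_q <- featnames(my_dfm)                 # quanteda equivalent of frequent_words
freq_toks <- tokens_keep(my_toks, frequent_words_q)   # keep only those words, order preserved
bitoks_q <- tokens_ngrams(freq_toks)                  # bigrams as before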
If generating the dfm takes up too much memory, you can transform the tokens object with as.character and work with it in dplyr or data.table. See the code below.
library(dplyr)
out_dp <- tibble(features = as.character(bitoks)) %>%
group_by(features) %>%
tally()
library(data.table)
out_dt <- data.table(features = as.character(bitoks))
out_dt <- out_dt[, .N, by = features]
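If you also want the high-to-low ordering that textstat_frequency() gives, a small follow-up (again just a sketch) is to sort the counts afterwards:
out_dp <- arrange(out_dp, desc(n))   # dplyr: order bigram counts from high to low
out_dt <- out_dt[order(-N)]          # data.table: same ordering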