Creating document-feature matrix takes very long in R
I am trying to create a document-feature matrix of character-level bigrams in R. The last line of my code runs forever and never finishes; every other line takes under a minute. I am not sure what to do, and any suggestions would be greatly appreciated.
Code:
library(quanteda)

# Tokenise the corpus by characters
character_level_tokens = quanteda::tokens(corpus,
                                          what = "character",
                                          remove_punct = TRUE,
                                          remove_symbols = TRUE,
                                          remove_numbers = TRUE,
                                          remove_url = TRUE,
                                          remove_separators = TRUE,
                                          split_hyphens = TRUE)

# Convert tokens to a character vector
character_level_tokens = as.character(character_level_tokens)

# Keep only the letters A-Z and a-z
character_level_tokens = gsub("[^A-Za-z]", "", character_level_tokens)

# Extract character-level bigrams
final_data_char_bigram = char_ngrams(character_level_tokens, n = 2L, concatenator = "")

# Create the document-feature matrix (DFM)
dfm.final_data_char_bigram = dfm(final_data_char_bigram)

length(final_data_char_bigram)
[1] 37115571
head(final_data_char_bigram)
[1] "lo" "ov" "ve" "el" "ly" "yt"
I don't have your input corpus or a reproducible example, but here is how to get the result you are after, and I would be very surprised if it did not also work on your large corpus. The first method uses selection and ngram construction from quanteda, while the second uses the character shingle tokenizer from the tokenizers package.
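A likely reason your last line never finishes is the as.character() step: it flattens the tokens object into one long character vector, discarding the document boundaries, so dfm() then treats each of your 37,115,571 bigram strings as its own one-token document. Below is a minimal sketch of that effect, assuming quanteda 2.x behaviour (where dfm() still accepts a character vector); the toy objects are invented for illustration.

library(quanteda)

## Toy two-document input (illustrative names)
toy <- c(d1 = "Lovely time!", d2 = "Nice day.")
toks <- tokens(toy, what = "character", remove_punct = TRUE)

## as.character() flattens the tokens into ONE vector, losing document boundaries
flat <- as.character(toks)
length(flat)      ## one element per character across ALL documents

## dfm() on a character vector treats each element as a separate document,
## so the original code effectively asks for a ~37-million-document dfm
ndoc(dfm(flat))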
library("quanteda")
## Package version: 2.0.1
dfm.final_data_char_bigram <- data_corpus_inaugural %>%
tokens(what = "character") %>%
tokens_keep("[A-Za-z]", valuetype = "regex") %>%
tokens_ngrams(n = 2, concatenator = "") %>%
dfm()
dfm.final_data_char_bigram
## Document-feature matrix of: 58 documents, 545 features (26.4% sparse) and 4 docvars.
## features
## docs fe el ll lo ow wc ci it ti iz
## 1789-Washington 20 31 34 12 15 3 29 85 118 5
## 1793-Washington 1 1 7 1 4 1 2 8 12 1
## 1797-Adams 24 52 44 25 24 3 23 160 214 7
## 1801-Jefferson 34 49 60 35 31 7 34 91 116 8
## 1805-Jefferson 26 57 64 27 37 8 34 130 163 11
## 1809-Madison 11 29 37 15 17 1 21 62 82 3
## [ reached max_ndoc ... 52 more documents, reached max_nfeat ... 535 more features ]
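As a quick sanity check on the result, quanteda's topfeatures() lists the highest-frequency bigrams; the cutoff of 10 below is an arbitrary example.

topfeatures(dfm.final_data_char_bigram, 10)   ## named counts of the 10 most frequent bigrams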
# another way
dfm.final_data_char_bigram2 <- data_corpus_inaugural %>%
  tokenizers::tokenize_character_shingles(n = 2) %>%
  as.tokens() %>%
  dfm()

dfm.final_data_char_bigram2
## Document-feature matrix of: 58 documents, 701 features (41.9% sparse).
##                  features
## docs              fe el ll lo ow wc ci  it  ti iz
##   1789-Washington 20 31 34 12 15  3 29  85 118  5
##   1793-Washington  1  1  7  1  4  1  2   8  12  1
##   1797-Adams      24 52 44 25 24  3 23 160 214  7
##   1801-Jefferson   34 49 60 35 31  7 34  91 116  8
##   1805-Jefferson   26 57 64 27 37  8 34 130 163 11
##   1809-Madison     11 29 37 15 17  1 21  62  82  3
## [ reached max_ndoc ... 52 more documents, reached max_nfeat ... 691 more features ]
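The two matrices differ in feature count (545 vs. 701), most likely because tokenize_character_shingles() keeps digits by default (its strip_non_alphanum argument strips only punctuation and whitespace), whereas the tokens_keep() pattern above retains letters only. On a corpus the size of yours it may also help to drop very rare bigrams afterwards; a sketch using dfm_trim(), where the threshold of 5 is an arbitrary example:

dfm_trim(dfm.final_data_char_bigram, min_termfreq = 5)   ## keep bigrams that occur at least 5 times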