Error creating vocabulary from big text file on disk

Question

I am trying to run the example from https://cran.r-project.org/web/packages/text2vec/vignettes/files-multicore.html, but with my own file "text": 3.7 GB of plain text, built from a Wikipedia XML dump with the Perl script from http://mattmahoney.net/dc/textdata.html

setwd("c:/rtest")
library(text2vec)
library(doParallel)
N_WORKERS = 2
registerDoParallel(N_WORKERS)
it_files_par = ifiles_parallel(file_paths = "text")
it_token_par = itoken_parallel(it_files_par, preprocessor = tolower, tokenizer = word_tokenizer)
vocab = create_vocabulary(it_token_par)

This results in an error:

Error in unserialize(socklist[[n]]) : error reading from connection

I have 8 GB of RAM, and a word2vec model was built from this same file without any errors.

Answer 1

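First of all, there is no point in using a parallel iterator on a single file: each file is processed in a separate R worker process, so with one file this will be even worse than plain itoken. It also involves sending each worker's result back to the master process, and here we see that the result is too large to send through the socket. Long story short: just use itoken, or split your file into several smaller files.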
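For the "split your file" option, here is a minimal base-R sketch. The helper name split_text_file, the out_dir argument and the default chunk size are all hypothetical and not part of text2vec. It assumes the file is ASCII or UTF-8, so that a space or newline byte is always a safe place to cut, and it cuts only at whitespace so that no word is broken across two files (which also covers the case where the whole dump is one very long line).

```r
# Hypothetical helper (not part of text2vec): split a large text file into
# pieces of roughly `chunk_bytes` bytes, cutting only at whitespace.
# Assumes an ASCII-compatible encoding such as UTF-8.
split_text_file = function(path, out_dir, chunk_bytes = 64 * 1024^2) {
  dir.create(out_dir, showWarnings = FALSE, recursive = TRUE)
  con = file(path, open = "rb")
  on.exit(close(con))
  carry = raw(0)              # bytes left over after the last cut point
  out_paths = character(0)
  next_path = function() {
    file.path(out_dir, sprintf("part_%05d.txt", length(out_paths) + 1L))
  }
  repeat {
    buf = readBin(con, what = "raw", n = chunk_bytes)
    if (length(buf) == 0L) break            # end of file
    buf = c(carry, buf)
    # positions of space (0x20) and newline (0x0a) bytes
    ws = which(buf == as.raw(0x20) | buf == as.raw(0x0a))
    if (length(ws) == 0L) {                 # no cut point yet, keep reading
      carry = buf
      next
    }
    cut = ws[length(ws)]                    # cut right after the last whitespace
    out = next_path()
    writeBin(buf[seq_len(cut)], out)
    out_paths = c(out_paths, out)
    carry = if (cut < length(buf)) buf[(cut + 1L):length(buf)] else raw(0)
  }
  if (length(carry) > 0L) {                 # trailing bytes without whitespace
    out = next_path()
    writeBin(carry, out)
    out_paths = c(out_paths, out)
  }
  out_paths
}
```

The returned paths could then be passed as file_paths to ifiles_parallel (or to ifiles for the single-process itoken route) in place of the single "text" file. With several smaller files each worker has its own piece to process, and what it sends back to the master should be correspondingly smaller, though I have not measured this on a 3.7 GB input.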


text2vec