R: dtm with ngram tokenizer plus dictionary broken in Ubuntu?
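I am creating a document-term matrix with a dictionary and ngram tokenization. It works on my Windows 7 laptop, but not on a similarly configured Ubuntu 14.04.2 server. UPDATE: it also works on a CentOS server.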
library(tm)
library(RWeka)
library(SnowballC)
newBigramTokenizer = function(x) {
  # unigrams and bigrams from Weka's NGramTokenizer
  tokenizer1 = RWeka::NGramTokenizer(x, RWeka::Weka_control(min = 1, max = 2))
  # fall back to plain word tokens if no n-grams came back
  if (length(tokenizer1) != 0L) return(tokenizer1)
  RWeka::WordTokenizer(x)
}
textvect <- c("this is a story about a girl",
"this is a story about a boy",
"a boy and a girl went to the store",
"a store is a place to buy things",
"you can also buy things from a boy or a girl",
"the word store can also be a verb meaning to position something for later use")
textvect <- iconv(textvect, to = "utf-8")
textsource <- VectorSource(textvect)
textcorp <- Corpus(textsource)
textdict <- c("boy", "girl", "store", "story about")
textdict <- iconv(textdict, to = "utf-8")
# OK
dtm <- DocumentTermMatrix(textcorp, control=list(dictionary=textdict))
# OK on Windows laptop
# freezes or generates error on Ubuntu server
dtm <- DocumentTermMatrix(textcorp, control=list(tokenize=newBigramTokenizer,
dictionary=textdict))
Error from the Ubuntu server (at the last line of the source example):
/usr/lib/jvm/java-7-openjdk-amd64/jre/lib/rt.jar: invalid LOC header (bad signature)
Error in simple_triplet_matrix(i = i, j = j, v = as.numeric(v), nrow = length(allTerms), :
'i, j' invalid
In addition: Warning messages:
1: In mclapply(unname(content(x)), termFreq, control) :
scheduled core 1 encountered error in user code, all values of the job will be affected
2: In simple_triplet_matrix(i = i, j = j, v = as.numeric(v), nrow = length(allTerms), :
NAs introduced by coercion
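The first line of that output comes from the JVM rather than from R, so one way to narrow things down is to swap in a tokenizer that never calls Java. The following is a minimal base-R sketch, not RWeka's implementation: `baseNgramTokenizer` is a hypothetical helper that splits on whitespace only, so its tokens may differ from `NGramTokenizer`'s around punctuation. It also assumes tm lets `as.character()` coerce a document to its text.

```r
# Hypothetical helper (not part of tm or RWeka): unigrams + bigrams in base R.
baseNgramTokenizer <- function(x, n_min = 1L, n_max = 2L) {
  # Assumption: as.character() returns the document text.
  words <- unlist(strsplit(as.character(x), "[[:space:]]+"))
  words <- words[nzchar(words)]
  out <- character(0)
  for (n in seq.int(n_min, n_max)) {
    if (length(words) < n) next  # not enough words for this n-gram size
    starts <- seq_len(length(words) - n + 1L)
    grams <- vapply(
      starts,
      function(i) paste(words[i:(i + n - 1L)], collapse = " "),
      character(1)
    )
    out <- c(out, grams)
  }
  out
}

print(baseNgramTokenizer("this is a story about a girl"))
```

If `DocumentTermMatrix(textcorp, control = list(tokenize = baseNgramTokenizer, dictionary = textdict))` succeeds on the Ubuntu server, the fault is more likely in the Java layer than in tm or the dictionary.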
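I have already tried some of the suggestions in "Twitter Data Analysis - Error in Term Document Matrix" and "Error in simple_triplet_matrix -- unable to use RWeka to count Phrases".
I had originally thought my problem could be attributed to one of the following, but the script now runs on a CentOS server with the same locale and JVM as the problematic Ubuntu server.
- locale
- minor differences in the JVM
- the parallel library? mclapply is mentioned in the error message, and parallel is listed in the session info (for all systems, though).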
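One way to test the third suspicion is to take forking out of the picture. The warning shows tm calling `mclapply()`, which reads its default worker count from the `mc.cores` option, so setting that option to 1 makes it run serially in the current process. The sketch below only demonstrates that mechanism with base R. Whether it cures the RWeka failure on the Ubuntu server is an assumption to test, not a confirmed fix.

```r
library(parallel)

# Force mclapply() to run serially; keep the old setting so it can be restored.
old_opts <- options(mc.cores = 1L)

# With mc.cores = 1 no child process is forked; this behaves like lapply().
squares <- mclapply(1:3, function(i) i^2)
print(unlist(squares))

# Assumption: tm's internal mclapply() call does not pass mc.cores itself,
# so the failing DocumentTermMatrix() call could be retried at this point,
# before the option is restored.
options(old_opts)
```

If the bigram call completes with `mc.cores` set to 1, that points at the interaction between forked R processes and the JVM rather than at the locale.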
Here are the three environments:
R version 3.1.2 (2014-10-31)
Platform: x86_64-w64-mingw32/x64 (64-bit)
PS C:\> java -version
Picked up JAVA_TOOL_OPTIONS: -Dfile.encoding=UTF-8
java version "1.7.0_72"
Java(TM) SE Runtime Environment (build 1.7.0_72-b14)
Java HotSpot(TM) 64-Bit Server VM (build 24.72-b04, mixed mode)
locale:
[1] LC_COLLATE=English_United States.1252 LC_CTYPE=English_United States.1252
[3] LC_MONETARY=English_United States.1252 LC_NUMERIC=C
[5] LC_TIME=English_United States.1252
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] RWeka_0.4-23 tm_0.6 NLP_0.1-5
loaded via a namespace (and not attached):
[1] grid_3.1.2 parallel_3.1.2 rJava_0.9-6 RWekajars_3.7.11-1 slam_0.1-32
[6] tools_3.1.2
R version 3.1.2 (2014-10-31)
Platform: x86_64-pc-linux-gnu (64-bit)
$ java -version
java version "1.7.0_79"
OpenJDK Runtime Environment (IcedTea 2.5.5) (7u79-2.5.5-0ubuntu0.14.04.2)
OpenJDK 64-Bit Server VM (build 24.79-b02, mixed mode)
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C LC_TIME=en_US.UTF-8
[4] LC_COLLATE=en_US.UTF-8 LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8 LC_NAME=en_US.UTF-8 LC_ADDRESS=en_US.UTF-8
[10] LC_TELEPHONE=en_US.UTF-8 LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=en_US.UTF-8
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] RWeka_0.4-23 tm_0.6 NLP_0.1-5
loaded via a namespace (and not attached):
[1] grid_3.1.2 parallel_3.1.2 rJava_0.9-6 RWekajars_3.7.11-1 slam_0.1-32
[6] tools_3.1.2
R version 3.2.0 (2015-04-16)
Platform: x86_64-redhat-linux-gnu (64-bit)
Running under: CentOS Linux 7 (Core)
$ java -version
java version "1.7.0_79"
OpenJDK Runtime Environment (rhel-2.5.5.1.el7_1-x86_64 u79-b14)
OpenJDK 64-Bit Server VM (build 24.79-b02, mixed mode)
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8 LC_NAME=en_US.UTF-8
[9] LC_ADDRESS=en_US.UTF-8 LC_TELEPHONE=en_US.UTF-8
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=en_US.UTF-8
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] RWeka_0.4-24 tm_0.6-2 NLP_0.1-8
loaded via a namespace (and not attached):
[1] parallel_3.2.0 tools_3.2.0 slam_0.1-32 grid_3.2.0
[5] rJava_0.9-6 RWekajars_3.7.12-1
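If you would prefer something simpler but just as flexible and powerful, how about trying the quanteda package? It can handle your dictionary and bigram task in three lines: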
# install.packages("quanteda")
# or: devtools::install_github("kbenoit/quanteda")
require(quanteda)
# use dictionary() to construct dictionary from named list
textdict <- dictionary(list(mydict = c("boy", "girl", "store", "story about")))
# convert to document-feature matrix, with 1grams + 2grams, apply dictionary
dfm(textvect, dictionary = textdict, ngrams = 1:2, concatenator = " ")
## Document-feature matrix of: 6 documents, 1 feature.
## 6 x 1 sparse Matrix of class "dfmSparse"
##        features
## docs    mydict
##   text1      2
##   text2      2
##   text3      3
##   text4      1
##   text5      2
##   text6      1
# alternative is to consider the dictionary as a thesaurus of synonyms,
# not exclusive in feature selection as is a dictionary
dfm.all <- dfm(textvect, thesaurus = textdict,
ngrams = 1:2, concatenator = " ", verbose = FALSE)
topfeatures(dfm.all)
##  a MYDICT a boy a girl is is a to a story about about a
## 11     11     3      3  3    3  3       2     2       2
dfm_sort(dfm.all)[1:6, 1:12]
## Document-feature matrix of: 6 documents, 12 features.
## 6 x 12 sparse Matrix of class "dfmSparse"
##        features
## docs    a MYDICT a boy a girl is is a to a story about about a also buy
##   text1 2      2     0      1  1    1  0       1     1       1    0   0
##   text2 2      2     1      0  1    1  0       1     1       1    0   0
##   text3 2      3     1      1  0    0  1       0     0       0    0   0
##   text4 2      1     0      0  1    1  1       0     0       0    0   1
##   text5 2      2     1      1  0    0  0       0     0       0    1   1
##   text6 1      1     0      0  0    0  1       0     0       0    1   0