Converting quanteda dfmSparse matrix->data.frame->h2o adds unwanted initial row of NaNs
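I have a 10025x1417 TF-IDF dfm matrix created with quanteda. (The actual class is dfmSparse, which is a subclass of dfm-matrix.)
When I convert it for h2o, using as.data.frame followed by as.h2o, I wrongly get 10026x1417, with an unwanted extra first row of NaNs.
For performance reasons I don't want to create a temporary df with the full dense matrix.
Code below (I can't reproduce this on small data):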
library(quanteda)
mat <- quanteda::weight(theDfm, type="tfidf")
# Convert to df then h2o, correctly gives 10025x1417 matrix
mat_df <- as.data.frame(mat) # this will dispatch quanteda::as.data.frame for dfmSparse
mat_h2o <- as.h2o(mat_df)
# Convert in one go, get 10026x1417, get unwanted extra first row of NaNs
bad_h2o <- as.h2o(as.data.frame(mat))
dim(bad_h2o )
[1] 10026 1417
# Which as.data.frame method this uses
> showMethods(quanteda::as.data.frame)
Function: as.data.frame (package base)
x="ANY"
x="dfm"
x="dfmSparse"
(inherited from: x="dfm")
x="matrix"
(inherited from: x="ANY")
#########################################
# Ken Benoit requested sessionInfo()
R version 3.2.3 (2015-12-10)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows >= 8 x64 (build 9200)
locale:
[1] LC_COLLATE=English_United States.1252 LC_CTYPE=English_United States.1252 LC_MONETARY=English_United States.1252
[4] LC_NUMERIC=C LC_TIME=English_United States.1252
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] h2o_3.8.3.3 statmod_1.4.22 quanteda_0.9.8 RevoUtilsMath_3.2.3
loaded via a namespace (and not attached):
[1] Rcpp_0.12.2 lattice_0.20-33 SnowballC_0.5.1 bitops_1.0-6 chron_2.3-47 grid_3.2.3 R6_2.1.1
[8] jsonlite_0.9.19 magrittr_1.5 httr_1.0.0 stringi_1.0-1 data.table_1.9.6 ca_0.58 Matrix_1.2-3
[15] tools_3.2.3 stringr_1.0.0 RCurl_1.95-4.7 parallel_3.2.3
> For performance reasons I don't want to create a temporary df with the full dense matrix.
Actually, quanteda converts the sparse matrix to a dense matrix before converting it to a data.frame:
https://github.com/kbenoit/quanteda/blob/master/R/dfm-classes.R#L513-L516
If you need to import a sparse matrix into h2o, convert it to SVMLight format and use importFile. See this thread: How to use H2o on feature hashed matrix in R
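As a rough sketch of the SVMLight route: the helper below, write_svmlight, is hypothetical (it is not part of quanteda or h2o) and writes a Matrix sparse matrix to an SVMLight text file without ever building the dense matrix. It assumes the dfm can be treated as, or coerced to, a Matrix sparse matrix, and it uses a placeholder label of 0 for every row because the format requires a label column. The toy matrix m stands in for the real TF-IDF dfm.

```r
library(Matrix)

# Hypothetical helper: write a sparse matrix as SVMLight text,
# one line per row: "<label> <col>:<value> <col>:<value> ..."
write_svmlight <- function(x, path, labels = rep(0, nrow(x))) {
  # Triplet form exposes 0-based row (@i) and column (@j) indices plus values (@x)
  x <- as(as(x, "CsparseMatrix"), "TsparseMatrix")
  o <- order(x@i, x@j)                       # sort by row, then by column
  feats <- paste0(x@j[o] + 1L, ":", x@x[o])  # SVMLight feature indices are 1-based
  # Group entries by row; empty rows are kept as empty groups
  by_row <- split(feats, factor(x@i[o] + 1L, levels = seq_len(nrow(x))))
  body <- vapply(by_row, paste, character(1), collapse = " ")
  writeLines(trimws(paste(labels, body)), path)
  invisible(path)
}

# Toy 3x3 sparse matrix standing in for the real dfm
m <- sparseMatrix(i = c(1, 1, 3), j = c(1, 3, 2),
                  x = c(0.5, 2, 1.5), dims = c(3, 3))
path <- tempfile(fileext = ".svm")
write_svmlight(m, path)
print(readLines(path))

# Importing needs a running H2O cluster, so it is left commented out.
# Assumption: the parser recognises the SVMLight file on import;
# check the documentation for your h2o version.
# library(h2o); h2o.init()
# mat_h2o <- h2o.importFile(path)
```

If this works for your version, it is worth checking dim(mat_h2o) after import, since the label column is part of the file and may show up as an extra column.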