Quanteda warning: number of columns of result is not a multiple of vector length (arg 2030)
When trying to parse more than 7,000 txt files in R with the readtext library (which comes with the quanteda library), I get the following warning.
Warning message: In (function (..., deparse.level = 1) : number of
columns of result is not a multiple of vector length (arg 2030)
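How can I find out which txt file causes the warning?
The verbose option does not show where the warning occurs. For your information, when I try to parse two files I get the following (btw, if I parse only 1 document at a time, the warning does not appear).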
Reading texts from
/Users/OS/surfdrive/Competenties/Data-step-1/BinnenlandsBestuur/1982/9-12/Office
Lens 20170308-102311.jpg.txtReading texts from
/Users/OS/surfdrive/Competenties/Data-step-1/BinnenlandsBestuur/1983/Office
Lens 20170308-103518.jpg.txt, using glob pattern ... reading (txt)
file: Office Lens 20170308-102311.jpg.txt , using glob pattern ...
reading (txt) file: Office Lens 20170308-103518.jpg.txt read 2
documents. Warning messages: 1: In (function (..., deparse.level = 1)
: number of columns of result is not a multiple of vector length
(arg 2) 2: In if (verbosity == 2 & nchar(msg) > 70) pad <-
paste0("\n", pad) : the condition has length > 1 and only the first
element will be used
Session info
R version 3.4.0 (2017-04-21)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: macOS Sierra 10.12.5
Matrix products: default
BLAS: /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/3.4/Resources/lib/libRlapack.dylib
locale:
[1] C/C/C/C/C/en_US.UTF-8
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] tm.plugin.webmining_1.3 XML_3.98-1.7 readtext_0.50 RoogleVision_0.0.1.1
[5] outliers_0.14 stringdist_0.9.4.4 ltm_1.0-0 polycor_0.7-9
[9] msm_1.6.4 MASS_7.3-47 psych_1.7.5 WriteXLS_4.0.0
[13] plyr_1.8.4 metafor_2.0-0 Matrix_1.2-9 metaSEM_0.9.14
[17] OpenMx_2.7.12 xlsx_0.5.7 xlsxjars_0.6.1 rJava_0.9-8
[21] readxl_1.0.0 quanteda_0.9.9-65 koRpus.lang.nl_0.01-3 koRpus_0.11-1
[25] sylly_0.1-1 jsonlite_1.5 httr_1.2.1
loaded via a namespace (and not attached):
[1] sylly.ru_0.1-1 splines_3.4.0 ellipse_0.3-8 RcppParallel_4.3.20 shiny_1.0.3
[6] sylly.it_0.1-1 expm_0.999-2 sylly.es_0.1-1 cellranger_1.1.0 slam_0.1-40
[11] yaml_2.1.14 backports_1.1.0 lattice_0.20-35 digest_0.6.12 googleAuthR_0.5.1
[16] colorspace_1.3-2 htmltools_0.3.6 httpuv_1.3.3 tm_0.7-1 devtools_1.13.2
[21] xtable_1.8-2 mvtnorm_1.0-6 scales_0.4.1 tibble_1.3.3 openssl_0.9.6
[26] ggplot2_2.2.1 withr_1.0.2 lazyeval_0.2.0 NLP_0.1-10 mnormt_1.5-5
[31] RJSONIO_1.3-0 survival_2.41-3 magrittr_1.5 mime_0.5 memoise_1.1.0
[36] evaluate_0.10 boilerpipeR_1.3 nlme_3.1-131 foreign_0.8-67 rsconnect_0.8
[41] tools_3.4.0 data.table_1.10.4 stringr_1.2.0 munsell_0.4.3 compiler_3.4.0
[46] rlang_0.1.1 grid_3.4.0 RCurl_1.95-4.8 bitops_1.0-6 rmarkdown_1.5
[51] gtable_0.2.0 curl_2.6 R6_2.2.2 sylly.en_0.1-1 knitr_1.16
[56] fastmatch_1.1-0 sylly.fr_0.1-1 rprojroot_1.2 stringi_1.1.5 parallel_3.4.0
[61] sylly.de_0.1-1 Rcpp_0.12.11
Thanks,
Peter
PS. If this information is not sufficient, I will post a reproducible example on the github page.
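You can use purrr to find the files whose columns do not match what you expect.
First, let's create some demo data in which one file does not have the same column names as the other three...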
library(tidyverse)
library(purrr)
library(stringr)
old_wd <- getwd()
setwd(tempdir())
demo_data <- tibble(x = rnorm(327),
                    y = rnorm(327),
                    z = rnorm(327))
write_csv(demo_data, "demo1.csv")
write_csv(demo_data, "demo2.csv")
write_csv(demo_data, "demo3.csv")
bad_data <-
  tibble(
    x = rnorm(327),
    y = rnorm(327),
    z = rnorm(327),
    extra_column = rnorm(327)
  )
write_csv(bad_data, "demo4.csv")
Now define what the column names should be. For this example, the correct names are x, y, and z.
correct_names <- c("x", "y", "z")
This function reads a csv and checks whether all of its column names match the ones in correct_names.
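get_csv_names <- function(path){
  # identical() also returns FALSE when the number of columns differs,
  # whereas == would recycle the shorter vector and emit a warning
  c(path, identical(names(read_csv(path)), correct_names))
}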
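To illustrate why the way the two name vectors are compared matters, here is a minimal sketch. The vector found_names is a hypothetical header invented for this illustration; it is not taken from the files in the question.

```r
correct_names <- c("x", "y", "z")
# Hypothetical header with a fourth column that repeats the name "x"
found_names <- c("x", "y", "z", "x")

# `==` recycles the shorter vector (with a warning), so the fourth
# element is compared against "x" again and the check passes by accident
elementwise <- suppressWarnings(all(found_names == correct_names))

# identical() compares the two vectors as a whole, including their length
whole <- identical(found_names, correct_names)

print(c(elementwise = elementwise, whole = whole))
```

With vectors of different lengths, the element-wise comparison can report a match that is not there, so a whole-vector comparison seems like the safer default here.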
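I'm assuming you want to process all the csv files in the working directory. Otherwise you'll want to change the value of files from what I have below...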
# keep only names ending in .csv (the dot is escaped so it matches literally)
files <- list.files(pattern = "\\.csv$")
Now just map files over the function get_csv_names. Note that demo4.csv has the value FALSE, which means its column names do not match the ones you specified in correct_names...
map(files, get_csv_names)
# [[1]]
# [1] "demo1.csv" "TRUE"
#
# [[2]]
# [1] "demo2.csv" "TRUE"
#
# [[3]]
# [1] "demo3.csv" "TRUE"
#
# [[4]]
# [1] "demo4.csv" "FALSE"
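With thousands of files, scanning a long list for FALSE is tedious. Below is a minimal sketch of a variant that returns only the offending file names. It assumes purrr is installed; the helper has_correct_names, the directory csv_name_check, and the files good.csv and bad.csv are hypothetical names made up for this sketch, and it reads only the header row with base read.csv rather than the whole file.

```r
library(purrr)

# Hypothetical demo files in a fresh sub-directory of the temp folder
demo_dir <- file.path(tempdir(), "csv_name_check")
dir.create(demo_dir, showWarnings = FALSE)
write.csv(data.frame(x = 1:3, y = 4:6, z = 7:9),
          file.path(demo_dir, "good.csv"), row.names = FALSE)
write.csv(data.frame(x = 1:3, y = 4:6, z = 7:9, extra_column = 0),
          file.path(demo_dir, "bad.csv"), row.names = FALSE)

correct_names <- c("x", "y", "z")

# Read the header plus one data row and compare the names as a whole
has_correct_names <- function(path) {
  header <- names(read.csv(path, nrows = 1, check.names = FALSE))
  identical(header, correct_names)
}

files <- list.files(demo_dir, pattern = "\\.csv$", full.names = TRUE)

# map_lgl() returns a plain logical vector, one element per file
ok <- map_lgl(files, has_correct_names)

# Keep only the files whose header does not match
mismatched <- basename(files[!ok])
print(mismatched)
```

Because map_lgl() returns a logical vector instead of a list, it can be used directly to subset files.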
Since we changed the working directory at the start, it's good practice to reset it at the end.
setwd(old_wd)
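If the script stops with an error before reaching that last line, the working directory stays changed. One way to guard against that, sketched here with a hypothetical helper called with_temp_wd, is to do the switch inside a function and restore the directory with on.exit().

```r
# Hypothetical helper: evaluate `code` with tempdir() as the working
# directory and restore the previous directory afterwards, even on error
with_temp_wd <- function(code) {
  old_wd <- setwd(tempdir())          # setwd() returns the previous directory
  on.exit(setwd(old_wd), add = TRUE)  # runs when the function exits
  force(code)                         # the argument is evaluated lazily, here
}

before <- getwd()
inside <- with_temp_wd(getwd())
after  <- getwd()

print(identical(before, after))
```

The withr package offers with_dir() for the same purpose, if adding a dependency is acceptable.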