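Trouble using extract_tables() function in tabulizer package:
I am trying to scrape tables from a PDF, but from my local directory rather than from the web browser (the file does not open directly in the browser). So I downloaded the PDF to my local directory and am trying to read just my tables from there.
When I run my code: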
PATH <-"C:\Users\gabrielburcea\Rprojects\Reports_scraping\data_scraped\icnarc_29052020\icnarc_200529.pdf"
test <- extract_tables(PATH, output = "data.frame", pages = c(10, 11))
I get the following error, which I cannot find anywhere on the Internet:
Error in .jcall("RJavaTools", "Ljava/lang/Object;", "newInstance", .jfindClass(class), :
java.io.FileNotFoundException: /Library/Frameworks/R.framework/Versions/3.6/Resources/library (Is a directory)
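Is there a way to solve this?
The .pdf I am trying to scrape was downloaded to my computer from this website. The report is titled ICNARC COVID-19 report 2020-05-29.pdf and can be downloaded using the link on the right-hand side of the page.
Below is the output of traceback() after I received the error message.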
8: stop(list(message = "java.io.FileNotFoundException: /Library/Frameworks/R.framework/Versions/3.6/Resources/library (Is a directory)",
call = .jcall("RJavaTools", "Ljava/lang/Object;", "newInstance",
.jfindClass(class), .jarray(p, "java/lang/Object", dispatch = FALSE),
.jarray(pc, "java/lang/Class", dispatch = FALSE), evalString = FALSE,
evalArray = FALSE, use.true.class = TRUE), jobj = new("jobjRef",
jobj = <pointer: 0x7fd1ba0972b0>, jclass = "java/io/FileNotFoundException")))
7: .jcheck(silent = FALSE)
6: .jcall("RJavaTools", "Ljava/lang/Object;", "newInstance", .jfindClass(class),
.jarray(p, "java/lang/Object", dispatch = FALSE), .jarray(pc,
"java/lang/Class", dispatch = FALSE), evalString = FALSE,
evalArray = FALSE, use.true.class = TRUE)
5: .J(Class@name, ...)
4: new(J("java.io.FileInputStream"), name <- localfile)
3: new(J("java.io.FileInputStream"), name <- localfile)
2: load_doc(file, password = password, copy = copy)
1: extract_tables(PATH, output = "data.frame", pages = c(10, 11))
And sessionInfo() returns this:
R version 3.6.1 (2019-07-05)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: macOS Mojave 10.14.6
Matrix products: default
BLAS: /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/3.6/Resources/lib/libRlapack.dylib
locale:
[1] en_GB.UTF-8/en_GB.UTF-8/en_GB.UTF-8/C/en_GB.UTF-8/en_GB.UTF-8
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] forcats_0.4.0 stringr_1.4.0 dplyr_0.8.5 purrr_0.3.3 readr_1.3.1 tidyr_1.0.3 tibble_3.0.1
[8] ggplot2_3.3.0 tidyverse_1.3.0 tabulizer_0.2.2
loaded via a namespace (and not attached):
[1] Rcpp_1.0.1 cellranger_1.1.0 pillar_1.4.3 compiler_3.6.1 dbplyr_1.4.2
[6] tools_3.6.1 lubridate_1.7.4 jsonlite_1.6 lifecycle_0.2.0 gtable_0.3.0
[11] nlme_3.1-140 lattice_0.20-38 pkgconfig_2.0.2 png_0.1-7 rlang_0.4.6
[16] reprex_0.3.0 cli_2.0.2 DBI_1.0.0 rstudioapi_0.11 haven_2.2.0
[21] rJava_0.9-12 withr_2.1.2 xml2_1.3.2 httr_1.4.1 fs_1.3.1
[26] hms_0.5.3 generics_0.0.2 vctrs_0.3.0 grid_3.6.1 tidyselect_1.1.0
[31] glue_1.3.1 R6_2.4.0 fansi_0.4.0 readxl_1.3.1 modelr_0.1.8
[36] magrittr_1.5 scales_1.0.0 tabulizerjars_1.0.1 backports_1.1.4 ellipsis_0.3.0
[41] rvest_0.3.5 assertthat_0.2.1 colorspace_1.4-1 stringi_1.4.6 munsell_0.5.0
[46] broom_0.5.6 crayon_1.3.4
Thanks in advance for your help!
As mentioned in the comments, the code works fine on Windows.
library(tabulizer)
link <- "https://www.icnarc.org/DataServices/Attachments/Download/8419d345-c7a1-ea11-9126-00505601089b"
dfr.list <- extract_tables(link, output="data.frame", pages=10:11)
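If a local copy is preferred over the URL, it may help to confirm that the path points to an existing file before passing it to extract_tables(). The following is only a sketch: check_pdf_path() is a hypothetical helper, not part of tabulizer, and the folder and file names are placeholders.

```r
# Hypothetical helper: returns the normalized path, or stops with a clear
# message if the path is missing or is a directory.
check_pdf_path <- function(path) {
  path <- path.expand(path)
  if (!file.exists(path) || dir.exists(path)) {
    stop("Not an existing file: ", path)
  }
  normalizePath(path)
}

# Build the path from components so the separator is right on every platform.
# These folder and file names are placeholders.
pdf_path <- file.path("data_scraped", "icnarc_200529.pdf")

# Only call tabulizer once the file is confirmed to exist.
if (file.exists(pdf_path)) {
  test <- tabulizer::extract_tables(check_pdf_path(pdf_path),
                                    output = "data.frame", pages = c(10, 11))
}
```

Building the path with file.path() avoids backslashes, which R treats as escape characters inside string literals.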
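To get each table out of the list, use list2env with envir= set to .GlobalEnv, which is your workspace (the environment whose objects ls() lists, not the working directory returned by getwd()). Beforehand you need to give the unnamed list elements names.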
names(dfr.list) <- paste0("dfr", 1:length(dfr.list)) ## give names
list2env(dfr.list, envir=.GlobalEnv) ## put to environment
ls()
# [1] "dfr.list" "dfr1" "dfr2" "dfr3" "link"
# [2] "tables.list"
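The naming step can be tried without a PDF at all. This is a minimal sketch using a made-up list in place of the extract_tables() result; the names tables and tables_env are assumptions for illustration, and it writes to a fresh environment rather than .GlobalEnv so nothing in the workspace gets overwritten.

```r
# Toy stand-in for the list returned by extract_tables(); the data are made up.
tables <- list(
  data.frame(x = 1:2),
  data.frame(y = c("a", "b"))
)

# seq_along() is safer than 1:length() because it yields an empty
# sequence when the list has no elements.
names(tables) <- paste0("dfr", seq_along(tables))

# Copy each element into a dedicated environment instead of .GlobalEnv.
tables_env <- list2env(tables, envir = new.env())

# List the objects that were created.
sort(ls(tables_env))
```

Passing envir = .GlobalEnv instead, as in the answer, puts dfr1, dfr2 and so on directly into the workspace.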
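.pdf extraction is usually not perfect, and we have to clean the data afterwards. To improve the result, try playing around with the area=, columns= and other options of extract_tables; read the help page ?extract_tables and consult the documentation.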
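As a rough illustration of the area= option, the sketch below wraps extract_tables() so that a single page is read from a fixed region with table auto-detection switched off. extract_region() is a hypothetical wrapper and the coordinates in the usage comment are placeholders, not values measured on the ICNARC report; check ?extract_tables for the exact argument semantics in your installed version.

```r
# Hypothetical wrapper: extract one table from a fixed region of one page.
extract_region <- function(file, page, area) {
  # area is expected as c(top, left, bottom, right)
  stopifnot(is.numeric(area), length(area) == 4)
  tabulizer::extract_tables(
    file,
    pages  = page,
    area   = list(area),   # one coordinate vector per requested page
    guess  = FALSE,        # use the given area instead of auto-detection
    output = "data.frame"
  )
}

# Usage (placeholder coordinates):
# dfr.area <- extract_region(link, page = 10, area = c(100, 50, 500, 550))
```

The tabulizer function locate_areas() can be used interactively to find suitable coordinates for a given page.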