Creating columns from a scraped PDF by splitting on spaces

I am trying to create a data frame from the following PDF:

library(tabulizer)
url <- "https://doccs.ny.gov/system/files/documents/2020/06/doccs-covid-19-confirmed-by-facility-6.30.2020.pdf"
tab1 <- extract_tables(url)

However, when I print tab1 it has only one column:

      [,1]                                                                     
 [1,] "NYS DOCCS INCARCERATED INDIVIDUALS COVID-19 REPORT BY REPORTED FACILITY"
 [2,] "AS OF JUNE 29, 2020 AT 3:00 PM"                                         
 [3,] "POSITIVE CASE STATUS OTHER TESTS"                                       
 [4,] "TOTAL"                                                                  
 [5,] "FACILITY RECOVERED DECEASED POSITIVE PENDING NEGATIVE"                  
 [6,] "TOTAL 495 16 519 97 805"                                                
 [7,] "ADIRONDACK 0 0 0 75 0"                                                  
 [8,] "ALBION 0 0 0 0 2"                                                       
 [9,] "ALTONA 0 0 0 0 1"  

                                                 

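I would like to split each row into its proper columns to build a data frame (for example, row 7 should become: Facility ("ADIRONDACK"), Recovered (0), Deceased (0), Positive (0), Pending (75), Negative (0)). I thought the most efficient way would be to split the strings in tab1 on spaces, but that does not work because some facility names contain more than one word, so splitting on spaces breaks them apart. Does anyone have an idea for a solution? Thanks for your help!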
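To make the problem concrete, here is a small sketch using two lines copied from the output above (the variable name `rows` is only for illustration). Splitting on spaces gives six fields for a one-word facility but seven for a two-word one:

```r
# Two rows as they appear in the single extracted column.
rows <- c("ADIRONDACK 0 0 0 75 0",
          "BEDFORD HILLS 43 1 44 5 53")

# Split every row on single spaces and count the resulting fields.
fields <- strsplit(rows, " ")
n_fields <- lengths(fields)
print(n_fields)
```

The differing field counts are why a plain split cannot be bound directly into a six-column data frame.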

Here is a workaround:

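library(tabulizer)

url <- "https://doccs.ny.gov/system/files/documents/2020/06/doccs-covid-19-confirmed-by-facility-6.30.2020.pdf"
tab1 <- extract_tables(url)

# Keep only the data rows (row 5 is the header, rows 1-4 are title lines)
plouf <- tab1[[1]][6:nrow(tab1[[1]]), ]
# Join two-word facility names with "_"; backreferences must be written \\1 and \\2
plouf <- gsub("([A-Z]+) ([A-Z]+)", "\\1_\\2", plouf)
# One row per line, fields separated by a single space
df <- read.table(text = paste(plouf, collapse = "\n"), sep = " ")
# Column names come from the untouched header row
names(df) <- strsplit(tab1[[1]][5, ], " ")[[1]]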

           FACILITY RECOVERED DECEASED POSITIVE PENDING NEGATIVE
1             TOTAL       495       16      519      97      805
2        ADIRONDACK         0        0        0      75        0
3            ALBION         0        0        0       0        2
4            ALTONA         0        0        0       0        1
5            ATTICA         2        0        2       1        7
6            AUBURN         0        0        0       0       10
7         BARE_HILL         0        0        0       0        6
8     BEDFORD_HILLS        43        1       44       5       53
9      CAPE_VINCENT         0        0        0       0        0
10           CAYUGA         0        0        0       2        1
11          CLINTON         1        0        1       0       25
12          COLLINS         1        0        1       0       13
13        COXSACKIE         1        0        1       0       57
14        DOWNSTATE         1        0        1       0       12
15          EASTERN        17        1       20       0       17
16        EDGECOMBE         0        0        0       0        0
17           ELMIRA         0        0        0       1       20
18         FISHKILL        78        5       83       4       98
19      FIVE_POINTS         0        0        0       0        4
20         FRANKLIN         1        0        1       0       24

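I keep the part of the table that follows the header row, then replace the space inside multi-word FACILITY names using gsub (I actually replace it with _, so you can change it back to a space afterwards if you want; you could also use str_replace_all from stringr instead of gsub).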
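To look at the substitution step on its own, here is a minimal sketch that runs on a few hard-coded lines copied from the table instead of the PDF. The names `rows`, `joined`, `joined_any` and `three_words` are made up for illustration, and the three-word string is invented test input, not a real facility. The second pattern is an alternative I am suggesting, not part of the code above: it replaces every space that is followed by a letter, which would also handle names of three or more words if any occur.

```r
# Sample rows copied from the extracted table (no PDF needed).
rows <- c("TOTAL 495 16 519 97 805",
          "BEDFORD HILLS 43 1 44 5 53",
          "FIVE POINTS 0 0 0 0 4")

# Join two adjacent upper-case words with an underscore.
# Inside an R string the backreferences are written \\1 and \\2.
joined <- gsub("([A-Z]+) ([A-Z]+)", "\\1_\\2", rows)
print(joined)

# Alternative pattern (assumption, not from the answer): replace any
# space that is immediately followed by a letter. Needs perl = TRUE
# because it uses a lookahead.
joined_any <- gsub(" (?=[A-Z])", "_", rows, perl = TRUE)
print(joined_any)

# Invented three-word input to show where the two patterns differ.
three_words <- "AAA BBB CCC 1 2 3 4 5"
pair_result <- gsub("([A-Z]+) ([A-Z]+)", "\\1_\\2", three_words)
any_result  <- gsub(" (?=[A-Z])", "_", three_words, perl = TRUE)
print(c(pair_result, any_result))
```

On the sample rows both patterns give the same result; they only diverge when a name has more than two words, where the pairwise pattern leaves the last space in place.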

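Then I use read.table, forcing the text to have a line ending after each row. I add the column names afterwards (otherwise the header line would be altered by the gsub and read.table would not read the names correctly).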
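The reading step can be tried without the PDF as well. The following is only a sketch on hand-copied lines: `header` and `rows` are illustrative names, and the rows are assumed to already have their underscores in place. It also shows one way to turn the underscores back into spaces once the columns exist.

```r
# Header line and a few data rows copied from the table above.
header <- "FACILITY RECOVERED DECEASED POSITIVE PENDING NEGATIVE"
rows <- c("TOTAL 495 16 519 97 805",
          "ADIRONDACK 0 0 0 75 0",
          "BEDFORD_HILLS 43 1 44 5 53")

# Put one row per line and let read.table split on single spaces.
df <- read.table(text = paste(rows, collapse = "\n"), sep = " ",
                 stringsAsFactors = FALSE)

# Add the column names afterwards, taken from the untouched header.
names(df) <- strsplit(header, " ")[[1]]

# Optionally restore the spaces in the facility names.
df$FACILITY <- gsub("_", " ", df$FACILITY)
str(df)
```

Because read.table guesses column types, the five count columns come back as integers and only FACILITY stays character.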

Here is how I would approach this using the "lattice" method of extract_tables from the tabulizer package.

#install.packages("tidyverse")
library(tidyverse)
#install.packages("janitor")
library(janitor)
#install.packages("tabulizer")
library(tabulizer)
url <- "https://doccs.ny.gov/system/files/documents/2020/06/doccs-covid-19-confirmed-by-facility-6.30.2020.pdf"
tab1 <- tabulizer::extract_tables(url, method = "lattice") %>%
  as.data.frame() %>%                    # turn the extracted table into a data frame
  dplyr::slice(-1, -2) %>%               # drop the first two rows, which precede the header
  janitor::row_to_names(row_number = 1)  # promote the next row to column names
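One thing to keep in mind: extract_tables returns character matrices by default, so after this pipeline the counts are most likely still stored as text. Below is a minimal base R sketch of converting them, using a hand-typed data frame `tab` as a stand-in for the real result (its exact shape is my assumption; the values are copied from the table printed earlier).

```r
# Stand-in for the data frame produced by the pipeline above;
# every column is character, as it would be after extraction.
tab <- data.frame(
  FACILITY  = c("TOTAL", "ADIRONDACK", "BEDFORD HILLS"),
  RECOVERED = c("495", "0", "43"),
  DECEASED  = c("16", "0", "1"),
  POSITIVE  = c("519", "0", "44"),
  PENDING   = c("97", "75", "5"),
  NEGATIVE  = c("805", "0", "53"),
  stringsAsFactors = FALSE
)

# Convert every column except FACILITY to integer.
count_cols <- setdiff(names(tab), "FACILITY")
tab[count_cols] <- lapply(tab[count_cols], as.integer)
str(tab)
```

If a cell holds something other than digits, as.integer will yield NA with a warning, which is a convenient signal that the extraction needs a closer look.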