Creating columns from a scraped PDF by cutting on spaces
I am trying to create a data frame from the following PDF:
library(tabulizer)
url <- "https://doccs.ny.gov/system/files/documents/2020/06/doccs-covid-19-confirmed-by-facility-6.30.2020.pdf"
tab1 <- extract_tables(url)
However, when I call tab1, it has only one column:
[,1]
[1,] "NYS DOCCS INCARCERATED INDIVIDUALS COVID-19 REPORT BY REPORTED FACILITY"
[2,] "AS OF JUNE 29, 2020 AT 3:00 PM"
[3,] "POSITIVE CASE STATUS OTHER TESTS"
[4,] "TOTAL"
[5,] "FACILITY RECOVERED DECEASED POSITIVE PENDING NEGATIVE"
[6,] "TOTAL 495 16 519 97 805"
[7,] "ADIRONDACK 0 0 0 75 0"
[8,] "ALBION 0 0 0 0 2"
[9,] "ALTONA 0 0 0 0 1"
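I would like to split each row into the columns it should have in order to build a data frame (for example, for row 7, I would extract its contents into the following columns: Facility ("Adirondack"), Recovered (0), Deceased (0), Positive (0), Pending (75), Negative (0)). I thought the most efficient approach would be to cut tab1 on spaces, but that does not work because some facility names contain more than one word, so a cut on spaces splits them. Does anyone have an idea for a solution? Thanks for your help!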
Here is a workaround:
library(tabulizer)
url <- "https://doccs.ny.gov/system/files/documents/2020/06/doccs-covid-19-confirmed-by-facility-6.30.2020.pdf"
tab1 <- extract_tables(url)
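# keep only the data rows (everything after the header on row 5)
plouf <- tab1[[1]][6:dim(tab1[[1]])[1],]
# join two-word facility names with "_"; backreferences need a double backslash in R
plouf <- gsub("([A-Z]+) ([A-Z]+)","\\1_\\2",plouf)
# one record per line, fields separated by a single space
df <- read.table(text = paste0(t(plouf) ,collapse = "\n"),sep = " ")
# take the column names from the header row
names(df) <- strsplit(tab1[[1]][5,]," ")[[1]]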
FACILITY RECOVERED DECEASED POSITIVE PENDING NEGATIVE
1 TOTAL 495 16 519 97 805
2 ADIRONDACK 0 0 0 75 0
3 ALBION 0 0 0 0 2
4 ALTONA 0 0 0 0 1
5 ATTICA 2 0 2 1 7
6 AUBURN 0 0 0 0 10
7 BARE_HILL 0 0 0 0 6
8 BEDFORD_HILLS 43 1 44 5 53
9 CAPE_VINCENT 0 0 0 0 0
10 CAYUGA 0 0 0 2 1
11 CLINTON 1 0 1 0 25
12 COLLINS 1 0 1 0 13
13 COXSACKIE 1 0 1 0 57
14 DOWNSTATE 1 0 1 0 12
15 EASTERN 17 1 20 0 17
16 EDGECOMBE 0 0 0 0 0
17 ELMIRA 0 0 0 1 20
18 FISHKILL 78 5 83 4 98
19 FIVE_POINTS 0 0 0 0 4
20 FRANKLIN 1 0 1 0 24
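I keep the part of the table that comes after the header row, then replace the space between the words of a FACILITY name using gsub (I actually replace it with _, so you can change it back to a space afterwards if you want. You could also use str_replace from stringr instead of gsub).
Then I use read.table, forcing the text to have an end-of-line after each row. I add the names afterwards (because otherwise they would be altered by gsub and read.table, which would not read them correctly).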
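One caveat with the gsub pattern above: it joins words in pairs, so a facility name with three or more words would still keep one space. Below is a minimal sketch of a variant that replaces every space followed by a capital letter instead. It runs on a few rows copied from the output in the question rather than on the PDF itself, and the names rows, header and rows_joined are only illustrative. It assumes every facility word starts with a capital letter and every count starts with a digit.

```r
# Sample rows copied from the printed tab1 output (no PDF needed here).
rows <- c(
  "TOTAL 495 16 519 97 805",
  "ADIRONDACK 0 0 0 75 0",
  "BARE HILL 0 0 0 0 6",
  "BEDFORD HILLS 43 1 44 5 53"
)
header <- "FACILITY RECOVERED DECEASED POSITIVE PENDING NEGATIVE"

# Replace only the spaces that are followed by a capital letter,
# i.e. the spaces inside a facility name (lookahead needs perl = TRUE).
rows_joined <- gsub(" (?=[A-Z])", "_", rows, perl = TRUE)

# Each element of the character vector is read as one line.
df <- read.table(
  text = rows_joined,
  sep = " ",
  col.names = strsplit(header, " ")[[1]],
  stringsAsFactors = FALSE
)

# Put the spaces back into the facility names.
df$FACILITY <- gsub("_", " ", df$FACILITY)
print(df)
```

Passing col.names directly to read.table sets the header in the same call, and the count columns come back as integers.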
Here is how I would approach this using the "lattice" method for table extraction from the tabulizer package.
#install.packages("tidyverse")
library(tidyverse)
#install.packages("janitor")
library(janitor)
#install.packages("tabulizer")
library(tabulizer)
url <- "https://doccs.ny.gov/system/files/documents/2020/06/doccs-covid-19-confirmed-by-facility-6.30.2020.pdf"
tab1 <- tabulizer::extract_tables(url, method = "lattice") %>%
  as.data.frame() %>%
  dplyr::slice(-1,-2) %>%
  janitor::row_to_names(row_number = 1)
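Since extract_tables returns character matrices by default, the columns of the resulting data frame will be character too, so the counts probably need converting before any arithmetic. Here is a rough sketch of that step in base R. The matrix raw is a hypothetical stand-in for one extracted table (I am assuming the header ends up in the first row), not the actual lattice output for this PDF.

```r
# Hypothetical stand-in for one extracted table: a character matrix
# with the header in the first row.
raw <- matrix(
  c("FACILITY", "RECOVERED", "DECEASED", "POSITIVE", "PENDING", "NEGATIVE",
    "TOTAL", "495", "16", "519", "97", "805",
    "BARE HILL", "0", "0", "0", "0", "6"),
  ncol = 6, byrow = TRUE
)

# Drop the header row from the data and use it as column names.
df <- as.data.frame(raw[-1, , drop = FALSE], stringsAsFactors = FALSE)
names(df) <- raw[1, ]

# Convert every column except FACILITY from character to integer.
num_cols <- setdiff(names(df), "FACILITY")
df[num_cols] <- lapply(df[num_cols], as.integer)
str(df)
```

With the tidyverse already loaded, dplyr::mutate(dplyr::across(-FACILITY, as.integer)) should do the same conversion at the end of the pipe.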