Scraping a PDF in R with Nested Information
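I am trying to scrape a fairly difficult PDF in R using pdftools::pdf_text and tabulizer::extract_tables. However, given the nature of the PDF, neither seems very useful in my case. The PDF contains "nested" information, as shown in the picture.
What is the best way to approach this? Splitting on whitespace with stringr::str_split_fixed and n = 3 gives me a matrix, but building a regular expression to pick out the information I want (only the Description and Incident Date/Time) within each column seems too difficult.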
I don't think the regex approach is all that complicated:
library(pdftools)
library(tidyverse)
library(magrittr)  # provides add()

mylog <- "https://www.lsu.edu/police/files/crime-log/2021/jan2021.pdf"
pdf.text <- pdf_text(mylog)  # one character string per page

map_dfr(pdf.text, ~ {
  # Split the page into individual lines
  str_split(.x, "\n") %>% unlist() -> vectors
  # Each header line is followed by the line holding its values, hence add(1)
  vectors %>% str_detect("^Case") %>% which %>% add(1) -> cases
  vectors %>% str_detect("^Desc") %>% which %>% add(1) -> descriptions
  vectors %>% str_detect("^Addr") %>% which %>% add(1) -> addresses
  # Split on 2+ spaces, a space before a d/ or dd/ date, or whitespace after AM/PM
  vectors[cases] %>% str_split("(\\s{2,}|\\s(?=[0-9]{1,2}/)|(?<=[AP]M)\\s+)") %>%
    map_dfr(~ setNames(., c("Case.Number", "Date.Report", "Date.Incident", "Case.Status")[seq_along(.)])) -> cases
  # Description and end date/time are separated by 2+ spaces
  vectors[descriptions] %>% str_split("\\s{2,}") %>%
    map_dfr(~ setNames(., c("Description", "Date.Incident.End")[seq_along(.)])) -> descriptions
  # One row per case: case fields, description fields, address
  bind_cols(cases, descriptions, data.frame(Address = vectors[addresses]))
})
# A tibble: 155 x 7
   Case.Number  Date.Report     Date.Incident     Case.Status Description                   Date.Incident.End   Address
   <chr>        <chr>           <chr>             <chr>       <chr>                         <chr>               <chr>
 1 20210101-001 January 01, 20… 1/1/2021 10:28:0… Inactive    COMPLAINT ANIMAL              1/1/2021 10:28:00AM UREC FIELDS
 2 20210101-002 January 01, 20… 1/1/2021 2:48:00… Inactive    911 HNGUP/OP - 911 HANG-UP/O… 1/1/2021 2:48:00PM  PMAC
 3 20210101-003 January 01, 20… 1/1/2021 3:27:00… Pending     UNAUTHORIZED ENTRY OF A PLAC… 1/1/2021 3:27:00PM  COMPANION ANIMAL AL…
 4 20210102-001 January 02, 20… 1/2/2021 5:12:00… Inactive    SUSPICIOUS INCIDENT           1/2/2021 5:12:00PM  TIGER STADIUM
 5 20210103-001 January 03, 20… 12/23/2020 12:00… Pending     HIT AND RUN                   1/3/2021 9:15:00AM  BROUSSARD HALL TRAF…
 6 20210103-002 January 03, 20… 1/3/2021 9:28:46… Inactive    DISTURBANCE                   1/3/2021 9:28:00PM  VET SCHOOL
 7 20210104-001 January 04, 20… 11/23/2018 11:00… Inactive    NONCRIMINAL INFORMATION ONLY  11/23/2018 11:00:0… Oaks Lot
 8 20210104-002 January 04, 20… 1/4/2021 7:26:00… Inactive    SUSPICIOUS INCIDENT           1/4/2021 7:26:00AM  ECE
 9 20210104-003 January 04, 20… 8/1/2017 12:00:0… Pending     INVESTIGATN - INVESTIGATION   1/2/2021 3:00:00PM  EAST CAMPUS APARTME…
10 20210104-004 January 04, 20… 1/4/2021 12:30:0… Pending     HIT AND RUN                   1/4/2021 12:30:00PM HIGHLAND ROAD @ STU…
# … with 145 more rows
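
To see what the two split patterns do without downloading the PDF, here is a small sketch on made-up lines. The `vectors` content below is hypothetical (in particular the header wording and the spacing are assumptions, not copied from the log), so the real output of pdf_text() may need the patterns adjusted.

```r
library(stringr)

# Hypothetical lines imitating one record; not taken from the actual PDF
vectors <- c(
  "Case Number   Date Reported   Incident Date/Time   Case Status",
  "20210101-001  January 01, 2021 1/1/2021 10:28:00AM Inactive",
  "Description   Incident End",
  "COMPLAINT ANIMAL     1/1/2021 10:28:00AM",
  "Address",
  "UREC FIELDS"
)

# Values sit on the line right after each header line, hence the + 1
case_rows <- which(str_detect(vectors, "^Case")) + 1
desc_rows <- which(str_detect(vectors, "^Desc")) + 1

# Split on 2+ spaces, a space before a d/ or dd/ date, or whitespace after AM/PM
case_pattern <- "(\\s{2,}|\\s(?=[0-9]{1,2}/)|(?<=[AP]M)\\s+)"
case_parts <- str_split(vectors[case_rows], case_pattern)[[1]]

# Description and end date/time are separated by 2+ spaces only
desc_parts <- str_split(vectors[desc_rows], "\\s{2,}")[[1]]

print(case_parts)
print(desc_parts)
```

The lookahead keeps "January 01, 2021" in one piece because the space there is not followed by one or two digits and a slash, while the lookbehind separates the AM/PM time from the status even when only a single space lies between them.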