Scraping a PDF in R with Nested Information

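I am trying to scrape a rather difficult PDF in R using pdftools::pdf_text or tabulizer::extract_tables. However, given the layout of this PDF, neither seems very useful. The PDF contains "nested" information, as shown in the image.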

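What is the best way to approach this? Splitting on whitespace with stringr::str_split_fixed and n = 3 gives me matrices, but writing a regular expression to pick out the information I want from each column (only Description and Incident Date/Time) seems too difficult.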

I don't think the regex approach is that complicated:

library(pdftools)
library(tidyverse)
library(magrittr)  # provides add()
mylog <- "https://www.lsu.edu/police/files/crime-log/2021/jan2021.pdf"
pdf.text <- pdf_text(mylog)  # one character string per page
map_dfr(pdf.text, ~ {
  # Break the page into individual lines
  str_split(.x, "\n") %>% unlist() -> vectors
  # The values sit on the line right after each header line, hence add(1)
  vectors %>% str_detect("^Case") %>% which %>% add(1) -> cases
  vectors %>% str_detect("^Desc") %>% which %>% add(1) -> descriptions
  vectors %>% str_detect("^Addr") %>% which %>% add(1) -> addresses
  # Split on 2+ spaces, a space before a d/ or dd/ date, or whitespace after AM/PM
  vectors[cases] %>% str_split("(\\s{2,}|\\s(?=[0-9]{1,2}/)|(?<=[AP]M)\\s+)") %>%
    map_dfr(~setNames(.,c("Case.Number","Date.Report","Date.Incident","Case.Status")[seq_along(.)])) -> cases
  # Description and end date are separated by 2+ spaces
  vectors[descriptions] %>% str_split("\\s{2,}") %>%
    map_dfr(~setNames(.,c("Description","Date.Incident.End")[seq_along(.)])) -> descriptions
  # Combine the three pieces column-wise into one row per case
  bind_cols(cases,descriptions,data.frame(Address = vectors[addresses]))
  })
# A tibble: 155 x 7
   Case.Number  Date.Report     Date.Incident     Case.Status Description                   Date.Incident.End   Address             
   <chr>        <chr>           <chr>             <chr>       <chr>                         <chr>               <chr>               
 1 20210101-001 January 01, 20… 1/1/2021 10:28:0… Inactive    COMPLAINT ANIMAL              1/1/2021 10:28:00AM UREC FIELDS         
 2 20210101-002 January 01, 20… 1/1/2021 2:48:00… Inactive    911 HNGUP/OP - 911 HANG-UP/O… 1/1/2021 2:48:00PM  PMAC                
 3 20210101-003 January 01, 20… 1/1/2021 3:27:00… Pending     UNAUTHORIZED ENTRY OF A PLAC… 1/1/2021 3:27:00PM  COMPANION ANIMAL AL…
 4 20210102-001 January 02, 20… 1/2/2021 5:12:00… Inactive    SUSPICIOUS INCIDENT           1/2/2021 5:12:00PM  TIGER STADIUM       
 5 20210103-001 January 03, 20… 12/23/2020 12:00… Pending     HIT AND RUN                   1/3/2021 9:15:00AM  BROUSSARD HALL TRAF…
 6 20210103-002 January 03, 20… 1/3/2021 9:28:46… Inactive    DISTURBANCE                   1/3/2021 9:28:00PM  VET SCHOOL          
 7 20210104-001 January 04, 20… 11/23/2018 11:00… Inactive    NONCRIMINAL INFORMATION ONLY  11/23/2018 11:00:0… Oaks Lot            
 8 20210104-002 January 04, 20… 1/4/2021 7:26:00… Inactive    SUSPICIOUS INCIDENT           1/4/2021 7:26:00AM  ECE                 
 9 20210104-003 January 04, 20… 8/1/2017 12:00:0… Pending     INVESTIGATN - INVESTIGATION   1/2/2021 3:00:00PM  EAST CAMPUS APARTME…
10 20210104-004 January 04, 20… 1/4/2021 12:30:0… Pending     HIT AND RUN                   1/4/2021 12:30:00PM HIGHLAND ROAD @ STU…
# … with 145 more rows
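
To see how the "line after the header" lookup and the three-way split pattern work together, here is a minimal sketch on a made-up page. The header wording and the sample lines are assumptions modeled on the output above, not text copied from the PDF, so the real spacing may differ.

```r
library(stringr)

# Hypothetical page lines, shaped like the log above (not copied from the PDF)
page <- c(
  "Case Number    Date Reported    Incident Date/Time    Status",
  "20210101-001   January 01, 2021 1/1/2021 10:28:00AM Inactive",
  "Description    Incident End",
  "COMPLAINT ANIMAL   1/1/2021 10:28:00AM"
)

# The values sit on the line right after the header, so shift the index by 1
case_line <- page[which(str_detect(page, "^Case")) + 1]

# Split on: 2+ spaces, OR a space right before a d/ or dd/ date,
# OR whitespace right after an AM/PM marker
pattern <- "(\\s{2,}|\\s(?=[0-9]{1,2}/)|(?<=[AP]M)\\s+)"
parts <- str_split(case_line, pattern)[[1]]

# Name only as many fields as were actually found
fields <- c("Case.Number", "Date.Report", "Date.Incident", "Case.Status")
setNames(parts, fields[seq_along(parts)])
```

The single spaces inside "January 01, 2021" and between the date and the time are left alone, because none of the three alternatives matches there. Indexing the names with seq_along() keeps setNames() from failing when a line yields fewer pieces than expected.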