Extracting strings from a PDF with R

Question

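I have a PDF file from the European Parliament, which you can download here. I have downloaded it and read it into R. It contains lists of Members of the European Parliament (MEPs) recorded after the votes in a session.

I only want to extract part of these lists. Specifically, I want to extract the names located between "AVGIVNA RÖSTER" and "0" and put them in a table (see the text highlighted in this screenshot).

A series of similar name lists is repeated throughout the PDF, each referring to a specific vote. I want all of them in the table. The MEPs' names change, but the structure stays the same: they are always located between "AVGIVNA RÖSTER" and "0".

I thought about using the startsWith function and a for loop, but I am struggling to write it.

Here is what I have done so far: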

library(pdftools)
library(tidyverse)

votetext <- pdftools::pdf_text("MEP.pdf") %>%
  readr::read_lines()

Answer 1

You could try something like this:

votetext <- pdftools::pdf_text("MEP.pdf") %>%
  readr::read_lines()

a <- which(grepl("AVGIVNA RÖSTER", votetext)) # start of each block
b <- which(grepl("^\\s*0\\s*$", votetext)) # end of each block (backslashes must be doubled in R strings)

sapply(a, function(x){paste(votetext[x:(min(b[b > x]))], collapse = ". ")})

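Note that in the definition of b I use \s* (written "\\s*" inside an R string) to allow for whitespace around the 0. More generally, you can strip leading and trailing whitespace first; see this question.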
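As a quick sanity check of that pattern, here is a minimal sketch on a few made-up lines (the `lines` vector is hypothetical and only stands in for the PDF text). It suggests that the anchored pattern picks out a lone, padded "0" but not lines that merely contain a zero.

```r
# Made-up lines standing in for the PDF text (for illustration only)
lines <- c("  AVGIVNA RÖSTER ", "Name One, Name Two", "   0  ", "10", "0 votes")

# The anchored pattern matches only lines consisting of "0" plus optional padding
is_end <- grepl("^\\s*0\\s*$", lines)
is_end
# FALSE FALSE TRUE FALSE FALSE

# The same check after trimming with base R's trimws()
is_end_trimmed <- trimws(lines) == "0"
```

Both approaches should flag the same lines here; trimming first simply makes the later comparisons easier to read.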

In your case you could do this:

votetext2 <- pdftools::pdf_text("MEP.pdf") %>%
  readr::read_lines() %>%
  str_remove("^\\s*") %>% # remove whitespace at the beginning
  str_remove("\\s*$") %>% # remove whitespace at the end
  str_replace_all("\\s+", " ") # replace runs of whitespace with a single space

a2 <- which(votetext2 == "AVGIVNA RÖSTER")
b2 <- which(votetext2 == "0")

result <- sapply(a2, function(x){paste(votetext2[x:(min(b2[b2 > x]))], collapse = ". ")})

result then looks like this:

"AVGIVNA RÖSTER. Martin Hojsík, Naomi Long, Margarida Marques, Pedro Marques, Manu Pineda, Ramona Strugariu, Marie Toussaint,. + Dragoş Tudorache, Marie-Pierre Vedrenne. -. Agnès Evren. 0"
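Since the question asks for a table, the same indexing idea can be pushed one step further. The following is only a sketch on made-up lines shaped like the cleaned text: the names in `lines` are invented, and `starts`, `ends`, `blocks` and `votes` are hypothetical helper names. It assumes every "AVGIVNA RÖSTER" marker is eventually followed by a "0" line.

```r
# Made-up lines shaped like the cleaned text (not taken from the real PDF)
lines <- c(
  "some header",
  "AVGIVNA RÖSTER",
  "Anna Alpha, Ben Beta,",
  "Cara Gamma",
  "0",
  "other text",
  "AVGIVNA RÖSTER",
  "Eva Epsilon",
  "0"
)

starts <- which(lines == "AVGIVNA RÖSTER")
ends   <- which(lines == "0")

# For each start marker, keep the lines strictly between it and the next "0"
# (assumes such a "0" exists; seq_len() also copes with an empty block)
blocks <- lapply(starts, function(s) {
  e <- min(ends[ends > s])
  lines[s + seq_len(e - s - 1)]
})

# One row per vote block, with the block's lines collapsed into one string
votes <- data.frame(
  vote  = seq_along(blocks),
  names = vapply(blocks, paste, character(1), collapse = " "),
  stringsAsFactors = FALSE
)
votes
```

Splitting the `names` column into one row per MEP would be a further step and depends on how reliably the names are comma-separated in the real file.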
