Create a data frame from a PDF based on a string, for export to CSV
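I would like to split the information in a PDF document based on the presence of colons. A sample is here.
An updated four-page PDF can be downloaded from this link.
This is what I have tried so far. After reading the PDF, I attempt to split its text at the colons.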
library(textreadr)
dat <- '~Here is the thing1.pdf' %>%
textreadr::read_pdf()
dat
Source: local data frame [26 x 3]
page_id element_id text
1 1 1 Here is the thing.
2 1 2 Case ID 1
3 1 3 Exploring Angels: It is a long establish
4 1 4 page when looking at its layout. The poi
5 1 5 distribution of letters, as opposed to u
6 1 6 English. Many desktop publishing package
7 1 7 model text, and a search for 'lorem ipsu
8 1 8 versions have evolved over the years, so
9 1 9 and the like).
10 1 10 New agency: Lorem Ipsum is simply dummy
.. ... ... ...
or
library(pdftools)
dat <- pdf_text("~Here is the thing1.pdf")
dat1 <- strsplit(dat[[1]], "\n")[[1]]
head(dat1)
[1] "Here is the thing.\r"
[2] "Case ID 1\r"
[3] "Exploring Angels: It is a long established fact that a reader will be distracted by the readable content of a\r"
[4] "page when looking at its layout. The point of using Lorem Ipsum is that it has a more-or-less normal\r"
[5] "distribution of letters, as opposed to using 'Content here, content here', making it look like readable\r"
[6] "English. Many desktop publishing packages and web page editors now use Lorem Ipsum as their default\r"
dat2 <- dat1 %>%
str_split(pattern = "\r")
head(dat2)
[[1]]
[1] "Here is the thing." ""
[[2]]
[1] "Case ID 1" ""
[[3]]
[1] "Exploring Angels: It is a long established fact that a reader will be distracted by the readable content of a"
[2] ""
[[4]]
[1] "page when looking at its layout. The point of using Lorem Ipsum is that it has a more-or-less normal"
[2] ""
[[5]]
[1] "distribution of letters, as opposed to using 'Content here, content here', making it look like readable"
[2] ""
[[6]]
[1] "English. Many desktop publishing packages and web page editors now use Lorem Ipsum as their default"
[2] ""
I would like to arrange my data into a table like this:
  Case.ID                            Exploring.Angels                       New.agency New.Factor New.Factor2 Creative.One
1       1 It is a long established fact that a reader Lorem Ipsum is simply dummy text        ABC         BNM         <NA>
2       2               Various versions have evolved    It has survived not only five        ABC        <NA>          DFZ
Here is how I would do it with the tidyverse:
library(tidyverse)
# read in the file, separate by line, convert to tibble
pdftools::pdf_text("../_xlam/Here is the thing1.pdf") %>%
  # "\r?\n" matches both Windows (\r\n) and Unix (\n) line endings
  str_split("\r?\n") %>%
  unlist() %>%
  tibble(value = .) %>%
  # separate cases and mark lines containing colon
  mutate(case = cumsum(str_detect(value, "Case ID")),
         tag_line = str_detect(value, ": ")) %>%
  # drop lines with Case ID, separate tag from text, move text into one column, fill the tags
  filter(!str_detect(value, "Case ID")) %>%
  separate(value, into = c("key", "text"), sep = ": ", fill = "right", extra = "merge") %>%
  mutate(text = ifelse(is.na(text), key, text),
         key = ifelse(tag_line, key, NA)) %>%
  fill(key) %>%
  # summarize text by concatenation
  group_by(case, key) %>%
  summarise(text = paste(text, collapse = " "), .groups = "drop") %>%
  # filter away the `Here is the thing` line
  drop_na(key) %>%
  # move values to columns
  spread(key = key, value = text)
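To try the pipeline above without the PDF, here is a minimal sketch that runs the same steps on a few in-memory lines and then writes the result to CSV. The sample text (`raw_text`), the object names `wide` and `csv_path`, and the output file name are assumptions made for illustration, not part of the original data. It also swaps `spread()` for `pivot_wider()`, which tidyr recommends in newer versions; the rest of the logic is unchanged.

```r
library(dplyr)
library(tidyr)
library(stringr)

# Hypothetical sample standing in for one page returned by pdftools::pdf_text();
# the wording is abbreviated from the sample shown in the question.
raw_text <- paste(
  "Here is the thing.",
  "Case ID 1",
  "Exploring Angels: It is a long established fact",
  "that a reader will be distracted.",
  "New agency: Lorem Ipsum is simply dummy text",
  "New Factor: ABC",
  "Case ID 2",
  "Exploring Angels: Various versions have evolved",
  "Creative One: DFZ",
  sep = "\r\n"
)

# Split into lines, tolerating both \r\n and \n line endings
lines <- str_split(raw_text, "\r?\n")[[1]]

wide <- tibble(value = lines) %>%
  # number the cases and flag lines that start a new "key: text" field
  mutate(case = cumsum(str_detect(value, "Case ID")),
         tag_line = str_detect(value, ": ")) %>%
  filter(!str_detect(value, "Case ID")) %>%
  separate(value, into = c("key", "text"), sep = ": ",
           fill = "right", extra = "merge") %>%
  # continuation lines have no key of their own: keep their text, inherit the key
  mutate(text = ifelse(is.na(text), key, text),
         key = ifelse(tag_line, key, NA)) %>%
  fill(key) %>%
  group_by(case, key) %>%
  summarise(text = paste(text, collapse = " "), .groups = "drop") %>%
  # lines before the first key (the title line) have no key and are dropped
  drop_na(key) %>%
  # one row per case, one column per key; missing keys become NA
  pivot_wider(names_from = key, values_from = text)

# Write the result to a CSV file (the path is an assumption for this sketch)
csv_path <- file.path(tempdir(), "cases.csv")
write.csv(wide, csv_path, row.names = FALSE)
```

Keys that appear in only some cases (here `New agency`, `New Factor` and `Creative One`) end up as `NA` in the cases that lack them, which matches the `<NA>` cells in the desired table.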