Importing multiple invoices (.PDF) in R. Turning them from strings to a tibble
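I am working on a project where I need to load a large number of .pdf files into R. That part is covered. The problem is that when a pdf is imported into R, every line is a string. Not all of the information in a string is relevant, and in some cases information is missing. So I want to select the information I need and put it into a tibble for further analysis.
Importing the pdf is done with pdftools. It is working, but hints or tips are welcome.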
library(pdftools)
library(tidyverse)
invoice_pdfs <- list.files(pattern = "\\.pdf$") # gather all the .pdf files in the current wd (pattern is a regex, not a glob)
invoice_list <- map(invoice_pdfs, .f = function(invoices){ # using the purrr::map function
  pdf_text(invoices) %>% # extract text from the listed pdf file(s)
    readr::read_lines() %>% # split the text into lines
    str_squish() %>% # trim and collapse white space
    str_to_lower() # convert strings to lower case
})
Reproducible example:
invoice_example <- c("invoice",
"to: rade ris",
"cane nompany",
"kakber street 23d",
"nork wey",
"+223 (0)56 015 6542",
"invoice id: 85600023",
"date reference product product reference weigth amount",
"01-02-2016 840000023 product a 24.45.6 de6583621 14.900 kg a 50 per tonne 745,00",
"07-02-2016 840000048 product b 24.45.7 qf8463641 19.000 kg a 50 per tonne 950,00",
"03-02-2016 840000032 product b 24.34.2 qf8463641 4.000 kg per tonne 250,00",
"02-02-2016 840000027 ke7801465 1.780 kg per tonne 89,00",
"subtotal 2.034,00",
"sales tax 183,06",
"total 2.217,06")
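Here is the problem.
What I have tried is using stringr and rebus to select specific parts of the text. I created the following function to search the document for a specific string; it returns the row number: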
word_finder <- function(x, findWord){
word_hit <- x %>% # temp for storing TRUE or FALSE
str_detect(pattern = fixed(findWord))
which(word_hit == TRUE) # give rownumber if TRUE
}
And the following search patterns:
detect_date <- dgt(2) %R% "-" %R% dgt(2) %R% "-" %R% dgt(2)
detect_money <- optional(DIGIT) %R% optional(".") %R% one_or_more(DIGIT) %R% "," %R% dgt(2)
detect_invoice_num <- str_trim(SPC %R% dgt(8) %R% optional(SPC))
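The next step should be to make a tibble (or data frame) with the column names c("date", "reference", "product", "product reference", "weight", "amount").
I have also tried making a tibble out of the whole invoice_example.
The problem is that some fields have missing information, so the column names do not match the corresponding values.
So I would like to write some functions that use the search patterns and put each specific value in a predetermined column. I have no idea how to get this done. Or should I approach this in a completely different way?
The final result should look something like this.
Reproducible example: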
invoice_nr <- c("85600023", "85600023", "85600023", "85600023" )
date <- c( "01-02-2016", "07-02-2016", "03-02-2016", "02-02-2016")
reference <- c( "840000023", "840000048", "840000032", "840000027")
product_id <- c( "de6583621", "qf8463641", "qf8463641", "ke7801465")
weight <- c("14.900", "19.000", "4.000", "1.780")
amount <- c("745.00", "950.00", "250.00", "89.00")
example_tibble <- tibble(invoice_nr, date, reference, product_id, weight, amount)
Result:
# A tibble: 4 x 6
invoice_nr date reference product_id weight amount
<chr> <chr> <chr> <chr> <chr> <chr>
1 85600023 01-02-2016 840000023 de6583621 14.900 745.00
2 85600023 07-02-2016 840000048 qf8463641 19.000 950.00
3 85600023 03-02-2016 840000032 qf8463641 4.000 250.00
4 85600023 02-02-2016 840000027 ke7801465 1.780 89.00
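Any suggested way of dealing with this would be appreciated!
As I am not familiar with rebus, I have rewritten your code. Assuming the invoices have at least somewhat the same structure, I can produce a tibble from your example. You would only need to apply this to your whole list and then purrr::reduce it into one big tibble: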
df <- tibble(date=na.omit(str_extract(invoice_example,"\\d{2}-\\d{2}-\\d{4}")))
df %>% mutate(invoice_nr=na.omit(sub("invoice id: ","",str_extract(invoice_example,"invoice id: [0-9]+"))),
reference=na.omit(sub("\\d{2}-\\d{2}-\\d{4} ","",str_extract(invoice_example,"\\d{2}-\\d{2}-\\d{4} \\d{9}"))),
product_id=na.omit(str_extract(invoice_example,"[:lower:]{2}\\d{7}")),
weight=na.omit(sub(" kg","",str_extract(invoice_example,"[0-9.]+ kg"))),
amount=na.omit(sub("tonne ","",str_extract(invoice_example,"tonne [0-9,]+"))))
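The "apply it to the whole list, then purrr::reduce" step could look roughly like the sketch below. The helper name `extract_invoice`, the reduced set of columns, and the second invoice are made up for illustration and are not part of the original post; the first invoice reuses one line from the question's example.

```r
library(stringr)
library(tibble)
library(purrr)
library(dplyr)

# Hypothetical helper: turns one invoice (a character vector of lines)
# into a tibble with one row per item line.
extract_invoice <- function(lines) {
  # item lines are assumed to start with a dd-mm-yyyy date
  rows <- str_subset(lines, "^\\d{2}-\\d{2}-\\d{4}")
  id_line <- str_subset(lines, "invoice id:")
  tibble(
    invoice_nr = str_extract(id_line, "\\d{8}")[1],
    date       = str_extract(rows, "\\d{2}-\\d{2}-\\d{4}"),
    # capture group 1: the 9 digits right after the leading date
    reference  = str_match(rows, "^\\S+ (\\d{9})")[, 2],
    amount     = str_extract(rows, "[0-9.,]+$")
  )
}

# Two small invoices; the second one is invented for this example.
invoice_list <- list(
  c("invoice id: 85600023",
    "01-02-2016 840000023 product a 24.45.6 de6583621 14.900 kg a 50 per tonne 745,00",
    "total 745,00"),
  c("invoice id: 85600024",
    "05-03-2016 840000101 product b 24.45.7 qf8463641 2.000 kg a 50 per tonne 100,00",
    "06-03-2016 840000102 ke7801465 1.000 kg per tonne 50,00",
    "total 150,00")
)

# One tibble per invoice, then stacked into one big tibble.
all_invoices <- invoice_list %>%
  map(extract_invoice) %>%
  reduce(bind_rows)
```

`bind_rows(map(invoice_list, extract_invoice))` should give the same result without the explicit reduce; which one reads better is a matter of taste.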
Actually, you can use the functions of library(stringr) to achieve your goal (I skipped the rebus part, since it seems to be 'just' a helper for creating the regex anyway, which I did by hand):
library(tidyverse)
parse_invoice <- function(in_text) {
## define regex, some assumptions:
## product id is 2 lower characters followed by 7 digits
## weight is some digits with a dot followed by kg
## amount is some digits at the end with a comma
all_regex <- list(date = "\\d{2}-\\d{2}-\\d{4}",
reference = "\\d{9}",
product_id = "[a-z]{2}\\d{7}",
weight = "\\d+\\.\\d+ kg",
amount = "\\d+,\\d+$")
## look only at lines where there is invoice data
rel_lines <- str_subset(in_text, all_regex$date)
## extract the pieces from the regex
ret <- as_tibble(map(all_regex, str_extract, string = rel_lines))
## clean up the data
ret %>%
mutate(invoice_nr = str_extract(str_subset(in_text, "invoice id:"), "\\d{8}"),
date = as.Date(date, "%d-%m-%Y"),
weight = as.numeric(str_replace(weight, "(\\d+\\.\\d+) kg", "\\1")),
amount = as.numeric(str_replace(amount, ",", "."))
) %>%
select(invoice_nr,
date,
reference,
product_id,
weight,
amount)
}
str(parse_invoice(invoice_example))
# Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 4 obs. of 6 variables:
# $ invoice_nr: chr "85600023" "85600023" "85600023" "85600023"
# $ date : Date, format: "2016-02-01" "2016-02-07" ...
# $ reference : chr "840000023" "840000048" "840000032" "840000027"
# $ product_id: chr "de6583621" "qf8463641" "qf8463641" "ke7801465"
# $ weight : num 14.9 19 4 1.78
# $ amount : num 745 950 250 89
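One point that may matter for the "missing information" problem in the question: `str_extract()` returns `NA` for a line where the pattern does not match, so each extracted vector keeps one entry per invoice line and the columns stay aligned. Dropping the `NA`s with `na.omit()` shortens the vector and loses that alignment. A minimal sketch, where the second input line is a hypothetical item line without a product id (not taken from the original example):

```r
library(stringr)

item_lines <- c(
  "03-02-2016 840000032 product b 24.34.2 qf8463641 4.000 kg per tonne 250,00",
  # hypothetical line with the product id missing
  "02-02-2016 840000027 4.000 kg per tonne 89,00"
)

product_regex <- "[a-z]{2}\\d{7}"

# One result per input line; NA marks the line without a product id.
aligned <- str_extract(item_lines, product_regex)

# Dropping the NA leaves a shorter vector that no longer lines up
# with the other columns.
dropped <- as.vector(na.omit(aligned))

length(aligned) == length(item_lines)  # TRUE
length(dropped) == length(item_lines)  # FALSE
```

Under that assumption, keeping the `NA`s and filtering or filling them after the tibble has been built is the safer order of operations.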