Converting PDF to text with pdftools in R returning empty string
In the following example, the result for every page of the PDF is an empty string.
library(pdftools)
rm(list = ls())
# Work in the directory of the active RStudio document (requires rstudioapi)
setwd(dirname(rstudioapi::getActiveDocumentContext()$path))
url <- "https://reporting.standardbank.com/wp-content/uploads/2022/02/SBS72-Pricing-Supplement.pdf"
destfile <- file.path(getwd(), basename(url))
# mode = "wb" keeps the binary PDF intact on Windows
download.file(url, destfile, mode = "wb")
# Read the downloaded file directly; list.files() could match several PDFs
pdf_text(destfile)
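I'm not sure whether the problem lies in the PDF file itself and the way it was scanned and saved, which would make it unreadable.
Is there a workaround for PDF files like this one, or a better package/library I should consider?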
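My guess is that the problem is that this is a scanned document, so the pages are images with no text layer for pdf_text() to read. You will therefore probably need an OCR tool to extract the text and information from the document. One option is the tesseract package: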
library(tesseract)
url = "https://reporting.standardbank.com/wp-content/uploads/2022/02/SBS72-Pricing-Supplement.pdf"
eng <- tesseract("eng")
text <- tesseract::ocr(url, engine = eng)
#> Converting page 1 to file16a069b77ed2SBS72-Pricing-Supplement_1.png... done!
#> Converting page 2 to file16a069b77ed2SBS72-Pricing-Supplement_2.png... done!
#> Converting page 3 to file16a069b77ed2SBS72-Pricing-Supplement_3.png... done!
#> Converting page 4 to file16a069b77ed2SBS72-Pricing-Supplement_4.png... done!
#> Converting page 5 to file16a069b77ed2SBS72-Pricing-Supplement_5.png... done!
#> Converting page 6 to file16a069b77ed2SBS72-Pricing-Supplement_6.png... done!
#> Converting page 7 to file16a069b77ed2SBS72-Pricing-Supplement_7.png... done!
#> Converting page 8 to file16a069b77ed2SBS72-Pricing-Supplement_8.png... done!
text[[1]]
#> [1] "APPLICABLE PRICING SUPPLEMENT DATED 28 JANUARY 2022\nThe Standard Bank of South Africa Limited\n(dncorporated with limited liability under Registration Number 1962/000738/06\nin the Republic of South Africa)\nIssue of ZAR404,000,000 Senior Unsecured Floating Rate Notes due 02 February 2029\nUnder its ZAR110,000,000,000 Domestic Medium Term Note Programme\nThis document constitutes the Applicable Pricing Supplement relating to the issue of Notes described herein.\nTerms used herein shall be deemed to be defined as such for the purposes of the terms and conditions (the\n“Terms and Conditions\") set forth in the Programme Memorandum dated 24 December 2020 (the \"Programme\nMemorandum\"), as updated and amended from time to time. This Pricing Supplement must be read in\nconjunction with such Programme Memorandum. To the extent that there is any conflict or inconsistency between\nthe contents of this Pricing Supplement and the Programme Memorandum, the provisions of this Pricing\nSupplement shall prevail.\nDESCRIPTION OF THE NOTES\nl. Issuer The Standard Bank of South Africa\nLimited\n2. Debt Officer Amo Daehnke, Group Chief\nFinancial and Value Management\nOfficer of Standard Bank Group\nLimited\n3. Status of the Notes Senior Unsecured\n4. (a) Series Number 72\n(b) Tranche Number ]\n5. Aggregate Nominal Amount ZAR404,000,000\n6. Redemption/Payment Basis N/A\n7. Type of Notes Floating Rate Notes\n8. Interest Payment Basis Floating Rate\n9. Form of Notes Registered Notes\n10. Automatic/Optional Conversion from one Interest/Payment N/A\nBasis to another\nll. Issue Date 2 February 2022\n12. Business Centre Johannesburg\n13. Additional Business Centre N/A\n14. Specified Denomination ZAR]1,000,000\n15. Calculation Amount ZAR1,000,000\n16. Issue Price 100%\n17. Interest Commencement Date 02 February 2022\n18. Maturity Date 02 February 2029\n19. Maturity Period N/A\n1\n"
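If you have a mix of ordinary and scanned PDFs, one possible approach is to try the text layer first and only fall back to OCR when every page comes back blank. The following is a minimal sketch under that assumption, not a definitive implementation: `is_blank_page()` and `extract_pdf_text()` are hypothetical helper names introduced here for illustration, while `pdftools::pdf_text()`, `tesseract::ocr()` and `tesseract::tesseract()` are the same calls used above.

```r
# Hypothetical helper: TRUE for pages whose extracted text is empty
# or consists of whitespace only.
is_blank_page <- function(pages) {
  !nzchar(trimws(pages))
}

# Hypothetical helper: use the embedded text layer when there is one,
# otherwise fall back to OCR (assumes pdftools and tesseract are installed).
extract_pdf_text <- function(path, language = "eng") {
  pages <- pdftools::pdf_text(path)
  if (length(pages) > 0 && !all(is_blank_page(pages))) {
    return(pages)
  }
  # No usable text layer: likely a scanned document, so run OCR instead
  tesseract::ocr(path, engine = tesseract::tesseract(language))
}
```

With the file from the question you would call `extract_pdf_text(destfile)`. Keep in mind that OCR output still needs checking: in the result above, for example, "Incorporated" was read as "dncorporated" and "ZAR1,000,000" as "ZAR]1,000,000".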