python pdf 读取的表格错误编码

python tabula error encoding for pdf read

我想从数据框中的 PDF 导出表格或获取 csv 文件。 但是我无法阅读 Python 的 PDF 文件。我需要做什么? 我试着用 Python 表格阅读 PDF

from tabula import read_pdf

df = read_pdf(name)

我接受:

> pages' argument isn't specified.Will extract only from page 1 by default.
Got stderr: Dec 28, 2021 1:14:07 AM org.apache.pdfbox.pdmodel.font.PDType0Font toUnicode
WARNING: No Unicode mapping for CID+564 (564) in font Calibri,Bold-Identity-H
Dec 28, 2021 1:14:07 AM org.apache.pdfbox.pdmodel.font.PDType0Font toUnicode
WARNING: No Unicode mapping for CID+639 (639) in font Calibri,Bold-Identity-H
Dec 28, 2021 1:14:07 AM org.apache.pdfbox.pdmodel.font.PDType0Font toUnicode
WARNING: No Unicode mapping for CID+632 (632) in font Calibri,Bold-Identity-H
Dec 28, 2021 1:14:07 AM org.apache.pdfbox.pdmodel.font.PDType0Font toUnicode
WARNING: No Unicode mapping for CID+657 (657) in font Calibri,Bold-Identity-H
Dec 28, 2021 1:14:07 AM org.apache.pdfbox.pdmodel.font.PDType0Font toUnicode
WARNING: No Unicode mapping for CID+637 (637) in font Calibri,Bold-Identity-H
Dec 28, 2021 1:14:07 AM org.apache.pdfbox.pdmodel.font.PDType0Font toUnicode
WARNING: No Unicode mapping for CID+656 (656) in font Calibri,Bold-Identity-H
Dec 28, 2021 1:14:07 AM org.apache.pdfbox.pdmodel.font.PDType0Font toUnicode
WARNING: No Unicode mapping for CID+646 (646) in font Calibri,Bold-Identity-H
Dec 28, 2021 1:14:07 AM org.apache.pdfbox.pdmodel.font.PDType0Font toUnicode
WARNING: No Unicode mapping for CID+653 (653) in font Calibri,Bold-Identity-H
Dec 28, 2021 1:14:07 AM org.apache.pdfbox.pdmodel.font.PDType0Font toUnicode
WARNING: No Unicode mapping for CID+635 (635) in font Calibri,Bold-Identity-H
Dec 28, 2021 1:14:07 AM org.apache.pdfbox.pdmodel.font.PDType0Font toUnicode
WARNING: No Unicode mapping for CID+574 (574) in font Calibri,Bold-Identity-H
Dec 28, 2021 1:14:07 AM org.apache.pdfbox.pdmodel.font.PDType0Font toUnicode
WARNING: No Unicode mapping for CID+664 (664) in font Calibri,Bold-Identity-H
Dec 28, 2021 1:14:07 AM org.apache.pdfbox.pdmodel.font.PDType0Font toUnicode
WARNING: No Unicode mapping for CID+631 (631) in font Calibri,Bold-Identity-H
Dec 28, 2021 1:14:07 AM org.apache.pdfbox.pdmodel.font.PDType0Font toUnicode
WARNING: No Unicode mapping for CID+585 (585) in font Calibri,Bold-Identity-H
Dec 28, 2021 1:14:07 AM org.apache.pdfbox.pdmodel.font.PDType0Font toUnicode
WARNING: No Unicode mapping for CID+581 (581) in font Calibri,Bold-Identity-H
Dec 28, 2021 1:14:07 AM org.apache.pdfbox.pdmodel.font.PDType0Font toUnicode
WARNING: No Unicode mapping for CID+570 (570) in font Calibri,Bold-Identity-H
Dec 28, 2021 1:14:07 AM org.apache.pdfbox.pdmodel.font.PDType0Font toUnicode

我在评论中指出 PDF 的内容有误,因为希腊文字编码不正确,

looks like a poor quality PDF from that many warnings Thus I first suggest before investing any more time in tuning for a suspect source you 1st verify that cut and paste that table may in fact result in something worth capturing. My initial assessment is you might get just the second column with numbers and there is some hidden Greek words that are not showing but the rest is garbage, thus the only valid extraction could be by using an OCR method.

因此,首先纠正该 PDF 的最佳方法是使用 OCR,但是许多 OCR 尝试也被 PDF 的现有内容误导。

所以在这种情况下,最好的解决方案是从图像中重新进行 OCR。举个例子,我打印的第一页很糟糕,但是,这是为了证明概念,图像路由可能会让你更接近你的目标。

我目前只有一种方法可以通过 200dpi 传真导出为单色 tiff,使用 grey-scale 作为 .png .pbm 或 .tif[f](不是 jpg)你会得到更好的结果

一旦转换为纯文本 docx 或 xlsx 等,它应该会生成类似这样的内容,请忽略此示例中糟糕的标题,这是在带有圆点背景的单色中使用这种粗略尝试的副产品。

很明显,结果需要一些 clean-up 来匹配输入,例如拼写检查,但应该足以通过任何进一步的方式进行文本处理。更好地选择图像分辨率和目标输出(例如 csv)可能会获得更好的可用结果,从而更接近您的问题的答案。