Python: read part of a PDF page
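I'm trying to read a PDF file in which each page is divided into a 3x3 grid of information blocks of the form
A | B | C
D | E | F
G | H | I
Each entry is broken into multiple lines. A simplified example of one entry is this card, but there would be similar entries in the other 8 slots.
I've looked at pdfminer and pypdf2. I haven't found pdfminer particularly useful, but pypdf2 has given me something close.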
import PyPDF2

def getPDFContent(path):
    content = ""
    # PdfFileReader, getNumPages, getPage and extractText are the legacy
    # PyPDF2 (< 3.0) names
    with open(path, "rb") as p:
        pdf = PyPDF2.PdfFileReader(p)
        numPages = pdf.getNumPages()
        for i in range(numPages):
            content += pdf.getPage(i).extractText() + "\n"
    # Replace non-breaking spaces and collapse whitespace runs into single spaces
    content = " ".join(content.replace(u"\xa0", " ").strip().split())
    return content
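However, this just reads through the file line by line. I'd like a solution that lets me read only a portion of the page, so that I can read A, then B, then C, and so on. Also, the answer here works fairly well, but the order of the columns routinely gets distorted and I've only been able to read line by line.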
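Since you're using pdfminer and pypdf2, I'm assuming the PDF files in question are generated PDFs rather than scanned ones (as in the example you gave). If you know the sizes of the columns and rows in inches, you can use minecart (full disclosure: I wrote minecart). Example code: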
import minecart

# minecart units are 1/72 inch, measured from bottom-left of the page
ROW_BORDERS = (
    72 * 7,  # Top row ends 7 inches from the bottom of the page
    72 * 5,  # Top row starts (middle row ends) 5 inches from the bottom
    72 * 3,  # Middle row starts (bottom row ends) 3 inches from the bottom
    72 * 1,  # Bottom row starts 1 inch from the bottom of the page
)  # listed top-down so that BOXES is ordered A, B, C, ..., I
COLUMN_BORDERS = (
    72 * 2,  # First col starts 2 inches from the left of the page
    72 * 4,  # Second col starts 4 inches from the left of the page
    72 * 6,  # Third col starts 6 inches from the left of the page
    72 * 8,  # Third col ends 8 inches from the left of the page
)
# One (left, bottom, right, top) box per cell, row by row from the top
BOXES = [
    (left, bot, right, top)
    for top, bot in zip(ROW_BORDERS, ROW_BORDERS[1:])
    for left, right in zip(COLUMN_BORDERS, COLUMN_BORDERS[1:])
]

def extract_output(page):
    """
    Reads the text from page and splits it into the 9 cells.

    Returns a list with 9 entries:

        [A, B, C, D, E, F, G, H, I]

    Each item in the list contains a string with all of the
    text found in the cell.
    """
    res = []
    for box in BOXES:
        strings = list(page.letterings.iter_in_bbox(box))
        # We sort from top-to-bottom and then from left-to-right, based
        # on the strings' top left corner
        strings.sort(key=lambda x: (-x.bbox[3], x.bbox[0]))
        res.append(" ".join(strings).replace(u"\xa0", " ").strip())
    return res

content = []
with open("path/to/pdf-doc.pdf", "rb") as pdf_file:
    doc = minecart.Document(pdf_file)
    # Keep the file open while iterating over the pages
    for page in doc.iter_pages():
        content.append(extract_output(page))
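
The box arithmetic itself does not depend on minecart, so it can be checked in isolation. The following is a minimal sketch, assuming the text fragments are already available as (text, x, y) tuples in PDF points; that input format and the helper names grid_boxes and split_into_cells are made up for illustration and are not part of minecart. It builds the nine boxes from borders given in inches and groups the fragments per cell in reading order.

```python
POINTS_PER_INCH = 72  # same unit as above: 1/72 inch from the bottom-left corner


def grid_boxes(row_borders_in, col_borders_in):
    """Return (left, bottom, right, top) boxes in reading order.

    row_borders_in: row edges in inches from the page bottom, top edge first.
    col_borders_in: column edges in inches from the page left, left edge first.
    """
    rows = [POINTS_PER_INCH * r for r in row_borders_in]
    cols = [POINTS_PER_INCH * c for c in col_borders_in]
    return [
        (left, bottom, right, top)
        for top, bottom in zip(rows, rows[1:])
        for left, right in zip(cols, cols[1:])
    ]


def split_into_cells(fragments, boxes):
    """Group hypothetical (text, x, y) fragments by the box containing (x, y).

    Within a cell, fragments are joined top-to-bottom, then left-to-right.
    Fragments that fall outside every box are dropped.
    """
    cells = [[] for _ in boxes]
    for text, x, y in fragments:
        for index, (left, bottom, right, top) in enumerate(boxes):
            if left <= x < right and bottom <= y < top:
                cells[index].append((text, x, y))
                break
    return [
        " ".join(t for t, _, _ in sorted(cell, key=lambda f: (-f[2], f[1])))
        for cell in cells
    ]


boxes = grid_boxes((7, 5, 3, 1), (2, 4, 6, 8))
fragments = [
    ("line two", 150, 450),  # cell A, lower line
    ("Card A", 150, 480),    # cell A, upper line
    ("Card B", 300, 480),    # cell B
    ("Card I", 450, 100),    # cell I
]
cells = split_into_cells(fragments, boxes)
```

Here cells[0] holds the text of block A with its two lines in top-to-bottom order, and cells[8] holds block I. Listing the row borders from the top down is what makes the boxes come out in A to I order.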