How to trim (crop) bottom whitespace of a PDF document, in memory
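I am using wkhtmltopdf to render a (Django-templated) HTML document to a single-page PDF file. I would like to either render it immediately with the correct height (which I have failed to do so far) or render it with the wrong height and trim it. I am using Python.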
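Attempt type 1:
- wkhtmltopdf renders to a very, very long single-page PDF with a lot of extra space, using --page-height
- Use pdfCropMargins to trim: crop(["-p4", "100", "0", "100", "100", "-a4", "0", "-28", "0", "0", "input.pdf"])
The PDF is rendered perfectly, with a 28-unit margin at the bottom, but I had to use the file system to execute the crop command. The tool seems to expect an input file and an output file, and it also creates temporary files midway, so I cannot use it.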
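Attempt type 2:
- wkhtmltopdf renders to a multi-page PDF with default parameters
- Use PyPDF4 (or PyPDF2) to read the file and combine the pages into one long single page
The PDF is rendered fine in most cases. However, if the last PDF page happens to have very little content, a lot of extra white space sometimes shows up at the bottom.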
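Ideal scenario:
The ideal scenario would involve a function that takes HTML and renders it into a single-page PDF with the expected amount of white space at the bottom. I would be happy to render the PDF using wkhtmltopdf, since it returns bytes, and then process those bytes to remove any extra white space. But I don't want to involve the file system; I want to perform all operations in memory. Perhaps I can somehow inspect the PDF directly and remove the white space manually, or do some HTML magic to determine the rendering height beforehand?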
What I am doing now:
Note that pdfkit is a wkhtmltopdf wrapper.
# This is not valid HTML (it includes Django-specific stuff)
template: Template = get_template("some-django-template.html")

# This is now valid HTML
rendered = template.render({
    "foo": "bar",
})

# This first renders the PDF from HTML normally (multiple pages),
# then counts how many pages were created and determines the required single-page height,
# then renders a single-page PDF from HTML using the page height and width arguments
return pdfkit.from_string(rendered, options={
    "page-height": f"{297 * PdfFileReader(BytesIO(pdfkit.from_string(rendered))).getNumPages()}mm",
    "page-width": "210mm"
})
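It is equivalent to attempt type 2, except that here I don't use PyPDF4 to stitch the pages together; instead I render again with wkhtmltopdf using the precomputed page height.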
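The two-pass approach described above could be wrapped in a small helper. This is only a sketch: the names `single_page_options` and `render_single_page` are hypothetical, and the renderer and page counter are passed in as callables so that the arithmetic stays separate from pdfkit and PyPDF2.

```python
from typing import Callable, Dict, Optional

A4_HEIGHT_MM = 297
A4_WIDTH_MM = 210

Options = Optional[Dict[str, str]]


def single_page_options(page_count: int) -> Dict[str, str]:
    """Build page-size options for one page as tall as `page_count` A4 pages."""
    if page_count < 1:
        raise ValueError("page_count must be at least 1")
    return {
        "page-height": f"{A4_HEIGHT_MM * page_count}mm",
        "page-width": f"{A4_WIDTH_MM}mm",
    }


def render_single_page(
    html: str,
    render: Callable[[str, Options], bytes],
    count_pages: Callable[[bytes], int],
) -> bytes:
    """Render once with default options, count the pages, then render again
    as a single tall page. Everything stays in memory as bytes."""
    first_pass = render(html, None)
    return render(html, single_page_options(count_pages(first_pass)))
```

With the libraries from the question, `render` could be `lambda html, options: pdfkit.from_string(html, options=options)` and `count_pages` could be `lambda data: PdfFileReader(BytesIO(data)).getNumPages()`. The result still carries the extra white space of a mostly empty last page, so it is a starting point for trimming rather than a complete fix.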
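There is probably a better way to do this, but after 3 days of the bounty with no answers, this at least works.
I am assuming that you are able to crop the PDF yourself, and all I am doing here is determining how far down the last page the content still goes. If that assumption is wrong, I could probably figure out how to crop the PDF. Or alternatively, just crop the image (easy in Pillow) and then convert it to a PDF?
Also, if you have one big PDF, you might need to figure out how far down the whole PDF the text ends. I am only finding out how far down the last page the content ends. But converting from one to the other is a simple arithmetic problem.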
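The arithmetic mentioned above might look like the following sketch. The function name `content_end_in_document` is hypothetical, and it assumes that every page has the same height (A4, 297 mm, as in the question) and that all pages before the last one are full.

```python
def content_end_in_document(
    page_count: int,
    last_page_percentage: float,
    page_height_mm: float = 297.0,
) -> float:
    """Distance in mm from the top of the document to the end of the content.

    `last_page_percentage` is how far down the last page the content ends
    (0 to 100), measured from the top of that page.
    """
    if page_count < 1:
        raise ValueError("page_count must be at least 1")
    if not 0.0 <= last_page_percentage <= 100.0:
        raise ValueError("last_page_percentage must be between 0 and 100")
    full_pages = page_count - 1  # every page before the last one counts in full
    return (full_pages + last_page_percentage / 100.0) * page_height_mm
```

For example, content that ends halfway down the third page of three gives two and a half page heights.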
Tested code:
import pdfkit
from PyPDF2 import PdfFileReader
from io import BytesIO
# This library isn't named fitz on PyPI;
# obtain it with `pip install PyMuPDF==1.19.4`
import fitz
# `pip install Pillow==8.3.1`
from PIL import Image
import numpy as np

# However you arrive at valid HTML, it makes no difference to the solution.
rendered = "<html><head></head><body><h3>Hello World</h3><p>hello</p></body></html>"

# This first renders the PDF from HTML normally (multiple pages),
# then counts how many pages were created and determines the required single-page height,
# then renders a single-page PDF from HTML using the page height and width arguments
pdf_bytes = pdfkit.from_string(rendered, options={
    "page-height": f"{297 * PdfFileReader(BytesIO(pdfkit.from_string(rendered))).getNumPages()}mm",
    "page-width": "210mm"
})

# Convert the last page of the PDF into a grayscale image.
pdf = fitz.open(stream=pdf_bytes, filetype="pdf")
last_page = pdf[-1]
matrix = fitz.Matrix(1, 1)
image_pixels = last_page.get_pixmap(matrix=matrix, colorspace="GRAY")
image = Image.frombytes("L", (image_pixels.width, image_pixels.height), image_pixels.samples)
# Uncomment if you want to see it.
# image.show()

# Now figure out where the end of the text is.
# First binarize. This might not be the most efficient way to do this,
# but it's how I do it.
THRESHOLD = 100
# I wrote this code ages ago and don't remember the details, but
# basically we treat every pixel > 100 as a white pixel,
# convert the result to a true/false matrix,
# and then invert that.
# The upshot is that, at the end, a value of True
# in the matrix represents a black pixel at that location.
binary_matrix = np.logical_not(
    np.asarray(image.point(lambda p: 255 if p > THRESHOLD else 0).convert("1"))
)

# Now find the last row that contains a black pixel, starting at the bottom.
row_count, column_count = binary_matrix.shape
last_row = 0  # number of blank rows below the content
for i, row in enumerate(reversed(binary_matrix)):
    if row.any():
        last_row = i
        break

percentage_from_top = (1 - last_row / row_count) * 100
print(percentage_from_top)
# Now you know where the page ends.
# Go back and crop the PDF accordingly.
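One way the final cropping step might be done in memory with PyMuPDF is sketched below. The helper names are hypothetical. The 28-unit bottom margin comes from the question, and treating it as PDF points is an assumption. The sketch also assumes the page's media box starts at the origin, and it only narrows the page's visible crop box; it does not delete any content.

```python
def cropped_height_pt(
    page_height_pt: float,
    percentage_from_top: float,
    bottom_margin_pt: float = 28.0,
) -> float:
    """New page height: where the content ends plus a margin, capped at the
    original height so the crop box never exceeds the page."""
    content_bottom = page_height_pt * percentage_from_top / 100.0
    return min(page_height_pt, content_bottom + bottom_margin_pt)


def crop_last_page(
    pdf_bytes: bytes,
    percentage_from_top: float,
    bottom_margin_pt: float = 28.0,
) -> bytes:
    """Shrink the visible area of the last page and return the PDF as bytes."""
    import fitz  # PyMuPDF, imported lazily so the helper above has no dependency

    doc = fitz.open(stream=pdf_bytes, filetype="pdf")
    page = doc[-1]
    rect = page.rect  # PyMuPDF coordinates: origin at the top-left corner
    new_height = cropped_height_pt(rect.height, percentage_from_top, bottom_margin_pt)
    page.set_cropbox(fitz.Rect(rect.x0, rect.y0, rect.x1, rect.y0 + new_height))
    return doc.tobytes()  # serialized in memory, no file system involved
```

Calling `crop_last_page(pdf_bytes, percentage_from_top)` with the values computed above would return the trimmed document as bytes.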