How to trim (crop) bottom whitespace of a PDF document, in memory

Question

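I am using wkhtmltopdf to render a (Django-templated) HTML document to a single-page PDF file. I would like to either render it at the correct height right away (which I have not managed to do so far) or render it at the wrong height and trim it. I am using Python.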

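Attempt type 1:

Render with wkhtmltopdf using --page-height
Use pdfCropMargins to trim: crop(["-p4", "100", "0", "100", "100", "-a4", "0", "-28", "0", "0", "input.pdf"])

The PDF is rendered perfectly, with a 28-unit margin at the bottom, but I had to use the file system to run the crop command. The tool seems to require an input file and an output file, and it also creates temporary files along the way. So I cannot use it.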

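Attempt type 2:

Render to a multi-page PDF with wkhtmltopdf using default parameters
Use PyPDF4 (or PyPDF2) to read the file and combine the pages into one long single page

In most cases the PDF is rendered fine. However, if the last PDF page happens to contain very little content, a lot of extra white space sometimes shows up at the bottom.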

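Ideal scenario:

The ideal scenario would involve a function that takes HTML and renders it into a single-page PDF with the expected amount of white space at the bottom. I would be happy to render the PDF with wkhtmltopdf, since it returns bytes, and then process those bytes to remove any extra white space. But I do not want to involve the file system; I want to perform all operations in memory. Perhaps I can somehow inspect the PDF directly and remove the white space manually, or do some HTML magic to determine the rendering height beforehand?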

What I am doing now:

Note that pdfkit is a wkhtmltopdf wrapper

# This is not a valid HTML (includes Django-specific stuff)
template: Template = get_template("some-django-template.html")

# This is now valid HTML
rendered = template.render({
    "foo": "bar",
})

# This first renders PDF from HTML normally (multiple pages)
# Then counts how many pages were created and determines the required single-page height
# Then renders a single-page PDF from HTML using the page height and width arguments
return pdfkit.from_string(rendered, options={
    "page-height": f"{297 * PdfFileReader(BytesIO(pdfkit.from_string(rendered))).getNumPages()}mm",
    "page-width": "210mm"
})

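It is equivalent to attempt type 2, except that here I do not use PyPDF4 to stitch the pages together; instead, I render again with wkhtmltopdf using the precomputed page height.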
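
The page-height arithmetic in the snippet above can be pulled out into a small helper, which makes it easier to see why white space remains. This is only a sketch: `single_page_options` is a hypothetical name, and the 297 mm by 210 mm values are the A4 dimensions already used in the question.

```python
A4_HEIGHT_MM = 297  # height of one A4 page, as in the question's snippet
A4_WIDTH_MM = 210   # width of one A4 page


def single_page_options(num_pages: int) -> dict:
    """Build wkhtmltopdf options for one tall page spanning `num_pages` A4 pages.

    Hypothetical helper; the returned dict has the same shape as the
    `options` argument passed to pdfkit.from_string in the question.
    """
    if num_pages < 1:
        raise ValueError("num_pages must be at least 1")
    return {
        "page-height": f"{A4_HEIGHT_MM * num_pages}mm",
        "page-width": f"{A4_WIDTH_MM}mm",
    }


print(single_page_options(3))
# {'page-height': '891mm', 'page-width': '210mm'}
```

Because the height is always a whole multiple of 297 mm, the single page can carry up to almost one full A4 page of blank space whenever the last page of the first render is mostly empty.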

Answer 1

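There may be a better way to do this, but after 3 days of bounty with no answers, this at least works. I am assuming that you are able to crop the PDF yourself, and that all I am doing here is determining how far down the last page the content still extends. If that assumption is wrong, I could probably figure out how to crop the PDF. Alternatively, just crop the image (easy in Pillow) and then convert it to a PDF? Also, if you have one big PDF, you may need to work out how far down the whole PDF the text ends. I am only working out how far down the last page the content ends. But converting from one to the other is a simple arithmetic problem.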
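
The arithmetic mentioned above could look like the following sketch. It assumes every page has the same height and that the fraction measured on the last page is already known; the function name and parameters are hypothetical and do not come from any library.

```python
def content_end_in_document(num_pages, last_page_fraction, page_height_mm=297.0):
    """Convert "how far down the last page" into "how far down the document".

    num_pages:          number of equally tall pages (assumption)
    last_page_fraction: 0.0..1.0, where the content ends on the last page
    Returns (distance from the top in mm, fraction of the total height).
    """
    if num_pages < 1:
        raise ValueError("num_pages must be at least 1")
    if not 0.0 <= last_page_fraction <= 1.0:
        raise ValueError("last_page_fraction must be between 0 and 1")
    # All earlier pages are full; only the last one is partially filled.
    end_mm = (num_pages - 1 + last_page_fraction) * page_height_mm
    total_mm = num_pages * page_height_mm
    return end_mm, end_mm / total_mm


end_mm, fraction = content_end_in_document(3, 0.25)
print(end_mm, fraction)
```

For three A4 pages with content ending a quarter of the way down the third page, the content ends 668.25 mm from the top, which is 75% of the combined height.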

Test code:

import pdfkit
from PyPDF2 import PdfFileReader
from io import BytesIO

# This library isn't named fitz on pypi,
# obtain this library with `pip install PyMuPDF==1.19.4`
import fitz

# `pip install Pillow==8.3.1`
from PIL import Image

import numpy as np

# However you arrive at valid HTML, it makes no difference to the solution.
rendered = "<html><head></head><body><h3>Hello World</h3><p>hello</p></body></html>"

# This first renders PDF from HTML normally (multiple pages)
# Then counts how many pages were created and determines the required single-page height
# Then renders a single-page PDF from HTML using the page height and width arguments
pdf_bytes = pdfkit.from_string(rendered, options={
    "page-height": f"{297 * PdfFileReader(BytesIO(pdfkit.from_string(rendered))).getNumPages()}mm",
    "page-width": "210mm"
})

# convert the pdf into an image.
pdf = fitz.open(stream=pdf_bytes, filetype="pdf")
# Negative indexing avoids the camelCase `pageCount` alias, which newer PyMuPDF releases no longer provide.
last_page = pdf[-1]
matrix = fitz.Matrix(1, 1)
image_pixels = last_page.get_pixmap(matrix=matrix, colorspace="GRAY")

image = Image.frombytes("L", [image_pixels.width, image_pixels.height], image_pixels.samples)

#Uncomment if you want to see.
#image.show()

# Now figure out where the end of the text is:

# First binarize. This might not be the most efficient way to do this.
# But it's how I do it.
THRESHOLD = 100
# I wrote this code ages ago and don't remember the details but
# basically, we treat every pixel > 100 as a white pixel, 
# We convert the result to a true/false matrix 
# And then invert that. 
# The upshot is that, at the end, a value of "True" 
# in the matrix will represent a black pixel in that location.
binary_matrix = np.logical_not(image.point( lambda p: 255 if p > THRESHOLD else 0 ).convert("1"))

# Now find the last row that contains a dark pixel, scanning up from the bottom.
row_count, column_count = binary_matrix.shape

# `last_row` ends up as the number of blank rows below the content.
last_row = 0
for i, row in enumerate(reversed(binary_matrix)):
    if any(row):
        last_row = i
        break

percentage_from_top = (1 - last_row / row_count) * 100
print(percentage_from_top)

# Now you know where the page ends.
# Go back and crop the PDF accordingly.
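
For the final cropping step, the percentage printed above first has to be turned into a rectangle. The sketch below does only that arithmetic. It assumes a coordinate system with the origin at the top-left corner and y growing downward, measured in PDF points, and a 28-point bottom margin that mirrors the "28 units" in the question; `crop_box_points` is a hypothetical helper, not part of PyMuPDF.

```python
def crop_box_points(page_width_pt, page_height_pt, percentage_from_top,
                    bottom_margin_pt=28.0):
    """Return (x0, y0, x1, y1) for a crop box that keeps the content plus a margin.

    percentage_from_top: 0..100, where the content ends (as printed above).
    Coordinates assume a top-left origin with y increasing downward.
    """
    if not 0.0 <= percentage_from_top <= 100.0:
        raise ValueError("percentage_from_top must be between 0 and 100")
    content_bottom = page_height_pt * percentage_from_top / 100.0
    # Never extend the crop box past the original page.
    new_bottom = min(page_height_pt, content_bottom + bottom_margin_pt)
    return (0.0, 0.0, page_width_pt, new_bottom)


print(crop_box_points(595.0, 842.0, 50.0))
# (0.0, 0.0, 595.0, 449.0)
```

With PyMuPDF, a rectangle like this could presumably be applied with `last_page.set_cropbox(fitz.Rect(*box))` and the result serialized with `pdf.tobytes()`, which would keep the whole round trip in memory. Check both calls against the PyMuPDF version you have installed before relying on them.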
