How to convert a PDF to grayscale in Python
Is it possible to convert a PDF file to its grayscale equivalent using a Python library? I tried the ghostscript module:
import locale
from io import BytesIO
import ghostscript as gs
ENCO = locale.getpreferredencoding()
STDOUT = BytesIO()
STDERR = BytesIO()
with open('adob_in.pdf', 'r') as infile:
    ARGS = f"""DUMMY -sOutputFile=adob_out.pdf -sDEVICE=pdfwrite
    -sColorConversionStrategy=Gray -dProcessColorModel=/DeviceGray
    -dNOPAUSE -dBATCH {infile.name}"""
    ARGSB = [arg.encode(ENCO) for arg in ARGS.split()]
    gs.Ghostscript(*ARGSB, stdout=STDOUT, stderr=STDERR)
print(STDOUT.getvalue().decode(ENCO))
print(STDERR.getvalue().decode(ENCO))
The standard output and error streams are:
GPL Ghostscript 9.52 (2020-03-19)
Copyright (C) 2020 Artifex Software, Inc. All rights reserved.
This software is supplied under the GNU AGPLv3 and comes with NO WARRANTY:
see the file COPYING for details.
Processing pages 1 through 1.
Page 1
Unfortunately, the grayscale PDF is corrupted. In fact, debugging it with Ghostscript shows the following errors:
GPL Ghostscript 9.52 (2020-03-19)
Copyright (C) 2020 Artifex Software, Inc. All rights reserved.
This software is supplied under the GNU AGPLv3 and comes with NO WARRANTY:
see the file COPYING for details.
**** Error: Cannot find a 'startxref' anywhere in the file.
Output may be incorrect.
**** Error: An error occurred while reading an XREF table.
**** The file has been damaged. This may have been caused
**** by a problem while converting or transfering the file.
**** Ghostscript will attempt to recover the data.
**** However, the output may be incorrect.
**** Error: Trailer dictionary not found.
Output may be incorrect.
No pages will be processed (FirstPage > LastPage).
**** This file had errors that were repaired or ignored.
**** Please notify the author of the software that produced this
**** file that it does not conform to Adobe's published PDF
**** specification.
**** The rendered output from this file may be incorrect.
GS>
Note that the string ARGS contains valid Ghostscript arguments (tested on the Linux command line with GPL Ghostscript 9.52), and ARGSB is just the corresponding list of byte strings:
print(ARGSB)
[b'DUMMY', b'-sOutputFile=adob_out.pdf', b'-sDEVICE=pdfwrite', b'-sColorConversionStrategy=Gray', b'-dProcessColorModel=/DeviceGray', b'-dNOPAUSE', b'-dBATCH', b'adob_in.pdf']
How can I do this correctly? My sample input and output files can be found here. Thank you very much.
I don't know how to do this with ghostscript, but the following code, which uses pdf2image and img2pdf, does the job:
from os.path import join
from tempfile import TemporaryDirectory
from pdf2image import convert_from_path # https://pypi.org/project/pdf2image/
from img2pdf import convert # https://pypi.org/project/img2pdf/
with TemporaryDirectory() as temp_dir:  # Saves images temporarily on disk rather than in RAM to speed up parsing
    # Converting pages to images
    print("Parsing pages to grayscale images. This may take a while")
    images = convert_from_path(
        "your_pdf_path.pdf",
        output_folder=temp_dir,
        grayscale=True,
        fmt="jpeg",
        thread_count=4
    )

    # Save every page under a predictable name and remember its path
    image_list = list()
    for page_number in range(1, len(images) + 1):
        path = join(temp_dir, "page_" + str(page_number) + ".jpeg")
        image_list.append(path)
        images[page_number-1].save(path, "JPEG")  # (page_number - 1) because index starts from 0

    # Combine the saved JPEGs into one PDF while the temporary files still exist
    with open("Gray_PDF.pdf", "wb") as gray_pdf:
        gray_pdf.write(convert(image_list))
    print("The new PDF is saved as Gray_PDF.pdf in the current directory.")
The PDF containing the grayscale images is saved as Gray_PDF.pdf in the current working directory.
Explanation:
The following code:
with TemporaryDirectory() as temp_dir:  # Saves images temporarily on disk rather than in RAM. This speeds up parsing
    # Converting pages to images
    print("Parsing pages to grayscale images. This may take a while")
    images = convert_from_path(
        "your_pdf_path.pdf",
        output_folder=temp_dir,
        grayscale=True,
        fmt="jpeg",
        thread_count=4
    )
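performs the following tasks:
- Converts the PDF pages to grayscale images.
- Stores them temporarily in a directory.
- Creates a list, images, of PIL image objects.
Next, the following code: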
    image_list = list()
    for page_number in range(1, len(images) + 1):
        path = join(temp_dir, "page_" + str(page_number) + ".jpeg")
        image_list.append(path)
        images[page_number-1].save(path, "JPEG")  # (page_number - 1) because index starts from 0
saves the images again as page_1.jpeg, page_2.jpeg, and so on in the same temporary directory. It also builds a list of the paths of these new images.
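The numbering in that loop can also be written with enumerate, which avoids the page_number - 1 index arithmetic. This is only a minimal sketch: save_pages is a hypothetical helper name that does not appear in the answer, and it assumes each image has a save(path, format) method, as PIL images do.

```python
from os.path import join


def save_pages(images, folder):
    """Save each image as page_<n>.jpeg in folder and return the paths in page order."""
    paths = []
    # enumerate(..., start=1) yields 1-based page numbers alongside each image
    for page_number, image in enumerate(images, start=1):
        path = join(folder, f"page_{page_number}.jpeg")
        image.save(path, "JPEG")
        paths.append(path)
    return paths
```

Inside the with block, image_list = save_pages(images, temp_dir) would then stand in for the loop.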
Finally, the following code:
    with open("Gray_PDF.pdf", "wb") as gray_pdf:
        gray_pdf.write(convert(image_list))
creates a PDF named Gray_PDF.pdf from the grayscale images created earlier and saves it in the working directory.
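If the PIL objects themselves are not needed, the second save might be skipped. pdf2image's convert_from_path accepts paths_only=True, which returns the paths of the files it has already written to output_folder instead of loading them. The following is a sketch under that assumption, not the answer's own method: pdf_to_gray and its threads parameter are hypothetical names, and the two imports sit inside the function only so that the definition loads even where the packages are missing.

```python
from tempfile import TemporaryDirectory


def pdf_to_gray(src_pdf, dst_pdf, threads=4):
    """Rasterize src_pdf to grayscale JPEGs and bundle them into dst_pdf."""
    from pdf2image import convert_from_path  # https://pypi.org/project/pdf2image/
    from img2pdf import convert  # https://pypi.org/project/img2pdf/

    with TemporaryDirectory() as temp_dir:
        # paths_only=True returns file paths instead of loading PIL images
        page_paths = convert_from_path(
            src_pdf,
            output_folder=temp_dir,
            grayscale=True,
            fmt="jpeg",
            thread_count=threads,
            paths_only=True,
        )
        # Build the PDF while the temporary JPEGs still exist
        with open(dst_pdf, "wb") as gray_pdf:
            gray_pdf.write(convert(page_paths))
    return dst_pdf
```

A call such as pdf_to_gray("your_pdf_path.pdf", "Gray_PDF.pdf") would mirror the file names used above. As with the original code, every page ends up as a raster image, so text in the result is no longer selectable.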
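Additional tip: If you want to do more image processing with OpenCV, this approach gives you a lot of flexibility, because all pages are now images. Just make sure that all operations happen inside the first with statement, since the temporary directory and its files are deleted when that block exits. That is the following statement:
with TemporaryDirectory() as temp_dir:  # Saves images temporarily on disk rather than in RAM. This speeds up parsing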
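Separately, for readers who would rather stay with Ghostscript: the question notes that the same arguments work on the Linux command line, so one option is to call the gs executable through subprocess instead of going through the ghostscript module. This is a minimal sketch, assuming gs is on PATH. build_gs_args and convert_to_gray are hypothetical names, and the flags are copied unchanged from the question.

```python
import subprocess


def build_gs_args(src_pdf, dst_pdf):
    """Return the Ghostscript command line from the question as an argument list."""
    return [
        "gs",
        f"-sOutputFile={dst_pdf}",
        "-sDEVICE=pdfwrite",
        "-sColorConversionStrategy=Gray",
        "-dProcessColorModel=/DeviceGray",
        "-dNOPAUSE",
        "-dBATCH",
        src_pdf,
    ]


def convert_to_gray(src_pdf, dst_pdf):
    """Run gs and raise CalledProcessError if it exits with a non-zero status."""
    result = subprocess.run(
        build_gs_args(src_pdf, dst_pdf),
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout
```

With the file names from the question, the call would be convert_to_gray("adob_in.pdf", "adob_out.pdf"). Passing the arguments as a list rather than a single string avoids shell quoting problems with file names that contain spaces.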