Count number of images in PDF with Python
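I'm trying to count the number of images in a PDF with Python and write the results to a csv file. Ideally, I'd like to produce a csv with one column for the file and one column per page holding the number of images on that page. A column with the file name and a column with the total number of images in the document would be sufficient, though.
I have tried: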
import fitz
import io
from PIL import Image
import csv
with open(r'output.csv', 'x', newline='', encoding='utf-8') as csvfile:
    # Declaring the writer
    propertyWriter = csv.writer(csvfile, quoting=csv.QUOTE_ALL)
    # Writing the headers
    propertyWriter.writerow(['file', 'results', 'error'])
    for file in pdfs:
        # open the file
        pdf_file = fitz.open(file)
        # printing number of images found in this page
        if image_list:
            results = len(image_list[0])
            error = ""
            #print(results)
            #results = str(f"+ Found a total of {len(image_list)} images in page {page_index}")
        else:
            error = str("! No images found on page", page_index)
        propertyWriter.writerow([file, results, error])
Reference: https://www.geeksforgeeks.org/how-to-extract-images-from-pdf-in-python/
However, this reports 9 images in every PDF, which is not the case.
Then I tried:
import fitz
import csv
with open(r'output.csv', 'x', newline='', encoding='utf-8') as csvfile:
    # Declaring the writer
    propertyWriter = csv.writer(csvfile, quoting=csv.QUOTE_ALL)
    # Writing the headers
    propertyWriter.writerow(['file', 'results'])
    for file in pdfs[0:5]:
        for i in range(len(doc)):
            for img in doc.getPageImageList(i):
                xref = img[0]
                pix = fitz.Pixmap(doc, xref)
                results = str(pix)
        propertyWriter.writerow([file, results])
Reference: Extract images from PDF without resampling, in python?
But this again reports the same number of images in every PDF, which is not the case.
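I tried the first reference you mentioned (https://www.geeksforgeeks.org/how-to-extract-images-from-pdf-in-python/) and the code on that page works fine. What goes wrong for you? It counts the images on each page of the PDF, so you only need to add those counts together for each PDF.
If you put the following inside a for loop over your files, it should do what you want: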
import fitz  # PyMuPDF

file = "doctest.pdf"
pdf_file = fitz.open(file)
results = 0
for page_index in range(len(pdf_file)):
    # get_images() is the current name; older PyMuPDF releases called it getImageList()
    image_list = pdf_file[page_index].get_images()
    # add the number of images found on this page to the running total
    results += len(image_list)
print("Total images in this PDF: ", results)
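To tie this back to the csv output asked for in the question, here is a minimal sketch that runs the same per-page count over several files and writes one row per PDF. The function names `count_images_per_page` and `write_image_counts`, the `counter` parameter, and the column layout (`file`, `total`, `error`, then `page_1`, `page_2`, ...) are my own assumptions, not anything from the question or from PyMuPDF. It also opens the csv with mode `"w"` rather than the `"x"` used in the question, so an existing output file is overwritten instead of raising an error.

```python
import csv


def count_images_per_page(pdf_path):
    """Return one image count per page of the PDF at pdf_path."""
    import fitz  # PyMuPDF; imported here so the csv helper works without it

    with fitz.open(pdf_path) as doc:
        # get_images() lists the images referenced by a page
        return [len(page.get_images()) for page in doc]


def write_image_counts(pdf_paths, csv_path, counter=count_images_per_page):
    """Write one row per PDF: file name, total, error text, then per-page counts."""
    rows = []
    for path in pdf_paths:
        try:
            rows.append((str(path), counter(path), ""))
        except Exception as exc:  # keep going if one file cannot be read
            rows.append((str(path), [], str(exc)))

    # PDFs differ in length, so size the header for the longest document
    max_pages = max((len(per_page) for _, per_page, _ in rows), default=0)
    header = ["file", "total", "error"] + [f"page_{i + 1}" for i in range(max_pages)]

    with open(csv_path, "w", newline="", encoding="utf-8") as csvfile:
        writer = csv.writer(csvfile, quoting=csv.QUOTE_ALL)
        writer.writerow(header)
        for name, per_page, error in rows:
            # pad shorter documents with empty cells so every row has the same width
            padding = [""] * (max_pages - len(per_page))
            writer.writerow([name, sum(per_page), error] + per_page + padding)
    return rows
```

With the `pdfs` list from the question, the call would be `write_image_counts(pdfs, "output.csv")`. The important difference from the attempts in the question is that the count is recomputed for every file inside the loop, so no value is carried over from a previously opened PDF.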