Write OCR-retrieved text from each image to a separate text file corresponding to each image

Question

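I'm reading a PDF file, converting each page to an image, and saving it. Next I need to run OCR on each image, recognize its text, and write that text to a new text file.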

I know how to get all the text from all the images and dump it into a single text file.

pdf_dir = 'dir path'
os.chdir(pdf_dir)

for pdf_file in os.listdir(pdf_dir):
    if pdf_file.endswith(".pdf"):
        pages = convert_from_path(pdf_file, 300)
        pdf_file = pdf_file[:-4]
        for page in pages:
            page.save("%s-page%d.jpg" % (pdf_file,pages.index(page)), "JPEG") 

img_dir = 'dir path'
os.chdir(img_dir)

docs = []

for img_file in os.listdir(img_dir):
    if img_file.endswith(".jpg"):
        texts = str(((pytesseract.image_to_string(Image.open(img_file)))))
        text = texts.replace('-\n', '')  
        print(texts)
        img_file = img_file[:-4]
        for text in texts:
            file = img_file + ".txt"
#          create the new file with "w+" as open it
            with open(file, "w+") as f:
                for texts in docs:
                # write each element in my_list to file
                    f.write("%s" % str(texts))
                    print(file)

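I need to write one text file per image, containing the text recognized in that image. Currently the files that get written are all empty, and I'm not sure what's going wrong. Can anyone help?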

Answer 1

There's a lot to unpack here:

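1. You're iterating over docs, which is an empty list, to write the text files. As a result each text file is merely created (empty) and f.write is never executed.
2. You assign text = texts.replace('-\n', '') but never do anything with it. Instead you iterate with for text in texts, so inside that loop text is not the result of replace but an item of the iterable texts.
3. Because texts is a str, each text in texts is a single character.
4. You then reuse texts (also assigned earlier) as the loop variable over docs (which, again, is empty).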

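2 and 4 aren't necessarily problems, but they're probably not good practice. 1 appears to be the main reason you're producing empty text files. 3 also looks like a logic error, since you almost certainly don't want to write individual characters to the file.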
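Points 1 and 3 can be reproduced without any OCR at all. A minimal sketch, where the string texts and the file name demo.txt are made-up stand-ins for the OCR output and the per-image text file:

```python
import os
import tempfile

texts = "Hello\nworld"   # stand-in for the OCR result of one image
docs = []                # never appended to, exactly as in the question

# Point 3: iterating over a str yields single characters, not lines.
chars = [text for text in texts]

# Point 1: opening with "w+" creates the file, but the loop body
# never runs because docs is empty, so nothing is written.
path = os.path.join(tempfile.mkdtemp(), "demo.txt")
writes = 0
with open(path, "w+") as f:
    for item in docs:
        f.write(str(item))
        writes += 1

size = os.path.getsize(path)
print(chars[:5], writes, size)  # ['H', 'e', 'l', 'l', 'o'] 0 0
```

The file exists afterwards but has size 0, which matches the empty files described in the question.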

So I think this is what you want, but it's untested:

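import os

import pytesseract
from PIL import Image

# Assumes img_dir is defined and os.chdir(img_dir) was called, as in the question.
for img_file in os.listdir(img_dir):
    if img_file.endswith(".jpg"):
        # image_to_string already returns a str, so no extra str() is needed
        texts = pytesseract.image_to_string(Image.open(img_file))
        print(texts)
        file = img_file[:-4] + ".txt"
        # create (or overwrite) the text file for this image and write the text once
        with open(file, "w") as f:
            f.write(texts)
        print(file)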
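A slightly more defensive variant of the same idea, offered as a sketch rather than a tested solution. The names write_text_per_image and default_ocr are hypothetical. The OCR step is passed in as a function so the file handling can be tried without Tesseract installed, and the replace('-\n', '') step from the question is kept to re-join words hyphenated across line breaks:

```python
from pathlib import Path


def default_ocr(image_path):
    # Imported lazily so the rest of the sketch runs without Tesseract installed.
    import pytesseract
    from PIL import Image

    with Image.open(image_path) as img:
        return pytesseract.image_to_string(img)


def write_text_per_image(img_dir, ocr=default_ocr):
    """Write one .txt file next to every .jpg in img_dir; return the paths written."""
    written = []
    for img_path in sorted(Path(img_dir).glob("*.jpg")):
        # Re-join words that were hyphenated across a line break.
        text = ocr(img_path).replace("-\n", "")
        txt_path = img_path.with_suffix(".txt")
        txt_path.write_text(text, encoding="utf-8")
        written.append(txt_path)
    return written
```

Calling write_text_per_image('dir path') would then produce, for example, doc-page0.txt next to doc-page0.jpg, without depending on os.chdir.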

python

ocr

tesseract