Removing newline \n from tesseract return values

Question

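I have a batch of images, each containing a name, which I pass to Pytesseract for recognition. Some of the names are long enough to be written on several lines, so when I recognize them and save the results to a .txt file, each part ends up on its own line.

Here is an example

This is recognized as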

MARTHE
MVUMBI

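whereas I need them on the same line.

Another example

It should be MOHAMED ASSAD YVES, but it is actually stored as:

MOHAMED

ASSAD YVES

I thought I was filtering this kind of thing out, but apparently it doesn't work. Here is the code I am using for recognition, storing, and filtering.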

import glob
import os
import re

import pytesseract

# Adding custom options
folder = r"C:\Users\lenovo\PycharmProjects\SoftOCR_PFE\name_results"
custom_config = r'--oem 3 --psm 6'
words = []
filenames = os.listdir(folder)
filenames.sort()
for directory in filenames:
    print(directory)
    for img in glob.glob(rf"name_results\{directory}\*.png"):
        text = pytesseract.image_to_string(img, config=custom_config)
        words.append(text)
    words.append("\n")
# Keep upper-case entries, skip the NOM/PRENOM labels, trim both ends
all_caps = [s.strip() for s in words if s == s.upper() and s != 'NOM' and s != 'PRENOM']

# Drop empty strings
no_blank = [string for string in all_caps if string != ""]

with open('temp.txt', 'w+') as filehandle:
    for listitem in no_blank:
        filehandle.write(f'{listitem}\n')
with open("temp.txt") as f:
    uncleanText = f.read()
# Remove everything except letters, digits and whitespace
cleanText = re.sub(r'[^A-Za-z0-9\s]+', '', uncleanText)
with open('saved_names.txt', 'w') as f:
    f.write(cleanText)

I had to post this again because my previous question was posted late at night and did not get any response.

Answer 1

I would try adding, right after the line:

text = pytesseract.image_to_string(img, config=custom_config)

this line:

text = text.replace("\n", " ")
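The filter in the question relies on strip(), which only trims the two ends of a string, so a newline in the middle of a name survives. A plain replace of "\n" with a space fixes that but can leave double spaces when the OCR output contains a blank line. As a possible variant (a sketch, not part of the original suggestion), all whitespace can be collapsed in one step. The helper name flatten_ocr_text and the sample strings are made up for illustration; they only imitate the shape of the output described in the question.

```python
def flatten_ocr_text(text: str) -> str:
    """Collapse every run of whitespace (newlines, tabs, form feeds) into one space."""
    # str.split() with no arguments splits on any whitespace and drops empty
    # pieces, so leading and trailing whitespace disappears as well.
    return " ".join(text.split())


# Hypothetical raw strings shaped like the OCR output in the question.
samples = ["MARTHE\nMVUMBI\n", "MOHAMED\n\nASSAD YVES\n\x0c"]

# strip() alone leaves the inner newline in place.
stripped_only = samples[0].strip()

cleaned = [flatten_ocr_text(s) for s in samples]
print(cleaned)
```

In the loop from the question, this would be applied to text right after the image_to_string call, before appending to words.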

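Update

There was one more question: how to join each pair of lines in the file with , and save them back to the file. It can be done like this: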

with open("temp.txt", "r") as f:
    names = f.readlines()

names = [n.replace("\n", "") for n in names]
names = [", ".join(names[i:i+2]) for i in range(0, len(names), 2)]

with open("temp.txt", "w") as f:
    f.write("\n".join(names))
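The slicing step above can be tried without touching any file. The following is a small sketch of the same pairing logic on an in-memory list; the function name join_pairs and the sample names (including the unpaired last item) are hypothetical.

```python
def join_pairs(lines, sep=", "):
    """Join consecutive items two at a time; a trailing odd item is kept alone."""
    # lines[i:i + 2] yields a one-element slice at the end of an odd-length
    # list, and sep.join() on a single item returns that item unchanged.
    return [sep.join(lines[i:i + 2]) for i in range(0, len(lines), 2)]


# Hypothetical list of lines, as read from temp.txt with newlines removed.
example = ["MARTHE", "MVUMBI", "MOHAMED", "ASSAD YVES", "ALONE"]
paired = join_pairs(example)
print(paired)
```

Note that this pairs lines purely by position, so it only gives sensible results if every name really occupies exactly two lines in the file.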


python

ocr

post-processing

python-tesseract