storing image (.tif) in np.array through PIL fromarray [TypeError: Cannot handle this data type][ValueError: invalid literal for int()]


Update: when I write the results to a file (as the answerer suggested)

with open('results.txt', 'a', encoding="utf-8") as f:
    for line in results:
        f.write(line)
        f.write('\n')

all the text fragments are appended to result.txt correctly. But when I go to cmd and run magick -density 288 text:"result.txt" -alpha off -compress Group4 filename1.tif, it creates filename1.tif as an image containing all the characters of result.txt.


Original question: this code goes through a folder of single-page .tif files and extracts the text data.

import glob
import os
import re

import pandas as pd
import pytesseract
from PIL import Image

# Compile the pattern once, outside the loop
regex1 = re.compile(r'(www(i|ı)a\s+bbb(\:)?(\s+|\s+\.)?\s+(de(s|r(:)?))?)', flags=re.IGNORECASE)

data = []
data1 = []
listOfPages = glob.glob(r"C:/Users/name/folder/*.tif")
for entry in listOfPages:
    # Skip anything that is not a regular file
    if not os.path.isfile(entry):
        continue
    data1.append(entry)
    # Tesseract language codes have three letters: "eng", not "en"
    text1 = pytesseract.image_to_string(Image.open(entry), lang="eng")
    text = re.sub(r'\n', ' ', text1)

    # Keep the first match, or None when the pattern is not found
    var1a = regex1.search(text)
    var1 = var1a.group(1) if var1a else None

    data.append([text, var1])

df0 = pd.DataFrame(data, columns=['raw_text', 'var1'])
df01 = pd.DataFrame(data1, columns=['filename'])
df1 = pd.concat([df0, df01], axis=1)

I want to adapt it so that it also works for multi-page files. So I tried to convert the input via Image.fromarray(), which raises the following error:

text1 = pytesseract.image_to_string(np.array(entry), lang="en")

or

text1 = pytesseract.image_to_string(Image.fromarray(np.array(entry)), lang="en")

TypeError: Cannot handle this data type: (1, 1), <U52

I am using python 3.9.7, pytesseract 0.3.8, numpy 1.21.2 and pillow 8.3.2. I read this and came up with this:

text1 = pytesseract.image_to_string(Image.fromarray(np.array(entry * 255).astype(np.uint8)), lang="en")

This gives me the error: ValueError: invalid literal for int() with base 10: 'C:/Users/name/folder/test\fff.tifC:/Users/name/folder/test\ddddd.tif

which suggests that a float is needed.

But when I do this

text1 = pytesseract.image_to_string(Image.fromarray(np.array(entry * 255).astype(np.float)), lang="en")

I get

ValueError: could not convert string to float: 'C:/Users/name/folder/test\ffff.tif

exiftool output

File Type                       : TIFF
File Type Extension             : tif
MIME Type                       : image/tiff
Exif Byte Order                 : Little-endian (Intel, II)
Subfile Type                    : Full-resolution image
Image Width                     : 2472
Image Height                    : 3495
Bits Per Sample                 : 1
Compression                     : T6/Group 4 Fax
Photometric Interpretation      : BlackIsZero
Thresholding                    : No dithering or halftoning
Fill Order                      : Reversed
Image Description               : DN31
Camera Model Name               : SCA
Strip Offsets                   : (Binary data 90 bytes, use -b option to extract)
Orientation                     : Horizontal (normal)
Samples Per Pixel               : 1
Rows Per Strip                  : 213
Strip Byte Counts               : (Binary data 73 bytes, use -b option to extract)
X Resolution                    : 300
Y Resolution                    : 300
Planar Configuration            : Chunky
T6 Options                      : (none)
Resolution Unit                 : inches
Software                        : DACS Toolkit II
Modify Date                     : 1998:03:12 10:29:31
Image Size                      : 2472x3495
Megapixels                      : 8.6

Other suggestions on SO are

im = Image.fromarray((img[0] * 255).astype(np.uint8))

If your image is greyscale, you need to pass PIL a 2-D array, i.e. the shape must be h,w rather than h,w,1.
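To illustrate the shape point, here is a small sketch with a made-up array (img is random data standing in for a greyscale image with values in 0..1, not anything from the question):

```python
import numpy as np

# Hypothetical greyscale data in the range 0..1 with a trailing channel axis
img = np.random.default_rng(0).random((4, 6, 1))
print(img.shape)          # (4, 6, 1)

# Drop the length-1 channel axis so the shape is (h, w), then scale to 8-bit
img2d = (img[..., 0] * 255).astype(np.uint8)
print(img2d.shape)        # (4, 6)
print(img2d.dtype)        # uint8
```

A 2-D uint8 array like img2d is the layout Image.fromarray() accepts for an 8-bit greyscale image.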

i = Image.open('image.png').convert('RGB')
a = np.asarray(i, np.uint8)
print(a.shape)

b = abs(np.fft.rfft2(a,axes=(0,1)))
b = np.uint8(b)
j = Image.fromarray(b)

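By default, it uses the last two axes: axes=(-2,-1). The third axis holds the RGB channels. It seems more reasonable that one would want to perform the FFT over the spatial axes instead, axes=(0,1).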
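The difference between the two axis choices shows up in the output shapes. A quick sketch on a made-up array (the 4x6x3 size is arbitrary):

```python
import numpy as np

# Made-up RGB-shaped array: height 4, width 6, 3 channels
a = np.zeros((4, 6, 3))

default_axes = np.fft.rfft2(a)                # transforms axes (-2, -1): width and channels
spatial_axes = np.fft.rfft2(a, axes=(0, 1))   # transforms axes (0, 1): height and width

print(default_axes.shape)   # (4, 6, 2)
print(spatial_axes.shape)   # (4, 4, 3)
```

With axes=(0,1) the channel axis keeps its length of 3, so each colour channel is transformed on its own.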

img = Image.fromarray(data[0][i].transpose(0,2).numpy().astype(np.uint8))

The channel dimension will then be last (rather than first).
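A note on why the attempts above fail, as far as I can tell: inside the loop, entry is a filename (a str), not image data. np.array(entry) therefore builds a 0-d array of Unicode text (the <U52 in the TypeError is NumPy's dtype for a 52-character string). Likewise entry * 255 is plain Python string repetition, which would explain why the ValueError shows the path glued to itself. A minimal pure-Python sketch; the path is a placeholder and try_convert is a hypothetical helper:

```python
# Placeholder path, standing in for one item returned by glob.glob()
entry = r"C:/Users/name/folder/test\fff.tif"

# Multiplying a str by an int repeats the text; it does not scale pixel values
repeated = entry * 3
print(repeated == entry + entry + entry)   # True

def try_convert(value, converter):
    """Return the name of the exception raised by converter(value), or None."""
    try:
        converter(value)
    except (TypeError, ValueError) as exc:
        return type(exc).__name__
    return None

# Asking for an integer or float version of a path fails either way
print(try_convert(entry, int))    # ValueError
print(try_convert(entry, float))  # ValueError
```

So the suggestions about array shapes and dtypes only become relevant once the file has actually been opened with Image.open().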

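I think you need something more like this to handle multi-page TIFFs. I have tried to improve your variable names, changing nondescript ones such as data and var to something more readable.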

#!/usr/bin/env python3

import re
from glob import glob
import pytesseract
from PIL import Image, ImageSequence

def processPage(filename, pageNum, im):
    global results
    print(f'Processing: {filename}, page: {pageNum}')

    text = pytesseract.image_to_string(im, lang="eng")
    srchResult = regex.search(text)
    if srchResult is not None:
        results.append(srchResult.group(0))

# Compile regex just once, outside loop - it doesn't change
regex = re.compile(r'(\w+\s(Queen|President|Washington|London|security|architect)\s\w+)', flags = re.IGNORECASE)

results = []

# Get list of all filenames to be processed
filenames = glob('folder/*.tif')

# Iterate over all files
for filename in filenames:
    print(f'Processing file: {filename}')
    with Image.open(filename) as im:
        for pageNum, page in enumerate(ImageSequence.Iterator(im)):
            processPage(filename, pageNum, page)

print('\n'.join(results))
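If you still want the per-file table from the question, one option is to collect one row per page instead of a flat list of matches. This is only a sketch: ocr_pages and build_rows are hypothetical names, the sample data is made up so the row-building step can be tried without any TIFFs, and the Pillow and pytesseract imports are kept inside ocr_pages for the same reason.

```python
import re
from glob import glob

def ocr_pages(pattern):
    """Yield (filename, page_number, text) for every page of every matching TIFF."""
    # Imported here so the rest of the sketch runs without Pillow/pytesseract
    import pytesseract
    from PIL import Image, ImageSequence

    for filename in sorted(glob(pattern)):
        with Image.open(filename) as im:
            for page_num, page in enumerate(ImageSequence.Iterator(im)):
                yield filename, page_num, pytesseract.image_to_string(page, lang="eng")

def build_rows(pages, regex):
    """Turn (filename, page, text) tuples into dicts holding the first regex match."""
    rows = []
    for filename, page_num, text in pages:
        flat = text.replace("\n", " ")
        match = regex.search(flat)
        rows.append({
            "filename": filename,
            "page": page_num,
            "raw_text": flat,
            "var1": match.group(1) if match else None,
        })
    return rows

# Made-up stand-in for OCR output
sample_pages = [
    ("folder/WhiteHouse.tif", 0, "the president\nof the United States"),
    ("folder/WhiteHouse.tif", 1, "nothing relevant here"),
]
sample_regex = re.compile(r"(\w+\s(president|queen)\s\w+)", flags=re.IGNORECASE)
rows = build_rows(sample_pages, sample_regex)
print(rows[0]["var1"])   # the president of
print(rows[1]["var1"])   # None
```

With real files, pd.DataFrame(build_rows(ocr_pages('folder/*.tif'), regex)) should give one row per page, with the filename and page number alongside the text.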

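You don't need to do the rest of this... it is only here so you can see how I generated the TIFFs to test with...

I made 2 multi-page TIFFs to test it with, from the Wikipedia entries for "Buckingham Palace" and "White House", using ImageMagick. I went to each page, copied the text, saved it as a.txt and then ran: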

magick -density 288 text:"a.txt" -alpha off -compress Group4 WhiteHouse.tif

Sample output

Processing file: folder/Buckingham.tif
Processing: folder/Buckingham.tif, page: 0
Processing: folder/Buckingham.tif, page: 1
Processing: folder/Buckingham.tif, page: 2
Processing file: folder/WhiteHouse.tif
Processing: folder/WhiteHouse.tif, page: 0
Processing: folder/WhiteHouse.tif, page: 1
Processing: folder/WhiteHouse.tif, page: 2
Processing: folder/WhiteHouse.tif, page: 3
Processing: folder/WhiteHouse.tif, page: 4
Processing: folder/WhiteHouse.tif, page: 5
the London residence
stricken Queen withdrew
the president of
George Washington occupied
President Washington
by architect Frederick
in Washington when
House security breaches