storing image (.tif) in np.array through PIL fromarray [TypeError: Cannot handle this data type][ValueError: invalid literal for int()]
Update
When I write (as the answerer suggested)
with open('results.txt', 'a', encoding="utf-8") as f:
    for line in results:
        f.write(line)
        f.write('\n')
all the text snippets are correctly appended to result.txt. But when I go to cmd and run
magick -density 288 text:"result.txt" -alpha off -compress Group4 filename1.tif
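it creates filename1.tif with all the characters of result.txt rendered as an image.
Original question:
This code goes through a folder of single-page .tif files and extracts text data.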
data = []
data1 = []
listOfPages = glob.glob(r"C:/Users/name/folder/*.tif")
for entry in listOfPages:
    if os.path.isfile(entry):
        filenames = entry
        data1.append(filenames)
        text1 = pytesseract.image_to_string(
            Image.open(entry), lang="en"
        )
        text = re.sub(r'\n',' ', text1)
        regex1 = re.compile(r'(www(i|ı)a\s+bbb(\:)?(\s+|\s+\.)?\s+(de(s|r(:)?))?)', flags = re.IGNORECASE)
        try:
            var1a = regex1.search(text)
            if var1a:
                var1 = var1a.group(1)
            else:
                var1 = None
        except:
            pass
        data.append([text, var1])
df0 = pd.DataFrame(data, columns =['raw_text', 'var1'])
df01= pd.DataFrame(data1,columns =['filename'])
df1 = pd.concat([df0, df01], axis=1)
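I want to adapt it so that it also works for multi-page files. So I tried to convert the image through Image.fromarray(), which raises the following errors: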
text1 = pytesseract.image_to_string(np.array(entry), lang="en")
or
text1 = pytesseract.image_to_string(Image.fromarray(np.array(entry)), lang="en")
TypeError: Cannot handle this data type: (1, 1), <U52
I am using Python 3.9.7, pytesseract 0.3.8, numpy 1.21.2, pillow 8.3.2.
I read a related post and came up with this:
text1 = pytesseract.image_to_string(Image.fromarray(np.array(entry * 255).astype(np.uint8)), lang="en")
This gives me the error: ValueError: invalid literal for int() with base 10: 'C:/Users/name/folder/test\fff.tifC:/Users/name/folder/test\ddddd.tif
which suggests that float needs to be used.
But when I do this
text1 = pytesseract.image_to_string(Image.fromarray(np.array(entry * 255).astype(np.float)), lang="en")
I get
ValueError: could not convert string to float: 'C:/Users/name/folder/test\ffff.tif
exiftool output
File Type : TIFF
File Type Extension : tif
MIME Type : image/tiff
Exif Byte Order : Little-endian (Intel, II)
Subfile Type : Full-resolution image
Image Width : 2472
Image Height : 3495
Bits Per Sample : 1
Compression : T6/Group 4 Fax
Photometric Interpretation : BlackIsZero
Thresholding : No dithering or halftoning
Fill Order : Reversed
Image Description : DN31
Camera Model Name : SCA
Strip Offsets : (Binary data 90 bytes, use -b option to extract)
Orientation : Horizontal (normal)
Samples Per Pixel : 1
Rows Per Strip : 213
Strip Byte Counts : (Binary data 73 bytes, use -b option to extract)
X Resolution : 300
Y Resolution : 300
Planar Configuration : Chunky
T6 Options : (none)
Resolution Unit : inches
Software : DACS Toolkit II
Modify Date : 1998:03:12 10:29:31
Image Size : 2472x3495
Megapixels : 8.6
Other suggestions on SO are
im = Image.fromarray((img[0] * 255).astype(np.uint8))
If your image is grayscale, you need to pass PIL a 2-D array, i.e. the shape must be h,w rather than h,w,1.
i = Image.open('image.png').convert('RGB')
a = np.asarray(i, np.uint8)
print(a.shape)
b = abs(np.fft.rfft2(a,axes=(0,1)))
b = np.uint8(b)
j = Image.fromarray(b)
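By default, it uses the last two axes: axes=(-2,-1). The third axis represents the RGB channels. Instead, it seems more reasonable to perform the FFT over the spatial axes, axes=(0,1).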
img = Image.fromarray(data[0][i].transpose(0,2).numpy().astype(np.uint8))
The channel dimension will be last (rather than first).
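All of these suggestions assume the variable already holds pixel data. In the loop above, entry is the file path returned by glob, so it is a plain str: entry * 255 repeats the path 255 times (which would explain the concatenated paths in the ValueError), and casting that text to an integer or float fails. np.array(entry) likewise wraps the string in a Unicode array (dtype <U52 for a 52-character path), which Image.fromarray cannot interpret. A minimal stdlib-only sketch of the first two effects, using a hypothetical path and a hypothetical helper named try_cast:

```python
# Hypothetical path, standing in for one element of listOfPages.
entry = "C:/Users/name/folder/test/page.tif"

# Multiplying a str repeats it; no pixel values are involved.
repeated = entry * 2
print(repeated)


def try_cast(value, caster):
    """Return the error message raised by caster(value), or None on success."""
    try:
        caster(value)
    except ValueError as exc:
        return str(exc)
    return None


# Both casts fail because the value is text, not numbers.
print(try_cast(repeated, int))
print(try_cast(entry, float))
```

Opening the file first with Image.open(entry) gives actual image data to work with instead of the path text.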
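I think you need something more like this to handle multi-page TIFFs. I have tried to improve your variable names, changing them from nondescript names such as data, var, etc. to something more readable.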
#!/usr/bin/env python3
import re
from glob import glob
import pytesseract
from PIL import Image, ImageSequence

def processPage(filename, pageNum, im):
    global results
    print(f'Processing: {filename}, page: {pageNum}')
    text = pytesseract.image_to_string(im, lang="eng")
    srchResult = regex.search(text)
    if srchResult is not None:
        results.append(srchResult.group(0))

# Compile regex just once, outside loop - it doesn't change
regex = re.compile(r'(\w+\s(Queen|President|Washington|London|security|architect)\s\w+)', flags = re.IGNORECASE)
results = []
# Get list of all filenames to be processed
filenames = glob('folder/*.tif')
# Iterate over all files
for filename in filenames:
    print(f'Processing file: {filename}')
    with Image.open(filename) as im:
        for pageNum, page in enumerate(ImageSequence.Iterator(im)):
            processPage(filename, pageNum, page)
print('\n'.join(results))
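The original question also stored the filename and raw text of each image in a DataFrame. One way to keep that association per page is to build one record per page rather than a global list of snippets. The sketch below is only an illustration under assumptions: collect_rows and its ocr parameter are hypothetical names, and ocr stands in for a call such as pytesseract.image_to_string(page, lang="eng") so the logic can be tried without Tesseract installed.

```python
import re


def collect_rows(pages_by_file, ocr, pattern):
    """Build one row per page: filename, page number, raw text, first match.

    pages_by_file maps a filename to an iterable of page objects
    (hypothetical input; with Pillow this could be ImageSequence.Iterator(im)).
    ocr is any callable that turns a page into text.
    """
    rows = []
    for filename, pages in pages_by_file.items():
        for page_num, page in enumerate(pages):
            # Flatten newlines, as the original question did.
            text = ocr(page).replace("\n", " ")
            match = pattern.search(text)
            rows.append({
                "filename": filename,
                "page": page_num,
                "raw_text": text,
                "var1": match.group(0) if match else None,
            })
    return rows


# Stand-in data: strings play the role of pages and ocr is the identity.
pattern = re.compile(r"\w+\s(Queen|President)\s\w+", flags=re.IGNORECASE)
fake_pages = {"a.tif": ["the stricken Queen\nwithdrew", "no keyword here"]}
for row in collect_rows(fake_pages, ocr=lambda page: page, pattern=pattern):
    print(row)
```

A list of dicts like this can be passed directly to pd.DataFrame(rows), which keeps each match on the same row as its filename and page number.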
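You don't need to do the rest of this... it is only here so you can see how I generated the TIFFs for testing...
I made 2 multi-page TIFFs to test it with, from the Wikipedia entries for "Buckingham Palace" and "White House", using ImageMagick. Go to each page, copy the text and save it as a.txt.
Then run: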
magick -density 288 text:"a.txt" -alpha off -compress Group4 WhiteHouse.tif
Sample output
Processing file: folder/Buckingham.tif
Processing: folder/Buckingham.tif, page: 0
Processing: folder/Buckingham.tif, page: 1
Processing: folder/Buckingham.tif, page: 2
Processing file: folder/WhiteHouse.tif
Processing: folder/WhiteHouse.tif, page: 0
Processing: folder/WhiteHouse.tif, page: 1
Processing: folder/WhiteHouse.tif, page: 2
Processing: folder/WhiteHouse.tif, page: 3
Processing: folder/WhiteHouse.tif, page: 4
Processing: folder/WhiteHouse.tif, page: 5
the London residence
stricken Queen withdrew
the president of
George Washington occupied
President Washington
by architect Frederick
in Washington when
House security breaches
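Each page contributes at most one snippet here because regex.search stops at the first match. If every match on a page is wanted, finditer can be used instead. A small sketch with a made-up sentence in place of OCR output (the helper name all_matches is hypothetical):

```python
import re

regex = re.compile(
    r'(\w+\s(Queen|President|Washington|London|security|architect)\s\w+)',
    flags=re.IGNORECASE,
)


def all_matches(text):
    """Return every non-overlapping match in text, not just the first."""
    # group(0) is the whole match; findall would return tuples of groups instead.
    return [m.group(0) for m in regex.finditer(text)]


sample = ("It is the London residence of the monarch. "
          "The stricken Queen withdrew from public life.")
print(all_matches(sample))
```

In processPage, results.extend(all_matches(text)) would then take the place of the single append.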