tesseract Image.open for multi-page files
This code goes through a folder of single-page .tif files and extracts the text data.
import glob
import os
import re

import pandas as pd
import pytesseract
from PIL import Image

data = []
data1 = []
# Compile the pattern once, outside the loop
regex1 = re.compile(r'(www(i|ı)a\s+bbb(\:)?(\s+|\s+\.)?\s+(de(s|r(:)?))?)', flags=re.IGNORECASE)

listOfPages = glob.glob(r"C:/Users/name/folder/*.tif")
for entry in listOfPages:
    if os.path.isfile(entry):
        filenames = entry
        data1.append(filenames)
        # "eng" is the Tesseract language code for English
        text1 = pytesseract.image_to_string(
            Image.open(entry), lang="eng"
        )
        text = re.sub(r'\n', ' ', text1)
        # search() returns None when nothing matches, so no try/except is needed
        var1a = regex1.search(text)
        var1 = var1a.group(1) if var1a else None
        data.append([text, var1])

df0 = pd.DataFrame(data, columns=['raw_text', 'var1'])
df01 = pd.DataFrame(data1, columns=['filename'])
df1 = pd.concat([df0, df01], axis=1)
df1 = df1.reset_index(drop=True)
If I eventually add multi-page .tif files to this folder, how can I make this work? I could not manage to convert the Image.open(entry) part into something like this:
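# The function name is a placeholder; the original fragment only showed the body
def read_tiff(path):
    img = Image.open(path)
    images = []
    for i in range(img.n_frames):
        img.seek(i)  # move to frame i of the multi-page file
        images.append(np.array(img))
    return np.array(images)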
- You can pass an np.array to the image_to_string method. Pytesseract handles the conversion itself (see https://github.com/madmaze/pytesseract/blob/master/pytesseract/pytesseract.py#L168):
  text1 = pytesseract.image_to_string(np.array(img), lang="eng")
- Or create an image from the array rather than from the file before passing it to pytesseract:
  text1 = pytesseract.image_to_string(Image.fromarray(np.array(img)), lang="eng")
Here is a complete example (without the loop and further processing):
import numpy as np
from PIL import Image
import pytesseract
tif = Image.open('multipage_tif_example.tif')
tif.seek(0)
img_page1 = np.array(tif)
# Variant 1
text1 = pytesseract.image_to_string(img_page1, lang="eng")
# Variant 2
text1 = pytesseract.image_to_string(Image.fromarray(img_page1), lang="eng")
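The example above reads only the first page. To cover every page, one option is to iterate over the frames with Pillow's ImageSequence.Iterator, which also works for single-page files (they yield a single frame). The following is a minimal sketch, not a tested drop-in: the function name ocr_all_pages and the ocr parameter are assumptions made for illustration. The ocr hook defaults to pytesseract.image_to_string and exists only so the loop can be exercised without Tesseract installed.

```python
from PIL import Image, ImageSequence


def ocr_all_pages(path, ocr=None, lang="eng"):
    """Return one OCR string per frame of the image at `path`.

    `ocr` is a hypothetical hook: any callable taking (image, lang=...).
    It defaults to pytesseract.image_to_string.
    """
    if ocr is None:
        import pytesseract  # imported lazily so the hook can replace it
        ocr = pytesseract.image_to_string

    texts = []
    with Image.open(path) as img:
        # The iterator seeks through the file and yields each frame in turn
        for frame in ImageSequence.Iterator(img):
            # copy() detaches the current frame from the open file
            texts.append(ocr(frame.copy(), lang=lang))
    return texts
```

Calling ocr_all_pages('multipage_tif_example.tif') would then give a list with one string per page, which can be joined or processed page by page.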
The versions I used:
- Python 3.9.7
- pytesseract==0.3.8
- numpy==1.21.3
- Pillow==8.4.0
The example TIFF comes from http://www.nightprogrammer.org/development/multipage-tiff-example-download-test-image-file/
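Applied to the folder loop from the question, the same idea might look like the sketch below, which uses n_frames and seek as in the question's own attempt and records one row per page instead of one per file. The function name collect_rows, the ocr hook, and the page field are assumptions added for illustration; the regular expression is copied from the question.

```python
import glob
import os
import re

from PIL import Image

# Pattern copied unchanged from the question
REGEX1 = re.compile(
    r'(www(i|ı)a\s+bbb(\:)?(\s+|\s+\.)?\s+(de(s|r(:)?))?)',
    flags=re.IGNORECASE,
)


def collect_rows(pattern, ocr=None, lang="eng"):
    """OCR every page of every file matching `pattern`; one dict per page."""
    if ocr is None:
        import pytesseract  # lazy import so a stub can stand in for Tesseract
        ocr = pytesseract.image_to_string

    rows = []
    for entry in sorted(glob.glob(pattern)):
        if not os.path.isfile(entry):
            continue
        with Image.open(entry) as img:
            # Fall back to 1 for formats that do not expose n_frames
            for page in range(getattr(img, "n_frames", 1)):
                img.seek(page)
                text = ocr(img.copy(), lang=lang).replace("\n", " ")
                match = REGEX1.search(text)
                rows.append({
                    "filename": entry,
                    "page": page,
                    "raw_text": text,
                    "var1": match.group(1) if match else None,
                })
    return rows
```

Passing the result to pd.DataFrame(rows) should give the filename, page, raw_text and var1 columns in a single frame, so the separate concat step from the question is not needed.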