Format issue with the output file of pytesseract
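I am trying to extract text from an image using pytesseract.
I want the output file to have the same layout as the image being processed.
By layout I mean that the output text is arranged in rows and columns, as in the input image.
I tried the code below. The text recognition is mostly accurate, but the output file looks completely different from the input.
Code: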
import pytesseract
from pytesseract import Output
from PIL import Image
import pandas as pd

pytesseract.pytesseract.tesseract_cmd = r'/usr/bin/tesseract'
custom_config = r'-c preserve_interword_spaces=1 --oem 1 --psm 1 -l eng'
d = pytesseract.image_to_data(Image.open(r'_0.png'), config=custom_config, output_type=Output.DICT)
df = pd.DataFrame(d)

# clean up blanks; conf can be a string or a number depending on the
# pytesseract/Tesseract version, so compare it numerically
df1 = df[(df.conf.astype(float) != -1) & (df.text.str.strip() != '')]

# sort blocks vertically
sorted_blocks = df1.groupby('block_num').first().sort_values('top').index.tolist()
for block in sorted_blocks:
    curr = df1[df1['block_num'] == block]
    # estimate the average character width from words longer than 3 characters
    sel = curr[curr.text.str.len() > 3]
    char_w = (sel.width / sel.text.str.len()).mean()
    prev_par, prev_line, prev_left = 0, 0, 0
    text = ''
    for ix, ln in curr.iterrows():
        # add new line when necessary
        if prev_par != ln['par_num']:
            text += '\n'
            prev_par = ln['par_num']
            prev_line = ln['line_num']
            prev_left = 0
        elif prev_line != ln['line_num']:
            text += '\n'
            prev_line = ln['line_num']
            prev_left = 0

        added = 0  # num of spaces that should be added
        if ln['left'] / char_w > prev_left + 1:
            added = int(ln['left'] / char_w) - prev_left
            text += ' ' * added
        text += ln['text'] + ' '
        prev_left += len(ln['text']) + added + 1
    text += '\n'
    print(text)
Input image
Output
First, remove the noise: it causes additional recognition errors.
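As a rough illustration of the noise-removal step, here is a minimal sketch using Pillow, assuming the noise is mostly isolated specks on a light background. The helper name `denoise_for_ocr`, the 3x3 median window and the threshold of 160 are my own assumptions, not values tuned for this image.

```python
from PIL import Image, ImageFilter


def denoise_for_ocr(img, median_size=3, threshold=160):
    """Grayscale -> median filter -> hard black/white threshold.

    median_size and threshold are illustrative defaults (assumptions);
    adjust them for the actual scan.
    """
    gray = img.convert("L")
    # A median filter replaces each pixel with the median of its
    # neighbourhood, which wipes out isolated dark or light specks.
    smoothed = gray.filter(ImageFilter.MedianFilter(size=median_size))
    # Binarize: everything brighter than the threshold becomes white.
    return smoothed.point(lambda p: 255 if p > threshold else 0)


# Usage (file name taken from the question):
# clean = denoise_for_ocr(Image.open('_0.png'))
# clean.save('_0_clean.png')
```

A median filter can also erode very thin strokes, so it is worth comparing the OCR result with and without it.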
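Next, try a different output format. For example, hOCR is HTML/XML output that includes bounding-box information, so you get the exact position of each OCR result on the page.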
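To show what the bounding-box information looks like in practice, here is a small sketch that pulls word boxes out of an hOCR string with a regular expression. It assumes the word spans look the way Tesseract usually writes them (`class='ocrx_word'` followed by a `title` that starts with `bbox x0 y0 x1 y1`); `hocr_words` is a hypothetical helper, and a real HTML parser would be more robust.

```python
import html
import re

# Assumed shape of a word span in Tesseract's hOCR output:
# <span class='ocrx_word' id='word_1_1' title='bbox 36 92 96 116; x_wconf 96'>Text</span>
WORD_RE = re.compile(
    r"<span class=['\"]ocrx_word['\"][^>]*"
    r"title=['\"]bbox (\d+) (\d+) (\d+) (\d+)[^'\"]*['\"][^>]*>(.*?)</span>",
    re.DOTALL,
)
TAG_RE = re.compile(r"<[^>]+>")


def hocr_words(hocr):
    """Return (left, top, right, bottom, text) for each recognized word."""
    if isinstance(hocr, bytes):
        hocr = hocr.decode("utf-8")
    words = []
    for x0, y0, x1, y1, inner in WORD_RE.findall(hocr):
        # Drop nested markup such as <strong>/<em> and decode entities.
        text = html.unescape(TAG_RE.sub("", inner)).strip()
        if text:
            words.append((int(x0), int(y0), int(x1), int(y1), text))
    return words


# Usage with pytesseract (needs Tesseract installed):
# import pytesseract
# from PIL import Image
# hocr = pytesseract.image_to_pdf_or_hocr(Image.open('_0.png'), extension='hocr')
# for word in hocr_words(hocr):
#     print(word)
```

With the pixel coordinates in hand, words can be grouped into rows by `top` and into columns by `left`, instead of guessing positions from an average character width.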
If you do not need exact positions, post-processing the plain-text output may be easier. For example, Tesseract 5 with tessdata_best produces this output:
$ tesseract YaVQ3.jpg - --psm 6 --dpi 300 -c preserve_interword_spaces=1
2
wf
10020 Knut Bratli, Brandval P.b. Chrysler 1936
10033 Erland Berg, Gjes&sen P.b. Dodge 1939
10054 Edvart Sandmo, Gardvik P.b. Opel 1937
10057 Hjalmar Aanerud, Vinger P.b. Opel 1932
10075 Reidar Holth, Flisa P.b. Volvo . 1960
10076 Einar Bredalen, Braskereidfoss P.b. Dodge 1929
10077 Reidar Holth, Flisa P.b. Volkswagen 1961
10089 Sor-Odal Bulldozerdrift, Skarnes Lb. White 1944 "
10090 Arne Radford, Galterud Lb. Ford 1939
10093 Sverre Langbraten, Brandval L.b. Citroén 1950
10096 Karl Tuhus, Skotterud P.b. Chrysler 1936
10101 Gunnar Bie-Larsen, Kongsvinger P.b. Ford : ©1961
10110 Martin Albertsen, Flisa Pb. Opel . 1960
10111 Alf @degaard, Kongsvinger P.b. Volkswagen 1958
10112 Asbjern Elverhoi, Kongsvinger Pb. Ford 1961
10114 Olav Sunde jr., Skarnes ¢ P.b. Plymouth 1937
10116 John Erichsen, Skarnes P.b. Ford 1960
10118 Ole Hasleengen, Véler \ Pb. Morris 1931
10120 Harald Eggen, Vinger \ P.b. Peugeot 1938
10121 Ola N. Berg, Gjesisen Pb. Ford 1960
10125 Reldar Rapstad, Roverud Pb. Ford 1954 Pp
10129 Erling Johnsrud, Skarnes Pb. Overland 1939
10130 Reidar Vangen, Disend P.b. Hudson 1947 v
10133 Oddvar Lilleseth, Skarnes V.b. Ford 1934
10136 Hans K. Kolbjornsrud, Austmarka P.b. Volvo 1939
10140 Rolv Snare, Kongsvinger P.h. Mercedes Benz 1950
10143 Olaf Storberget, Grue Finnskog L.b. Land Rover 1951
10146 Helge Strand, Magnor P.b. Hudson 1946
10148 Arne Hagan, Brandval Pb. Volkswagen’ 1957
10159 Brodbelfoss, E.verk, Vinger P.b. Chevrolet 1939
10160 Lauritz Hove, Sander Pb. Ford 1959
10161 Rolf Johnsen, Matrand Lb. Ford * 1937
10168 Sten Sooth Knutsen, Skotterud Pb. Volkswagen 1962
10170 Odd Norli, Knapper P.b. Buick 1938
10175 Gustav Solvang, Kongsvinger L.b. Chevrolet 1939 4
10180 Trygve Wolden, Kongsvinger Pb. Dodge 1920
10182 Kongsv. Handelsgartneri, Kongsv. Stb. Opel 1957
10186 Oddvar Berget, Namni Lb. Fordson 1933
10188 Sander Idrettslag, Sander . Buss Austin +1951
10185 Karl O. Halvorsen, Br.foss L.b. Hanomag 1955
NN -
3
: ll
v -—
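The post-processing mentioned above could look roughly like this. It is only a sketch, assuming every real row starts with a five-digit number and ends with a four-digit year; `parse_rows`, `format_rows` and the column widths are names and values I made up for illustration.

```python
import re

# Assumption: a data row is "<5-digit id> <free text> <4-digit year>",
# possibly with one stray symbol glued to the year and junk after it.
ROW_RE = re.compile(r"^(?P<id>\d{5})\s+(?P<body>.+?)\s+\D?(?P<year>\d{4})\b")


def parse_rows(ocr_text):
    """Return (id, owner, rest, year) tuples, skipping noise lines."""
    rows = []
    for line in ocr_text.splitlines():
        m = ROW_RE.match(line.strip())
        if not m:
            continue  # e.g. "wf" or "NN -" coming from image noise
        # Drop tokens made only of punctuation/symbols (".", ":", "\", "¢").
        tokens = [t for t in m.group("body").split()
                  if any(c.isalnum() for c in t)]
        # Everything before the last comma is treated as the owner.
        owner, _, rest = " ".join(tokens).rpartition(",")
        rows.append((m.group("id"), owner.strip(), rest.strip(),
                     int(m.group("year"))))
    return rows


def format_rows(rows):
    """Render the rows as fixed-width columns (widths are arbitrary)."""
    return "\n".join(f"{rid:<7}{owner:<30}{rest:<38}{year}"
                     for rid, owner, rest, year in rows)


# Usage:
# print(format_rows(parse_rows(open('output.txt', encoding='utf-8').read())))
```

Splitting `rest` further into place, vehicle type and make would need a list of the known type abbreviations, which the OCR reads inconsistently (`P.b.`, `Pb.`, `P.h.`), so that part is left out here.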