Using string output from pytesseract to do a vlookup in a pandas DataFrame
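I am new to Python, and I am trying to make a simple program that goes from an image of a song title to that song's BPM. My approach is to use pytesseract to produce a string from the image, then use that string to do a vlookup in a DataFrame created with pandas. However, the lookup always returns zero, even though the song does exist in the data.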
from PIL import ImageGrab
import numpy as np
import pytesseract
import pandas as pd
pytesseract.pytesseract.tesseract_cmd = r"C:\Program Files\Tesseract-OCR\tesseract.exe"
def getTitleImage(left, top, width, height):
    printscreen_pil = ImageGrab.grab((left, top, left + width, top + height))
    printscreen_numpy = np.array(printscreen_pil.getdata(), dtype='uint8') \
        .reshape((printscreen_pil.size[1], printscreen_pil.size[0], 3))
    return printscreen_numpy
# Printscreen:
titleImage = getTitleImage(x, y, w, h)
# pytesseract to string:
songTitle = pytesseract.image_to_string(titleImage)
print('Name of the song: ', songTitle)
# Importing the csv data via pandas.
songTable = pd.read_csv(r'C:\Users\leech\Desktop\songList.csv')
# A simple vlookup that returns the BPM of the song by taking data from the same row.
bpmSong = songTable[songTable['Song Title'] == songTitle]['BPM'].sum()
print('The BPM of the song is: ', bpmSong)
Output:
Name of the song: Macarena
The BPM of the song is: 0
However, when I hard-code the string into the songTitle variable, it works:
songTitle = 'Macarena'
print('Name of the song: ', songTitle)
songTable = pd.read_csv(r'C:\Users\leech\Desktop\songList.csv')
bpmSong = songTable[songTable['Song Title'] == songTitle]['BPM'].sum()
print('The BPM of the song is: ', bpmSong)
Output:
Name of the song: Macarena
The BPM of the song is: 103
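I checked the string produced by pytesseract: there are no extra spaces before or after it, and it looks exactly the same as the hard-coded string, yet the result is still different. What could be the problem?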
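I found the answer.
It is because the song title that comes from:
songTitle = pytesseract.image_to_string(titleImage)
...is actually 'Macarena\n' rather than 'Macarena'.
The two look the same when printed, except that the former produces a new line after it. Because the strings are not equal, the == comparison against the 'Song Title' column matches no rows, and .sum() over an empty selection returns 0.
A good lesson for me.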
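For anyone who runs into the same thing, here is a minimal sketch of how the problem can be spotted and worked around. It does not call pytesseract at all: raw_title is a hard-coded stand-in for the OCR output described above, song_table is a made-up in-memory replacement for songList.csv, and lookup_bpm is a hypothetical helper name, not something from the original code. Stripping the string is one reasonable fix under those assumptions, not necessarily the only one.

```python
import pandas as pd

# Hypothetical stand-in for songList.csv; the second row is made up for illustration.
song_table = pd.DataFrame(
    {"Song Title": ["Macarena", "Some Other Song"], "BPM": [103, 120]}
)

# Stand-in for pytesseract.image_to_string(titleImage), which returned 'Macarena\n'.
raw_title = "Macarena\n"


def lookup_bpm(table, title):
    """Sum the BPM of every row whose 'Song Title' equals title exactly."""
    return int(table.loc[table["Song Title"] == title, "BPM"].sum())


# repr() shows escape sequences, so the trailing newline becomes visible.
print(repr(raw_title))

# str.strip() removes leading and trailing whitespace, including '\n'.
clean_title = raw_title.strip()

bpm_raw = lookup_bpm(song_table, raw_title)      # no row matches 'Macarena\n'
bpm_clean = lookup_bpm(song_table, clean_title)  # matches the 'Macarena' row

print("BPM with raw OCR string:", bpm_raw)
print("BPM with stripped string:", bpm_clean)
```

Printing with repr() instead of a plain print() is what would have exposed the difference in the first place, since a plain print() renders the newline as whitespace at the end of the output.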