Using string output from pytesseract to do a vlookup in a pandas DataFrame

Question

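I am new to Python, and I am trying to build a simple program that reads a song title from an image and returns the song's BPM. My approach is to use pytesseract to turn the image into a string, and then use that string to do a vlookup in a DataFrame created with pandas. However, the lookup always returns zero, even though the song really is in the data.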

from PIL import ImageGrab
import numpy as np
import pytesseract
import pandas as pd

pytesseract.pytesseract.tesseract_cmd = r"C:\Program Files\Tesseract-OCR\tesseract.exe"

def getTitleImage(left, top, width, height):
    printscreen_pil = ImageGrab.grab((left, top, left + width, top + height))
    printscreen_numpy = np.array(printscreen_pil.getdata(), dtype='uint8') \
        .reshape((printscreen_pil.size[1], printscreen_pil.size[0], 3))
    return printscreen_numpy

# Printscreen:
titleImage = getTitleImage(x, y, w, h)

# pytesseract to string:
songTitle = pytesseract.image_to_string(titleImage)
print('Name of the song: ', songTitle)

# Importing the csv data via pandas.
songTable = pd.read_csv(r'C:\Users\leech\Desktop\songList.csv')

# A simple vlookup that returns the BPM of the song by taking the value from the matching row.
bpmSong = songTable[songTable['Song Title'] == songTitle]['BPM'].sum()
print('The BPM of the song is: ', bpmSong)

Output:

Name of the song: Macarena

The BPM of the song is:  0

However, when I hard-code the string into the songTitle variable, it works:

songTitle = 'Macarena'
print('Name of the song: ', songTitle)
songTable = pd.read_csv(r'C:\Users\leech\Desktop\songList.csv')
bpmSong = songTable[songTable['Song Title'] == songTitle]['BPM'].sum()
print('The BPM of the song is: ', bpmSong)

Output:

Name of the song: Macarena

The BPM of the song is:  103

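I checked the string produced by pytesseract: there are no extra spaces before or after it, and it looks exactly the same as the hard-coded string, yet the result is still different. What could be the problem?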

Answer 1

I found the answer. It is because the song title that comes from:

songTitle = pytesseract.image_to_string(titleImage)

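...is actually 'Macarena\n', not 'Macarena'. The two look the same when printed, except that the former starts a new line after itself. Because of that trailing newline, the == comparison matches no rows, and calling .sum() on the resulting empty selection returns 0. A good lesson for me.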
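A minimal sketch of how the hidden newline could be spotted and removed. It assumes a hard-coded string standing in for the pytesseract output and a small in-memory table standing in for songList.csv. The helper names clean_ocr_text and lookup_bpm, and the second row of the table, are made up for illustration:

```python
import pandas as pd


def clean_ocr_text(raw: str) -> str:
    """Remove leading/trailing whitespace, including newlines and form feeds."""
    return raw.strip()


def lookup_bpm(table: pd.DataFrame, title: str):
    """Return the BPM of the first row whose title matches exactly, else None."""
    matches = table.loc[table["Song Title"] == title, "BPM"]
    return int(matches.iloc[0]) if not matches.empty else None


# Stand-in for pytesseract.image_to_string(titleImage); the trailing newline
# mirrors what the answer describes (assumed value, not real OCR output).
raw_title = "Macarena\n"

# Hypothetical in-memory replacement for songList.csv.
song_table = pd.DataFrame(
    {"Song Title": ["Macarena", "Other Song"], "BPM": [103, 120]}
)

# repr() shows escape sequences, so the hidden newline becomes visible.
print(repr(raw_title))

bpm_raw = lookup_bpm(song_table, raw_title)                    # no match
bpm_clean = lookup_bpm(song_table, clean_ocr_text(raw_title))  # match
print(bpm_raw, bpm_clean)
```

Returning None instead of summing the selection makes a failed lookup easier to tell apart from a song whose BPM is genuinely 0. Depending on the Tesseract version, the output may also end with a form feed character, which str.strip() removes as well.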
