How to assign a list created inside a function to a data frame column

Question

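I am using Python 3.7.4 in VS Code. I wrote a function img_to_text() that takes a PDF file as its argument. It creates a JPEG of the first page of the PDF and reads a string from that image with pytesseract.image_to_string(). The string is then searched for certain names, and any name that appears in it is appended to the list main_consultant_name.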

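Because the whole process takes quite a while, I used multiprocessing to cut the run time, and it did drop from 34 minutes when run sequentially to 2 minutes for 258 PDFs.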

def img_to_text(file):
    main_consultant_name = []
    pytesseract.pytesseract.tesseract_cmd = r'C:\Users\.....\......\Tesseract-OCR\tesseract.exe'
    pages = convert_from_path("C:/pdfs" + file + '.pdf', 500, last_page= 1)
    for page in pages:
        filename = file +'_Page1.jpg'
        page.save("C:/Users/................" + filename, 'JPEG')
        text = str(((pytesseract.image_to_string(Image.open("C:/Users/............../" + filename))))).lower().replace('\n\n',' ')
        consultant_name = []
        for name in consultant_name_lst:
            if name.lower() in text:
                consultant_name.append(name)
        main_consultant_name.append(consultant_name)
    return main_consultant_name

def process_handler():
    with engine.connect() as conn:
        query1 = "SELECT * FROM pdfs;"
        df1 = pd.read_sql(query1, conn)
    files = [file for file in df1['pdfName']]
    with Pool() as pool:
        results = pool.map(img_to_text, files)
    for result in results:
        print(result)

df1['consultant_name'] = main_consultant_name     # problem is here

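I am trying to add the list main_consultant_name as a column of the data frame df1, but I get the error NameError: name 'main_consultant_name' is not defined. I did some research and I more or less understand that, because the list is defined inside a function, it cannot be accessed outside that function. I tried defining the list globally, but that did not work and returned the same error message.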

Any idea what I am doing wrong here? Thanks a lot!

Answer 1

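main_consultant_name is a local variable that you create inside the img_to_text function. That function runs in the pool's worker processes, so the variable never exists in the main process. The value you want is results, which is a list of all the values returned from the pool. Note that results and df1 are themselves local to process_handler, so the assignment has to happen inside that function, or the function has to return them. I am not very familiar with SQLAlchemy, which I believe is what you are using to access the database? Either way, I do not know whether you are adding the column in an appropriate way, and I am also not sure whether you only want to add the column in order to display it or whether you want to insert the values into your database.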
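A minimal sketch of that idea, not your actual OCR code: find_names is a hypothetical stand-in for img_to_text, CONSULTANT_NAMES is a made-up stand-in for consultant_name_lst, and a plain dict of columns stands in for df1 so the example needs nothing outside the standard library. With pandas, the last step would be the same kind of assignment, df1["consultant_name"] = results.

```python
from multiprocessing import Pool

# Hypothetical stand-in for consultant_name_lst from the question.
CONSULTANT_NAMES = ["Alice Smith", "Bob Jones"]


def find_names(text):
    """Stand-in for img_to_text: return the names that occur in one text."""
    lowered = text.lower()
    return [name for name in CONSULTANT_NAMES if name.lower() in lowered]


def names_per_document(texts):
    """Run find_names in worker processes and return one list per input."""
    with Pool(2) as pool:
        # pool.map returns the workers' return values as a list,
        # in the same order as the inputs.
        return pool.map(find_names, texts)


if __name__ == "__main__":
    # A plain dict of columns stands in for df1 here (assumption for the sketch).
    table = {"pdfName": ["a", "b", "c"]}
    texts = [
        "Prepared by ALICE SMITH",
        "no consultant listed",
        "Bob Jones and Alice Smith",
    ]

    results = names_per_document(texts)
    # With pandas this would be: df1["consultant_name"] = results
    table["consultant_name"] = results

    for pdf, names in zip(table["pdfName"], table["consultant_name"]):
        print(pdf, names)
```

Because pool.map keeps the input order, the list it returns lines up row by row with the file names it was given, which is what makes the column assignment safe.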

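Oh, and by the way, if you want to speed up your program and the PDFs have not already been flattened into images, you can use something like PyMuPDF to extract the text from the file directly, so that you do not have to do OCR.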

Answer 2

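Well, the explanation lies in the concepts of namespaces and variable scope. There are always three namespaces, built-in, global and local, and sometimes a fourth one called enclosing. In short, variables declared in a module belong to the global namespace and variables declared in a function belong to a local one. You can access the global namespace from a local one, like this: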

a = 'Hello'
def testing():
  return a

print(testing()) # Prints 'Hello'

But you cannot reach into a local namespace from the global one, which is what your code is trying to do. To show this with the previous example:

def testing():
  a = 'Hello'
  return a

print(a)

This raises the error: NameError: name 'a' is not defined

So what you can do is capture what img_to_text returns and then assign it to df1['consultant_name']:

def testing():
  a = 'Hello'
  return a

result = testing()
print(result) # Prints 'Hello'

Or use something like global, but that is not recommended:

a = ''
def testing():
  global a
  a = 'Hello'
  return a

result = testing()
print(result)
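One caveat for the multiprocessing setup in the question: each worker process has its own copy of the module's global variables, so a global list that is appended to inside a pooled function is not updated in the main process. A minimal sketch of that behaviour, where square_and_record and collected are hypothetical names used only for illustration:

```python
from multiprocessing import Pool

collected = []  # module-level (global) list


def square_and_record(n):
    """Append to the global list and also return the value."""
    collected.append(n * n)
    return n * n


if __name__ == "__main__":
    with Pool(2) as pool:
        returned = pool.map(square_and_record, [1, 2, 3])

    # The appends happened in the workers' own copies of `collected`,
    # so only the returned values make it back to the main process.
    print("returned from workers:", returned)
    print("global list in the main process:", collected)
```

This is one more reason to rely on return values rather than on global when the function runs in a pool.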

Hope that helps :)

python

function

global-variables

multiprocessing

python-tesseract