Using functools.partial to make custom filters for pdfquery getting attribute error

Question

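Background

I am using pdfquery to parse multiple files, such as this one.

Problem

I am trying to write a generic filter function, building on the custom selectors mentioned in pdfquery's docs, that can take a specific range as an argument. Because this is referenced inside the filter function, I thought I could get around the problem by using functools.partial to supply the range argument (as shown below).

Input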

import pdfquery
import functools

def load_file(PDF_FILE):
    pdf = pdfquery.PDFQuery(PDF_FILE)
    pdf.load()
    return pdf

file_with_table = 'Path to the file mentioned above'
pdf = load_file(file_with_table)


def elements_in_range(x1_range):
    return in_range(x1_range[0], x1_range[1], float(this.get('x1',0)))

x1_part = functools.partial(elements_in_range, (95,350))

pdf.pq('LTPage[page_index="0"] *').filter(x1_part)

But when I do this, I get the following attribute error:

Output

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
C:\Anaconda3\lib\site-packages\pyquery\pyquery.py in filter(self, selector)
    597                     if len(args) == 1:
--> 598                         func_globals(selector)['this'] = this
    599                     if callback(selector, i, this):

C:\Anaconda3\lib\site-packages\pyquery\pyquery.py in func_globals(f)
     28 def func_globals(f):
---> 29     return f.__globals__ if PY3k else f.func_globals
     30 

AttributeError: 'functools.partial' object has no attribute '__globals__'

During handling of the above exception, another exception occurred:

AttributeError                            Traceback (most recent call last)
<ipython-input-74-d75c2c19f74b> in <module>()
     15 x1_part = functools.partial(elements_in_range, (95,350))
     16 
---> 17 pdf.pq('LTPage[page_index="0"] *').filter(x1_part)

C:\Anaconda3\lib\site-packages\pyquery\pyquery.py in filter(self, selector)
    600                         elements.append(this)
    601             finally:
--> 602                 f_globals = func_globals(selector)
    603                 if 'this' in f_globals:
    604                     del f_globals['this']

C:\Anaconda3\lib\site-packages\pyquery\pyquery.py in func_globals(f)
     27 
     28 def func_globals(f):
---> 29     return f.__globals__ if PY3k else f.func_globals
     30 
     31 

AttributeError: 'functools.partial' object has no attribute '__globals__'

Is there any way to get around this? Or is there some other way to write custom selectors for pdfquery that can take arguments?

Answer 1

What about just using a function that returns a new function (similar to functools.partial in a way), but built as a closure instead?

import pdfquery

def load_file(PDF_FILE):
    pdf = pdfquery.PDFQuery(PDF_FILE)
    pdf.load()
    return pdf

file_with_table = './RG234621_90110.pdf'
pdf = load_file(file_with_table)

def in_range(x1, x2, sample):
    return x1 <= sample <= x2

def in_x_range(bounds):
    def wrapped(*args, **kwargs):
        x = float(this.get('x1', 0))
        return in_range(bounds[0], bounds[1], x)
    return wrapped

def in_y_range(bounds):
    def wrapped(*args, **kwargs):
        y = float(this.get('y1', 0))
        return in_range(bounds[0], bounds[1], y)
    return wrapped


print(len(pdf.pq('LTPage[page_index="0"] *').filter(in_x_range((95, 350))).filter(in_y_range((60, 100)))))

# Or, perhaps easier to read

x_check = in_x_range((95, 350))
y_check = in_y_range((60, 100))

print(len(pdf.pq('LTPage[page_index="0"] *').filter(x_check).filter(y_check)))

Output

1
16 # <-- bounds check is larger for y in this test
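The difference between the two approaches can be reproduced without pdfquery. The sketch below is only an illustration: it imitates the step visible in the traceback (pyquery writes this into the callable's __globals__ before calling it) and uses a plain dict as a hypothetical stand-in for a real element. The names make_range_filter, takes_bounds and fake_element are made up for this example.

```python
import functools

def make_range_filter(bounds):
    # Closure: `bounds` is captured; `this` is resolved in the module
    # globals at call time, which is where pyquery places it.
    def wrapped(*args, **kwargs):
        return bounds[0] <= float(this.get('x1', 0)) <= bounds[1]
    return wrapped

def takes_bounds(bounds):
    return bounds[0] <= float(this.get('x1', 0)) <= bounds[1]

closure_filter = make_range_filter((95, 350))
partial_filter = functools.partial(takes_bounds, (95, 350))

# A plain function (including a closure) carries __globals__;
# a functools.partial object does not, hence the AttributeError.
closure_has_globals = hasattr(closure_filter, '__globals__')
partial_has_globals = hasattr(partial_filter, '__globals__')

# Stand-in for the traceback line: func_globals(selector)['this'] = this
fake_element = {'x1': '120.5'}  # hypothetical; real elements are lxml nodes
closure_filter.__globals__['this'] = fake_element
try:
    closure_result = closure_filter()
finally:
    del closure_filter.__globals__['this']
```

Because the closure is an ordinary function object, the injection step has somewhere to write to, while the bounds travel with the closure rather than through the argument list.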

You could even parameterize the property you are comparing:

import pdfquery

def load_file(PDF_FILE):
    pdf = pdfquery.PDFQuery(PDF_FILE)
    pdf.load()
    return pdf

file_with_table = './RG234621_90110.pdf'
pdf = load_file(file_with_table)

def in_range(prop, bounds):
    def wrapped(*args, **kwargs):
        n = float(this.get(prop, 0))
        return bounds[0] <= n <= bounds[1]
    return wrapped


print(len(pdf.pq('LTPage[page_index="0"] *').filter(in_range('x1', (95, 350))).filter(in_range('y1', (60, 100)))))

x_check = in_range('x1', (95, 350))
y_check = in_range('y1', (40, 100))

print(len(pdf.pq('LTPage[page_index="0"] *').filter(x_check).filter(y_check)))
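A possible variation, offered as an assumption rather than something tested here: the traceback shows pyquery invoking callback(selector, i, this), which suggests a filter may also declare the index and element as ordinary parameters instead of relying on the injected global. The sketch below exercises such a factory against plain dicts (hypothetical stand-ins for the parsed elements), so it can be tried without a PDF.

```python
def in_range(prop, bounds):
    # `this` is an ordinary parameter here, not an injected global.
    def wrapped(i, this):
        n = float(this.get(prop, 0))
        return bounds[0] <= n <= bounds[1]
    return wrapped

# Hypothetical data: dicts standing in for the elements pdfquery yields.
elements = [
    {'x1': '100', 'y1': '70'},   # inside both ranges
    {'x1': '400', 'y1': '70'},   # x1 out of range
    {'x1': '120', 'y1': '10'},   # y1 out of range
]

x_check = in_range('x1', (95, 350))
y_check = in_range('y1', (60, 100))

kept = [e for i, e in enumerate(elements) if x_check(i, e) and y_check(i, e)]

# With pdfquery this would presumably be used the same way as above:
# pdf.pq('LTPage[page_index="0"] *').filter(x_check).filter(y_check)
```

If your pyquery version does not pass the element positionally, the global-based version above remains the safer choice.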

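I would also suggest using the parse_tree_cacher argument, since it shortened the time it took me to reach a working solution (although you may not need to reprocess the file as often as I did while working this out).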

import pdfquery
from pdfquery.cache import FileCache

def load_file(PDF_FILE):
    pdf = pdfquery.PDFQuery(PDF_FILE, parse_tree_cacher=FileCache("/tmp/"))
    pdf.load()
    return pdf
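Since the traceback in the question comes from a Windows machine, where "/tmp/" may not exist, one option is to derive the cache directory from the standard library. This is a minimal sketch; the helper name cache_dir and the folder name pdfquery_cache are made up for illustration, and the trailing separator is kept on the assumption that FileCache expects a path shaped like the "/tmp/" used above.

```python
import os
import tempfile

def cache_dir(name="pdfquery_cache"):
    # Build <system temp dir>/<name>/ and make sure it exists.
    # Joining with "" leaves a trailing path separator on the result.
    path = os.path.join(tempfile.gettempdir(), name, "")
    os.makedirs(path, exist_ok=True)
    return path

cache_path = cache_dir()

# Intended use (requires pdfquery, so left as a comment):
# pdf = pdfquery.PDFQuery(PDF_FILE, parse_tree_cacher=FileCache(cache_path))
```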

Answer 2

As much as I like the closure approach, I should mention that you can also copy the attributes of the wrapped function onto the wrapper.

from functools import partial, update_wrapper

custom_filter = update_wrapper(
    partial(
        elements_in_range, (95, 20)
    ),
    wrapped=elements_in_range,
    assigned=('__globals__', '__code__')
)
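To check what this copying actually does, here is a small self-contained sketch that runs without pdfquery. It redefines elements_in_range with an inline comparison, uses the question's (95, 350) bounds, and uses a dict as a hypothetical element. One thing it makes visible: the copied __code__ still reports the original function's argument count (1), not the partial's remaining arity (0), so it is worth testing how your pyquery version calls such an object before relying on it.

```python
from functools import partial, update_wrapper

def elements_in_range(x1_range):
    return x1_range[0] <= float(this.get('x1', 0)) <= x1_range[1]

plain_partial = partial(elements_in_range, (95, 350))

custom_filter = update_wrapper(
    partial(elements_in_range, (95, 350)),
    wrapped=elements_in_range,
    assigned=('__globals__', '__code__'),
)

# A bare partial lacks __globals__; the wrapped one borrows the function's.
had_globals_before = hasattr(plain_partial, '__globals__')
shares_globals = custom_filter.__globals__ is elements_in_range.__globals__

# The copied code object describes elements_in_range, not the partial.
copied_argcount = custom_filter.__code__.co_argcount

# Imitate the injection of `this`, then call with no extra arguments.
custom_filter.__globals__['this'] = {'x1': '120.5'}  # hypothetical element
try:
    result = custom_filter()
finally:
    del custom_filter.__globals__['this']
```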
