In BeautifulSoup, what's the proper way to use a strainer with lxml parsing?

Question

I'm using Beautiful Soup 4 and Python 3.8. I only want to parse certain elements of an HTML page, so I decided to use a strainer like this ...

req = urllib2.Request(full_url, headers=settings.HDR)
html = urllib2.urlopen(req).read()
soup = BeautifulSoup(html, features="lxml", parse_only=DictionaryService.idiom_match_strainer)

...

    @staticmethod
    def idiom_match_strainer(elem, attrs):
        if elem == 'ul' and 'class' in attrs and attrs['class'] == 'idiKw':
            return True
        return False

Unfortunately, when I try to parse any URL (https://idioms.thefreedictionary.com/testing is one example), I get the following error:

Internal Server Error: /ajax/get_hints
Traceback (most recent call last):
  File "/Users/davea/Documents/workspace/dictionary_project/venv/lib/python3.8/site-packages/django/core/handlers/exception.py", line 34, in inner
    response = get_response(request)
  File "/Users/davea/Documents/workspace/dictionary_project/venv/lib/python3.8/site-packages/django/core/handlers/base.py", line 126, in _get_response
    response = self.process_exception_by_middleware(e, request)
  File "/Users/davea/Documents/workspace/dictionary_project/venv/lib/python3.8/site-packages/django/core/handlers/base.py", line 124, in _get_response
    response = wrapped_callback(request, *callback_args, **callback_kwargs)
  File "/Users/davea/Documents/workspace/dictionary_project/dictionary/views.py", line 194, in get_hints
    objects = s.get_hints(article)
  File "/Users/davea/Documents/workspace/dictionary_project/dictionary/services/article_service.py", line 398, in get_hints
    idioms = DictionaryService.get_idioms(word)
  File "/Users/davea/Documents/workspace/dictionary_project/dictionary/services/dictionary_service.py", line 75, in get_idioms
    soup = BeautifulSoup(html, features="lxml", parse_only=DictionaryService.idiom_match_strainer)
  File "/Users/davea/Documents/workspace/dictionary_project/venv/lib/python3.8/site-packages/bs4/__init__.py", line 281, in __init__
    self._feed()
  File "/Users/davea/Documents/workspace/dictionary_project/venv/lib/python3.8/site-packages/bs4/__init__.py", line 342, in _feed
    self.builder.feed(self.markup)
  File "/Users/davea/Documents/workspace/dictionary_project/venv/lib/python3.8/site-packages/bs4/builder/_lxml.py", line 287, in feed
    self.parser.feed(markup)
  File "src/lxml/parser.pxi", line 1242, in lxml.etree._FeedParser.feed
  File "src/lxml/parser.pxi", line 1364, in lxml.etree._FeedParser.feed
  File "src/lxml/parsertarget.pxi", line 148, in lxml.etree._TargetParserContext._handleParseResult
  File "src/lxml/parsertarget.pxi", line 136, in lxml.etree._TargetParserContext._handleParseResult
  File "src/lxml/etree.pyx", line 314, in lxml.etree._ExceptionContext._raise_if_stored
  File "src/lxml/saxparser.pxi", line 389, in lxml.etree._handleSaxTargetStartNoNs
  File "src/lxml/saxparser.pxi", line 404, in lxml.etree._callTargetSaxStart
  File "src/lxml/parsertarget.pxi", line 80, in lxml.etree._PythonSaxParserTarget._handleSaxStart
  File "/Users/davea/Documents/workspace/dictionary_project/venv/lib/python3.8/site-packages/bs4/builder/_lxml.py", line 220, in start
    self.soup.handle_starttag(name, namespace, nsprefix, attrs)
  File "/Users/davea/Documents/workspace/dictionary_project/venv/lib/python3.8/site-packages/bs4/__init__.py", line 582, in handle_starttag
    and (self.parse_only.text
AttributeError: 'function' object has no attribute 'text'

Should I be using the strainer in some other way?

Answer 1

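The parse_only argument expects a SoupStrainer object, not a plain function. That is why the traceback ends with 'function' object has no attribute 'text': Beautiful Soup tries to read parse_only.text, which a function does not have. Using the SoupStrainer class from the bs4 package is enough: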

from bs4 import BeautifulSoup
from bs4 import SoupStrainer

html = '<html><body><section><ul class="foo"><li>a<li>b</ul><ul><li>1<li>2</ul></section><ul class="foo"><li>c<li>d</ul></body></html>'

soup = BeautifulSoup(html, features="lxml", parse_only=SoupStrainer('ul', class_='foo'))

print(soup.prettify())

which gives:

<ul class="foo">
 <li>
  a
 </li>
 <li>
  b
 </li>
</ul>
<ul class="foo">
 <li>
  c
 </li>
 <li>
  d
 </li>
</ul>

So for your call, I think you want parse_only=SoupStrainer('ul', class_='idiKw').
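Applied to the code in the question, the fix could look roughly like the sketch below. The helper name extract_idioms and the sample markup are made up for illustration (the real page structure may differ); only the SoupStrainer('ul', class_='idiKw') call comes from the suggestion above.

```python
from bs4 import BeautifulSoup, SoupStrainer

# Build the strainer once; it keeps only <ul class="idiKw"> elements
# (and everything nested inside them).
idiom_strainer = SoupStrainer("ul", class_="idiKw")


def extract_idioms(html):
    """Hypothetical helper: return the text of each <li> in the matching lists."""
    soup = BeautifulSoup(html, features="lxml", parse_only=idiom_strainer)
    return [li.get_text(strip=True) for li in soup.find_all("li")]


# Made-up sample markup, not the real structure of the target page.
sample = (
    '<html><body><ul class="nav"><li>home</li></ul>'
    '<ul class="idiKw"><li>testing the waters</li><li>test of time</li></ul>'
    "</body></html>"
)

print(extract_idioms(sample))
```

Because the strainer is applied while parsing, the nav list never makes it into the soup, so find_all only sees the list items from the idiKw list.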

lxml

beautifulsoup

html-parsing

python-3.x