Set lxml as default BeautifulSoup parser

Question

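I'm working on a web scraping project and have run into problems with speed. To try to fix it, I want to use lxml instead of html.parser as BeautifulSoup's parser. I've been able to do this: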

soup = bs4.BeautifulSoup(html, 'lxml')

But I don't want to have to repeatedly type 'lxml' every time I call BeautifulSoup. Is there a way I can set which parser to use once at the beginning of my program?

Answer 1

According to the "Specifying the parser to use" documentation page:

The first argument to the BeautifulSoup constructor is a string or an open filehandle–the markup you want parsed. The second argument is how you’d like the markup parsed.

If you don’t specify anything, you’ll get the best HTML parser that’s installed. Beautiful Soup ranks lxml’s parser as being the best, then html5lib’s, then Python’s built-in parser.

In other words, just installing lxml in the same Python environment makes it the default parser.

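Note, though, that stating the parser explicitly is considered best practice. The differences between parsers can lead to subtle errors that would be hard to debug if you let BeautifulSoup pick the best parser on its own. You would also have to remember that lxml needs to be installed. And if it is not installed, you would not even notice: BeautifulSoup would just pick the next available parser without raising any error.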
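One way to catch the silent fallback described above is a startup check. The following is a minimal sketch, not part of BeautifulSoup: the helper names `parser_module_available` and `require_parser_module` are hypothetical, and the check only uses the standard library to see whether a top-level module can be found.

```python
import importlib.util


def parser_module_available(module_name):
    """Return True if the named top-level module can be found in this environment."""
    return importlib.util.find_spec(module_name) is not None


def require_parser_module(module_name="lxml"):
    """Raise early instead of letting BeautifulSoup quietly use another parser.

    Hypothetical helper: call it once at program start.
    """
    if not parser_module_available(module_name):
        raise RuntimeError(
            f"{module_name} is not installed; "
            "BeautifulSoup would fall back to another parser"
        )
```

Calling `require_parser_module()` once at the start of the program turns a missing lxml into an immediate, readable error. This mostly matters when the parser is left implicit; as far as I know, passing 'lxml' explicitly already makes BeautifulSoup raise an error when lxml is missing.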

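If you still don't want to specify the parser explicitly, at least leave a note in the project's README or documentation for your future self and for anyone else who will use your code, and list lxml in your project requirements alongside beautifulsoup4.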

Besides: "Explicit is better than implicit."

Answer 2

Obviously, take a look at the other answer first; it is pretty good. As for this technical point:

but I don't want to have to repeatedly type 'lxml' every time I call BeautifulSoup. Is there a way I can set which parser to use once at the beginning of my program?

If I understood your question correctly, I can think of two ways to save you some keystrokes: define a wrapper function, or create a partial function.

# V1 - define a wrapper function - most straight-forward.
import bs4

def bs_parse(html):
    return bs4.BeautifulSoup(html, 'lxml')
# ...
html = ...
bs_parse(html)

Or if you want to show off...

# V2 - create a partial function with the parser pre-filled.
import bs4
from functools import partial
bs_parse = partial(bs4.BeautifulSoup, features='lxml')
# ...
html = ...
bs_parse(html)
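One detail of the partial version is worth seeing in isolation: because `features` is bound as a keyword, it can still be overridden by keyword, but passing a parser positionally collides with it. The sketch below uses a hypothetical stand-in function, `fake_soup`, instead of bs4.BeautifulSoup so that it runs without bs4 installed; it only illustrates how functools.partial handles the arguments.

```python
from functools import partial


def fake_soup(markup, features=None):
    """Hypothetical stand-in for bs4.BeautifulSoup; just reports its arguments."""
    return (markup, features)


# Same shape as bs_parse above, but bound to the stand-in.
parse = partial(fake_soup, features="lxml")

# The bound keyword is used when nothing else is given.
default_call = parse("<p>hi</p>")

# A keyword argument at call time overrides the bound value.
override_call = parse("<p>hi</p>", features="html.parser")

# A positional second argument clashes with the bound keyword.
try:
    parse("<p>hi</p>", "html.parser")
    positional_ok = True
except TypeError:
    positional_ok = False
```

So with the partial, a one-off switch to another parser has to be written as `bs_parse(html, features='html.parser')`, whereas the wrapper function in V1 does not accept a parser argument at all unless you add one.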
