HTML parsing not working as expected using BeautifulSoup

Question

I am using Python 3 with the BeautifulSoup module, version 4.9.3. I am trying to use this package to practice parsing some simple HTML.

My string is as follows:

text = '''<li><p>Some text</p>is put here</li><li><p>And other text is put here</p></li>'''

I use BeautifulSoup as follows:

from bs4 import BeautifulSoup
x = BeautifulSoup(text, "html.parser")

Then I use the following script to experiment with Beautiful Soup's features:

for li in x.find_all('li'):
    print(li)
    print(li.string)
    print(li.next_element)
    print(li.next_element.string)
    print("\n")

The results (at least for the first iteration) are unexpected:

<li><p>Some text</p>is put here</li>
None
<p>Some text</p>
Some text


<li><p>And other text is put here</p></li>
And other text is put here
<p>And other text is put here</p>
And other text is put here

Why is the string attribute of the first li tag None, whereas the string attribute of the inner p tag is not None?

Similarly, if I do this:

import re
x.find_all('li', string=re.compile('text'))

I get only one result (the second tag).

But if I do this:

for li in x.find_all('li'):
    print(li.find_all(string=re.compile('text')))

I get two results (one for each tag).

Answer 1

Paraphrasing the documentation:

If a tag has only one child, and that child is a NavigableString, the child is made available as .string.

If a tag’s only child is another tag, and that tag has a .string, then the parent tag is considered to have the same .string as its child.

If a tag contains more than one thing, then it’s not clear what .string should refer to, so .string is defined to be None.
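The three rules can be checked one at a time with tiny documents. This is only a sketch: the helper name string_of is made up for illustration, and it assumes bs4 is installed and that html.parser is used, as in the question.

```python
from bs4 import BeautifulSoup


def string_of(markup):
    """Hypothetical helper: return .string of the first tag in the markup."""
    soup = BeautifulSoup(markup, "html.parser")
    return soup.find(True).string  # find(True) returns the first tag of any name


# Rule 1: the only child is a NavigableString.
only_text = string_of("<p>Some text</p>")

# Rule 2: the only child is a tag that itself has a .string.
only_tag = string_of("<li><p>Some text</p></li>")

# Rule 3: more than one child, so .string is None.
mixed = string_of("<li><p>Some text</p>is put here</li>")

print(repr(only_text), repr(only_tag), repr(mixed))
```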

Let's apply these rules to your question:

Why is the string attribute of the first li tag None, whereas the string attribute of the inner p tag is not None?

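The inner p tag falls under rule #1: it has exactly one child, and that child is a NavigableString, so .string returns that child.

The first li falls under rule #3: it has more than one child (the p tag and the trailing string "is put here"), so .string would be ambiguous and is defined to be None. The second li falls under rule #2: its only child is a p tag that has a .string, so the li reports the same .string.

As for your second question, let's consult the documentation for the string= argument to .find_all():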
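To see the difference directly, it may help to look at .contents for each li from the question. The sketch below also shows .strings and get_text(), which are one possible way to get at all of the text when .string is None. The variable names are mine, not from the question.

```python
from bs4 import BeautifulSoup

text = '''<li><p>Some text</p>is put here</li><li><p>And other text is put here</p></li>'''
soup = BeautifulSoup(text, "html.parser")
first_li, second_li = soup.find_all("li")

# The first li has two children (a p tag and a bare string); the second has one.
print(first_li.contents)
print(second_li.contents)

# .strings yields every string in the subtree, regardless of nesting.
first_strings = list(first_li.strings)

# get_text() joins those strings; the separator and strip arguments are optional.
first_text = first_li.get_text(" ", strip=True)
print(first_strings, first_text)
```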

With string you can search for strings instead of tags. ... Although string is for finding strings, you can combine it with arguments that find tags: Beautiful Soup will find all tags whose .string matches your value for string.

Your first example:

x.find_all('li', string=re.compile('text'))
# [<li><p>And other text is put here</p></li>]

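This searches for all li tags whose .string matches the regular expression. But we have already seen that the first li's .string is None, so it does not match.

Your second example: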
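Roughly speaking, combining a tag name with string= behaves like filtering the li tags by hand on their .string. The list comprehension below is my own approximation of that behaviour, not Beautiful Soup's actual implementation, and I have only reasoned about it for this input.

```python
import re
from bs4 import BeautifulSoup

text = '''<li><p>Some text</p>is put here</li><li><p>And other text is put here</p></li>'''
soup = BeautifulSoup(text, "html.parser")
pattern = re.compile("text")

# Hand-written approximation: keep li tags whose .string exists and matches.
manual = [
    li for li in soup.find_all("li")
    if li.string is not None and pattern.search(li.string)
]

# The call from the question.
builtin = soup.find_all("li", string=pattern)

print(manual)
print(builtin)
```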

for li in x.find_all('li'):
    print(li.find_all(string=re.compile('text')))
# ['Some text']
# ['And other text is put here']

This searches for all strings contained anywhere in each li's subtree. For the first tree, li.p.string exists and matches, even though li.string is None.
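If what you actually want is the li tags themselves whenever any text inside them matches, one option is to pass a function to find_all() and test the tag's full text. This is a sketch of that idea; the function name li_with_matching_text is hypothetical, and matching against get_text() is my choice rather than something the documentation prescribes for this case.

```python
import re
from bs4 import BeautifulSoup

text = '''<li><p>Some text</p>is put here</li><li><p>And other text is put here</p></li>'''
soup = BeautifulSoup(text, "html.parser")
pattern = re.compile("text")


def li_with_matching_text(tag):
    """Hypothetical filter: an li whose combined text matches the pattern."""
    return tag.name == "li" and pattern.search(tag.get_text()) is not None


# find_all() calls the function on every tag and keeps those returning True.
matches = soup.find_all(li_with_matching_text)
print(matches)
```

Note that get_text() concatenates strings without a separator by default, so a pattern could in principle match across the boundary between two adjacent strings.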
