Why won't lxml strip section tags?

Question

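I am trying to parse some HTML with lxml and Python. I want to remove the section tags. lxml seems to be able to remove every other tag I specify, but not section tags.

For example: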

from lxml import etree

test_html = '<section> <header> Test header </header> <p> Test text </p> </section>'
to_parse_html = etree.fromstring(test_html)

etree.strip_tags(to_parse_html, 'header')
etree.tostring(to_parse_html)

'<section>  Test header  <p> Test text </p> </section>'

etree.strip_tags(to_parse_html,'p')
etree.tostring(to_parse_html)
'<section>  Test header   Test text  </section>'

etree.strip_tags(to_parse_html,'section')
etree.tostring(to_parse_html)
'<section>  Test header   Test text  </section>'

Why is this the case?

Answer 1

Why is this the case?

It isn't specific to section tags. The documentation for strip_tags says:

Note that this will not delete the element (or ElementTree root element) that you passed even if it matches. It will only treat its descendants.

So:

>>> from lxml import etree
>>> tree = etree.fromstring('<section> outer <section> inner </section> </section>')
>>> etree.strip_tags(tree, 'section')
>>> etree.tostring(tree)
b'<section> outer  inner  </section>'

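The behavior you are seeing has nothing to do with the <section> tag as such; it just happens to be the outermost tag of your snippet. So the actual answer to your question is "because it's implemented that way".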
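To illustrate that this is about the root element rather than about section, here is a minimal sketch with a made-up snippet whose root is a div (the markup and variable names are only for illustration). The nested section is stripped, while the div that was passed to strip_tags survives even though it is listed:

```python
from lxml import etree

# Hypothetical snippet: the root is <div>, the <section> is a descendant.
root = etree.fromstring('<div> a <section> b </section> </div>')

# Ask lxml to strip both tag names.
etree.strip_tags(root, 'div', 'section')

# The descendant <section> is merged into its parent's text,
# but the element passed in (<div>) is left in place.
print(etree.tostring(root))
```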

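To remove the outermost tag: can you change the code that creates <section>...</section> so that it does not add it in the first place? If not, ElementDepthFirstIterator might do the trick. With inclusive=False it walks only the descendants and skips the element you pass in: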

>>> tree = etree.fromstring('<section> outer <section> inner </section> </section>')
>>> for val in etree.ElementDepthFirstIterator(tree, tag=None, inclusive=False):
...  print(etree.tostring(val))

b'<section> inner </section> '
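Another option, if you cannot change how the markup is produced, is to give the parsed tree a throwaway parent so that the original root becomes a descendant and can be stripped like any other element. This is only a sketch: the helper name strip_including_root and the "wrapper" tag are assumptions made up for this example, not part of lxml.

```python
from lxml import etree

def strip_including_root(xml_text, *tags):
    """Strip `tags` everywhere in `xml_text`, including its outermost element.

    Hypothetical helper: the parsed tree is placed under a temporary
    <wrapper> element, so the original root is now a descendant and
    strip_tags() will treat it like any other match.
    """
    wrapper = etree.Element('wrapper')
    wrapper.append(etree.fromstring(xml_text))
    etree.strip_tags(wrapper, *tags)
    return wrapper

result = strip_including_root(
    '<section> outer <section> inner </section> </section>', 'section')

# Only the throwaway wrapper remains; its text holds the merged content.
print(etree.tostring(result))
print(repr(result.text))
```

The text you are after is then available as result.text; the wrapper itself is never stripped, for the same reason the original root was not.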
