Why won't lxml strip section tags?
I am trying to parse some HTML with lxml and Python. I want to remove section tags. lxml seems able to remove every other tag I specify, but not section tags.
For example:
from lxml import etree
test_html = '<section> <header> Test header </header> <p> Test text </p> </section>'
to_parse_html = etree.fromstring(test_html)
etree.strip_tags(to_parse_html,'header')
etree.tostring(to_parse_html)
'<section> Test header <p> Test text </p> </section>'
etree.strip_tags(to_parse_html,'p')
etree.tostring(to_parse_html)
'<section> Test header Test text </section>'
etree.strip_tags(to_parse_html,'section')
etree.tostring(to_parse_html)
'<section> Test header Test text </section>'
Why is this the case?
It isn't specific to section tags. The documentation says the following:
Note that this will not delete the element (or ElementTree root
element) that you passed even if it matches. It will only treat its
descendants.
So:
>>> tree = etree.fromstring('<section> outer <section> inner </section> </section>')
>>> etree.strip_tags(tree, 'section')
>>> etree.tostring(tree)
'<section> outer inner </section>'
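The behavior you are seeing has nothing to do with the <section> tag as such; it just happens to be the outermost tag of your snippet. So the actual answer to your question is "because it's implemented that way".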
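A quick way to convince yourself that the tag name is irrelevant is to repeat the experiment with a different root. The snippet below is only an illustrative sketch: the <div> markup is made up for the demonstration and is not from the question.

```python
from lxml import etree

# Same shape as the <section> example, but with <div> as the outermost tag.
root = etree.fromstring('<div> outer <div> inner </div> </div>')
etree.strip_tags(root, 'div')

# The nested <div> was merged into its parent; the element passed in remains.
print(root.tag)   # tag of the element we passed to strip_tags
print(len(root))  # number of child elements left under it
```

The element handed to strip_tags survives and only its descendants are stripped, whatever the tag is called.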
To remove the outermost tag: is it possible to change the code that creates the <section>...</section> so that it does this for you? If not, ElementDepthFirstIterator might do the trick:
>>> tree = etree.fromstring('<section> outer <section> inner </section> </section>')
>>> for val in etree.ElementDepthFirstIterator(tree, tag=None, inclusive=False):
... print(etree.tostring(val))
b'<section> inner </section> '