Combine multiple tags with lxml
I have an HTML file that looks like this:
...
<p>
<strong>This is </strong>
<strong>a lin</strong>
<strong>e which I want to </strong>
<strong>join.</strong>
</p>
<p>
2.
<strong>But do not </strong>
<strong>touch this</strong>
<em>Maybe some other tags as well.</em>
bla bla blah...
</p>
...
What I need is: if all the tags inside a 'p' block are 'strong', merge them into a single line, i.e.
<p>
<strong>This is a line which I want to join.</strong>
</p>
without touching the other block, since it contains other stuff.
Any suggestions? I am using lxml.
Update:
Here is what I have tried so far:
for p in self.tree.xpath('//body/p'):
    if p.tail is None:  # no text before first element
        children = p.getchildren()
        for child in children:
            if len(children) == 1 or child.tag != 'strong' or child.tail is not None:
                break
        else:
            etree.strip_tags(p, 'strong')
With this code I was able to strip the strong tags in the desired block, giving:
<p>
This is a line which I want to join.
</p>
So now I just need a way to put the tag back...
I could do this with bs4 (BeautifulSoup):
from bs4 import BeautifulSoup as bs
html = """<p>
<strong>This is </strong>
<strong>a lin</strong>
<strong>e which I want to </strong>
<strong>join.</strong>
</p>
<p>
<strong>But do not </strong>
<strong>touch this</strong>
</p>"""
soup = bs(html)
s = ''
# note that I use the 0th <p> block ...[0],
# so make the appropriate change in your code
for t in soup.find_all('p')[0].text:
    s = s + t.strip('\n')
s = '<p><strong>' + s + '</strong></p>'
print(s)  # prints: <p><strong>This is a line which I want to join.</strong></p>
Then use replace_with():
p_tag = soup.p
p_tag.replace_with(bs(s, 'html.parser'))
print(soup)
This prints:
<html><body><p><strong>This is a line which I want to join.</strong></p>
<p>
<strong>But do not </strong>
<strong>touch this</strong>
</p></body></html>
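For reference, here is a sketch that generalizes the same bs4 idea to every <p> block instead of hard-coding index [0]. The sample markup and the whitespace handling are my own assumptions, not part of the answer above:

from bs4 import BeautifulSoup

html_doc = """<p>
<strong>This is </strong>
<strong>a lin</strong>
<strong>e which I want to </strong>
<strong>join.</strong>
</p>
<p>
2.
<strong>But do not </strong>
<strong>touch this</strong>
<em>Maybe some other tags as well.</em>
bla bla blah...
</p>"""

soup = BeautifulSoup(html_doc, 'html.parser')
for p in soup.find_all('p'):
    children = p.find_all(recursive=False)                        # direct element children
    own_text = ''.join(p.find_all(string=True, recursive=False))  # text owned by <p> itself
    # rewrite only if every child is <strong> and the <p> has no text of its own
    if children and all(c.name == 'strong' for c in children) and not own_text.strip():
        merged = ''.join(c.get_text() for c in children).replace('\n', '')
        p.clear()
        new_strong = soup.new_tag('strong')
        new_strong.string = merged
        p.append(new_strong)

print(soup)  # only the first <p> is collapsed into a single <strong>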
I have managed to solve my own problem.
for p in self.tree.xpath('//body/p'):
    if p.tail is None:  # some conditions specific to my doc
        children = p.getchildren()
        if len(children) > 1:
            for child in children:
                # if other stuff is present, break
                if child.tag != 'strong' or child.tail is not None:
                    break
            else:
                # If we didn't break, we found a p block to fix:
                # get rid of everything inside p and put a SubElement in.
                etree.strip_tags(p, 'strong')
                tmp_text = p.text_content()
                p.clear()
                subtext = etree.SubElement(p, "strong")
                subtext.text = tmp_text
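As a sanity check, here is a minimal, self-contained sketch of this loop run against the sample markup from the question. The document is built inline and the whitespace checks are loosened (the literal newlines between the <strong> tags show up as .tail text), so treat it as an illustration rather than the exact code above:

from lxml import etree, html

tree = html.fromstring("""<html><body>
<p>
<strong>This is </strong>
<strong>a lin</strong>
<strong>e which I want to </strong>
<strong>join.</strong>
</p>
<p>
2.
<strong>But do not </strong>
<strong>touch this</strong>
<em>Maybe some other tags as well.</em>
bla bla blah...
</p>
</body></html>""")

for p in tree.xpath('//body/p'):
    if p.text and p.text.strip():
        continue  # the block has text of its own (e.g. "2."), leave it alone
    children = p.getchildren()
    if len(children) > 1:
        for child in children:
            # anything that is not <strong>, or is followed by real text, disqualifies the block
            if child.tag != 'strong' or (child.tail and child.tail.strip()):
                break
        else:
            etree.strip_tags(p, 'strong')
            tmp_text = p.text_content().replace('\n', '')  # drop the newlines left between the old tags
            p.clear()
            strong = etree.SubElement(p, 'strong')
            strong.text = tmp_text

print(html.tostring(tree, pretty_print=True).decode())
# the first <p> becomes <p><strong>This is a line which I want to join.</strong></p>;
# the second <p> is left untouched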
Special thanks to @Scott for helping me arrive at this solution. Although I can't mark his answer as correct, I appreciate his guidance.
Alternatively, you can use a more specific XPath to get the target p elements directly:
p_target = """
    //p[strong]
       [not(*[not(self::strong)])]
       [not(text()[normalize-space()])]
"""
for p in self.tree.xpath(p_target):
    # the logic inside the loop can also be the same as your `else` block
    content = p.xpath("normalize-space()")
    p.clear()
    strong = etree.SubElement(p, "strong")
    strong.text = content
A brief explanation of the XPath:
//p[strong] : find p elements, anywhere in the XML/HTML document, that have a strong child element ...
[not(*[not(self::strong)])] : ... and have no child elements other than strong ...
[not(text()[normalize-space()])] : ... and have no non-empty text node children.
normalize-space() : returns all the text nodes of the current context element joined together, with runs of consecutive whitespace collapsed to a single space.
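To see the XPath in action, here is a small self-contained sketch on assumed sample markup. The <strong> elements are written back-to-back because normalize-space() would otherwise turn the newline inside "line" into a space:

from lxml import etree, html

tree = html.fromstring(
    "<html><body>"
    "<p><strong>This is </strong><strong>a lin</strong>"
    "<strong>e which I want to </strong><strong>join.</strong></p>"
    "<p>2. <strong>But do not </strong><strong>touch this</strong>"
    "<em>Maybe some other tags as well.</em> bla bla blah...</p>"
    "</body></html>"
)

p_target = """
    //p[strong]
       [not(*[not(self::strong)])]
       [not(text()[normalize-space()])]
"""
for p in tree.xpath(p_target):
    content = p.xpath("normalize-space()")  # the element's whole text, trimmed and collapsed
    p.clear()
    strong = etree.SubElement(p, "strong")
    strong.text = content

print(html.tostring(tree).decode())
# only the first <p> matches all three predicates:
# <p><strong>This is a line which I want to join.</strong></p>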