Change content only to parent element text, in BeautifulSoup

I have this code:

from bs4 import BeautifulSoup

txt = """<p>Hi <span>Mark</span>, how are you?, Don't forget meeting on <strong>sunday</strong>, ok?</p>"""
soup = BeautifulSoup(txt)
for ft in soup.find_all('p'):
    print(str(ft).upper())

When I run it, I get this:

<P>HI <SPAN>MARK</SPAN>, HOW ARE YOU?, DON'T FORGET MEETING ON <STRONG>SUNDAY</STRONG>, OK?</P>

But I want to get this:

<p>HI <span>Mark</span>, HOW ARE YOU?, DON'T FORGET MEETING ON <strong>sunday</strong>, OK?</p>

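I only want to change the text directly inside the p tag while leaving the other tags inside p, and their contents, untouched. I also want the tag names to stay lowercase.

Thanks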

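You can assign the modified text to the tag's string attribute, p.string. So loop over all the contents of the <p> tag, use the regular expression module to check whether each piece contains the tag symbols <>, and skip the ones that do. Something like: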

from bs4 import BeautifulSoup
import re

txt = """<p>Hi <span>Mark</span>, how are you?, Don't forget meeting on <strong>sunday</strong>, ok?</p>"""
soup = BeautifulSoup(txt)
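for p in soup.find_all('p'):
    # Upper-case each direct child of <p>, except children that render as a tag.
    p.string = ''.join(
        [str(t).upper()
            if not re.match(r'<[^>]+>', str(t))
            else str(t)
            for t in p.contents])

# formatter=None prints the "<" and ">" now stored in p.string without escaping them.
print(soup.prettify(formatter=None))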

I use the formatter option to avoid encoding of the HTML special characters. It yields:

<html>
 <body>
  <p>
   HI <span>Mark</span>, HOW ARE YOU?, DON'T FORGET MEETING ON <strong>sunday</strong>, OK?
  </p>
 </body>
</html>
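
One caveat with the approach above: after the assignment to p.string, the <span> and <strong> elements are no longer real tags in the tree, just text that happens to look like markup. If you need the tree to stay intact, a possible alternative (a minimal sketch, not the method used above) is to replace only the direct text children of each <p> in place. The helper name upper_direct_text is made up for this example, and "html.parser" is chosen here only so that no <html>/<body> wrapper is added.

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString

txt = """<p>Hi <span>Mark</span>, how are you?, Don't forget meeting on <strong>sunday</strong>, ok?</p>"""
# Assumption: the built-in "html.parser" is used, so the fragment is not
# wrapped in <html><body> the way it is in the output shown above.
soup = BeautifulSoup(txt, "html.parser")


def upper_direct_text(tag):
    """Upper-case only the text nodes that are direct children of `tag`.

    Hypothetical helper for illustration; nested tags are left untouched.
    """
    # Copy the children into a list first, because the tree is modified
    # while we walk over it.
    for child in list(tag.children):
        # Exact type check: skips Tag objects and also NavigableString
        # subclasses such as Comment.
        if type(child) is NavigableString:
            child.replace_with(NavigableString(child.upper()))


for p in soup.find_all("p"):
    upper_direct_text(p)

result = str(soup)
print(result)
```

Because the nested tags are never converted to text here, the default formatter can be left alone, and something like soup.p.span still returns a real tag afterwards.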