Change only the parent element's text in BeautifulSoup
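I have this code:
from bs4 import BeautifulSoup

txt = """<p>Hi <span>Mark</span>, how are you?, Don't forget meeting on <strong>sunday</strong>, ok?</p>"""
soup = BeautifulSoup(txt, "html.parser")
for ft in soup.find_all('p'):
    # Uppercases the whole serialized tag, markup included
    print(str(ft).upper())
When I run it, I get this: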
<P>HI <SPAN>MARK</SPAN>, HOW ARE YOU?, DON'T FORGET MEETING ON <STRONG>SUNDAY</STRONG>, OK?</P>
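But I want to get this:
<p>HI <span>Mark</span>, HOW ARE YOU?, DON'T FORGET MEETING ON <strong>sunday</strong>, OK?</p>
I only want to change the text that belongs directly to the p tag, while leaving the content of the tags nested inside p unchanged. I also want the tag names to stay lowercase.
Thanks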
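You can assign the modified text to the tag's string attribute (p.string). So loop over all the contents of the <p> tag, use the regular expression module to check whether each item is a tag (that is, whether it starts with < and ends with >), and leave those items unchanged. Something like: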
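from bs4 import BeautifulSoup
import re

txt = """<p>Hi <span>Mark</span>, how are you?, Don't forget meeting on <strong>sunday</strong>, ok?</p>"""
# Assumes lxml is installed; it is the parser that wraps the fragment in <html><body>
soup = BeautifulSoup(txt, "lxml")
for p in soup.find_all('p'):
    # Uppercase the plain-text children; keep the markup of child tags as-is
    p.string = ''.join(
        str(t) if re.match(r'<[^>]+>', str(t)) else str(t).upper()
        for t in p.contents)

# formatter=None turns off entity escaping in the output
print(soup.prettify(formatter=None))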
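I use the formatter option to avoid the encoding of HTML special characters. Assigning to p.string replaces the tag's children with a single text node, so the <span> and <strong> markup is now plain text and would otherwise be printed as &lt;span&gt; and so on. It yields: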
<html>
 <body>
  <p>
   HI <span>Mark</span>, HOW ARE YOU?, DON'T FORGET MEETING ON <strong>sunday</strong>, OK?
  </p>
 </body>
</html>
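If you need <span> and <strong> to remain real tags in the tree (so that later find_all calls still see them), a possible alternative is to replace only the text nodes that are direct children of <p>. This is a minimal sketch, not part of the answer above: the helper name upper_own_text is hypothetical, and it assumes the built-in html.parser, which does not add <html> or <body> wrappers.

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString

txt = """<p>Hi <span>Mark</span>, how are you?, Don't forget meeting on <strong>sunday</strong>, ok?</p>"""


def upper_own_text(tag):
    """Uppercase only the text nodes that are direct children of `tag`.

    Hypothetical helper for illustration; nested tags are left untouched.
    """
    # Iterate over a copy, because replace_with() mutates tag.contents.
    for child in list(tag.contents):
        # Exact type check so comments and other special strings are skipped.
        if type(child) is NavigableString:
            child.replace_with(NavigableString(child.upper()))


soup = BeautifulSoup(txt, "html.parser")
for p in soup.find_all("p"):
    upper_own_text(p)

result = str(soup)
print(result)
```

Because the child tags are never converted to text, the default formatter can be left on and soup.p.span still returns the <span> element.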