BS4 replace_with for replacing with new tag
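I need to find certain words in an HTML file and replace them with links. The result should be a file that, when displayed in a browser, lets you click the links as usual.
Beautiful Soup automatically escapes the tags. How can I avoid this behavior?
Minimal example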
#!/usr/bin/env python3
from bs4 import BeautifulSoup
import re
html = \
'''
Identify
'''
soup = BeautifulSoup(html,features="html.parser")
for txt in soup.findAll(text=True):
    if re.search('identi',txt,re.I) and txt.parent.name != 'a':
        newtext = re.sub('identify', '<a href="test.html"> test </a>', txt.lower())
        txt.replace_with(newtext)
print(soup)
Result:
&lt;a href="test.html"&gt; test &lt;/a&gt;
Expected result:
<a href="test.html"> test </a>
You can pass a new soup that contains the markup as the argument to .replace_with(), for example:
import re
from bs4 import BeautifulSoup
html = '''
Other Identify Other
'''
soup = BeautifulSoup(html,features="html.parser")
for txt in soup.findAll(text=True):
    if re.search('identi',txt,re.I) and txt.parent.name != 'a':
        new_txt = re.sub(r'identi[^\s]*', '<a href="test.html">test</a>', txt, flags=re.I)
        txt.replace_with(BeautifulSoup(new_txt, 'html.parser'))
print(soup)
Prints:
Other <a href="test.html">test</a> Other
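A variation on the same idea, in case you would rather not re-parse a string of markup: split the text node around each match and build the link with soup.new_tag(). This is only a sketch. The helper name link_matches, its parameters, and the <p> wrapper in the sample HTML are assumptions made for illustration and are not part of the question.

```python
import re
from bs4 import BeautifulSoup

def link_matches(soup, pattern, href, label):
    """Replace every regex match inside text nodes with an <a> tag (hypothetical helper)."""
    regex = re.compile(pattern, re.I)
    for node in soup.find_all(string=True):
        # Skip text that is already inside a link, and text without a match.
        if node.parent.name == 'a' or not regex.search(node):
            continue
        pieces = []
        pos = 0
        for m in regex.finditer(node):
            if m.start() > pos:
                pieces.append(node[pos:m.start()])  # plain text before the match
            a = soup.new_tag('a', href=href)        # a real Tag, so nothing gets escaped
            a.string = label
            pieces.append(a)
            pos = m.end()
        if pos < len(node):
            pieces.append(node[pos:])               # plain text after the last match
        # Put the pieces in front of the old text node, then drop the old node.
        for piece in pieces:
            node.insert_before(piece)
        node.extract()
    return soup

soup = BeautifulSoup('<p>Other Identify Other</p>', 'html.parser')
link_matches(soup, r'identi\S*', 'test.html', 'test')
result = str(soup)
print(result)
```

Because the href and the link text are set through the API here, any special characters in them are escaped on output instead of being parsed as markup.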
You can use w3lib's replace_entities() function to convert the HTML entities in a string back into the characters they represent.
Install: pip install w3lib
from bs4 import BeautifulSoup
import re
from w3lib.html import replace_entities
html = \
'''
Identify
'''
soup = BeautifulSoup(html,features="html.parser")
for txt in soup.findAll(text=True):
    if re.search('identi',txt,re.I) and txt.parent.name != 'a':
        newtext = re.sub('identify', r'<a href="test.html"> test </a>', txt.lower())
        txt.replace_with(newtext)
# str(soup) is needed because soup is a BeautifulSoup object, not a str
print(replace_entities(str(soup)))
# Output
# <a href="test.html"> test </a>
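One thing to keep in mind with this approach: unescaping the whole serialized document also unescapes entities that were in the page on purpose. The sketch below illustrates that with the standard library's html.unescape, used here as a stand-in for replace_entities so the example has no third-party dependency (an assumption; for the &lt; and &gt; entities shown, the two should behave the same way). The sample string is made up.

```python
from html import unescape  # standard-library stand-in for replace_entities

# Hypothetical serialized page: the first paragraph escapes "<b>" on purpose
# (it is meant to be read as text); the second holds the escaped link.
serialized = ('<p>Write &lt;b&gt; for bold.</p>'
              '<p>&lt;a href="test.html"&gt; test &lt;/a&gt;</p>')

restored = unescape(serialized)
print(restored)
```

The link comes back as real markup, but so does the <b> that was supposed to stay visible as text, so this route is probably safest when the page contains no other escaped characters.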