BS4 replace_with result is no longer in tree
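I need to replace multiple words in an HTML document. At the moment I do this by calling replace_with once per replacement. Calling replace_with twice on the same NavigableString raises a ValueError (see the example below) because the element being replaced is no longer in the tree.

Minimal example: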
    #!/usr/bin/env python3
    from bs4 import BeautifulSoup
    import re

    def test1():
        html = \
        '''
        Identify
        '''
        soup = BeautifulSoup(html,features="html.parser")
        for txt in soup.findAll(text=True):
            if re.search('identify',txt,re.I) and txt.parent.name != 'a':
                newtext = re.sub('identify', '<a href="test.html"> test </a>', txt.lower())
                txt.replace_with(BeautifulSoup(newtext, features="html.parser"))
                txt.replace_with(BeautifulSoup(newtext, features="html.parser"))
                # I called it twice here to make the code as small as possible.
                # Usually it would be a different newtext ..
                # which was created using the replaced txt looking for a different word to replace.
        return soup

    print(test1())
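Expected result:
The txt is == newstring

Result:
ValueError: Cannot replace one element with another when the element to be replaced is not part of the tree.

A simple solution would be to keep modifying the new string and replace everything once at the end, but I would like to understand the current behavior.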
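The first txt.replace_with(...) removes the NavigableString (stored here in the variable txt) from the document tree (doc). This effectively sets txt.parent to None.

The second txt.replace_with(...) looks at the parent attribute, finds None (because txt has already been removed from the tree), and raises the ValueError.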
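To see this directly, here is a minimal sketch that inspects txt.parent before and after the first call. The `<p>` wrapper and the variable names parent_before, parent_after and second_call_failed are only for illustration and are not part of the original code.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p>Identify</p>", features="html.parser")
txt = soup.p.string              # the NavigableString inside <p>

parent_before = txt.parent.name  # txt is still attached to the <p> tag
txt.replace_with("replaced")     # first call: txt is taken out of the tree
parent_after = txt.parent        # the detached string has no parent any more

try:
    txt.replace_with("again")    # second call on the same detached object
    second_call_failed = False
except ValueError:
    second_call_failed = True

print(parent_before, parent_after, second_call_failed)
print(soup)
```

The soup itself still holds the replacement from the first call. Only the old txt object is detached, so any further change has to target a node that is currently in the tree.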

As you said at the end of your question, one solution is to call .replace_with() only once:
    import re
    from bs4 import BeautifulSoup

    def test1():
        html = \
        '''
        word1 word2 word3 word4
        '''
        soup = BeautifulSoup(html,features="html.parser")
        for txt in soup.findAll(text=True):
            if re.search('word1', txt, flags=re.I) and txt.parent.name != 'a':
                newtext = re.sub('word1', '<a href="test.html"> test1 </a>', txt.lower())
                # ...some computations
                newtext = re.sub('word3', '<a href="test.html"> test2 </a>', newtext)
                # ...some more computations
                # and at the end, replace txt only once:
                txt.replace_with(BeautifulSoup(newtext, features="html.parser"))
        return soup

    print(test1())
Prints:
<a href="test.html"> test1 </a> word2 <a href="test.html"> test2 </a> word4
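If each word really does need its own pass, another option is to query the tree again before every pass, so that replace_with is always called on strings that are currently attached. This is only a sketch under assumptions: the function name link_words, the links mapping and the file names one.html and three.html are hypothetical. It also assumes the text contains no characters such as < or & that would be reinterpreted when the string is parsed again.

```python
import re
from bs4 import BeautifulSoup

def link_words(html, links):
    """Wrap each word from `links` (hypothetical word -> href mapping) in an <a> tag."""
    soup = BeautifulSoup(html, features="html.parser")
    for word, href in links.items():
        pattern = re.compile(re.escape(word), flags=re.I)
        # find_all returns a list built up front, so the tree can be modified
        # inside the loop; every pass starts from strings that are in the tree.
        for txt in soup.find_all(string=True):
            if txt.parent.name == "a" or not pattern.search(txt):
                continue
            newtext = pattern.sub(
                lambda m: '<a href="{}">{}</a>'.format(href, m.group(0)), txt
            )
            # Each string object is replaced at most once.
            txt.replace_with(BeautifulSoup(newtext, features="html.parser"))
    return soup

result = link_words(
    "<p>word1 word2 word3 word4</p>",
    {"word1": "one.html", "word3": "three.html"},
)
print(result)
```

This parses and walks the tree once per word, so building the full string and replacing once, as above, is likely cheaper when the replacements do not depend on each other.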