How do you replace specific characters in BeautifulSoup?
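I'm well aware of how to replace text in tags using bs4, but how would I actually change a specific character in, say, a p-tag into another character or string enclosed in a b-tag?
An example would be if I wanted to bold/highlight all the j's in a paragraph.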
You can use the find_all() function to get all <p> elements, and a regular expression to inject <b> elements around the letter you want, for example:
from bs4 import BeautifulSoup
import sys
import re

soup = BeautifulSoup(open(sys.argv[1]))

for p in soup.find_all('p'):
    # Wrap every "r" in literal <b>...</b> markup; \1 is the captured letter.
    p.string = re.sub(r'(r)', r'<b>\1</b>', p.string)

print(soup.prettify(formatter=None))
Note that I use formatter=None to avoid the conversion of HTML entities.
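To see what formatter=None changes, here is a minimal sketch on a one-paragraph document. It assumes the built-in html.parser (the script above does not name a parser) and uses str() and decode() instead of prettify() so the output stays on one line. The injected <b> is only text inside the string, not a real tag in the tree, which is why the default formatter escapes it.

```python
import re
from bs4 import BeautifulSoup

# "html.parser" is an assumption; the original script does not name a parser.
soup = BeautifulSoup("<p>bar</p>", "html.parser")
soup.p.string = re.sub(r"(r)", r"<b>\1</b>", soup.p.string)

escaped = str(soup)                # default formatter escapes < and >
raw = soup.decode(formatter=None)  # no entity substitution at all

print(escaped)         # <p>ba&lt;b&gt;r&lt;/b&gt;</p>
print(raw)             # <p>ba<b>r</b></p>
print(soup.find("b"))  # None: the tree holds no real <b> tag
```

If the tree has to contain real <b> tags, for example because it is processed further before being written out, the string approach is not enough and new tags have to be created instead.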
Using this test text:
<div>
<div class="post-text" itemprop="text">
<p>I'm well aware on how to replace texts in tags using bs4 but how would I actually change a specific character in, say a p-tag, into another character or string enclosed in a b-tag?</p>
<p>An example would be if I wanted to bold/highlight all the j's in a paragraph.</p>
</div>
</div>
Run it like:
python script.py infile
This yields:
<html>
 <body>
  <div>
   <div class="post-text" itemprop="text">
    <p>
     I'm well awa<b>r</b>e on how to <b>r</b>eplace texts in tags using bs4 but how would I actually change a specific cha<b>r</b>acte<b>r</b> in, say a p-tag, into anothe<b>r</b> cha<b>r</b>acte<b>r</b> o<b>r</b> st<b>r</b>ing enclosed in a b-tag?
    </p>
    <p>
     An example would be if I wanted to bold/highlight all the j's in a pa<b>r</b>ag<b>r</b>aph.
    </p>
   </div>
  </div>
 </body>
</html>
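If you want to insert a tag into the text, you have to split the whole text into 3 parts: everything before, the text that goes into the tag, and everything after.
You have to do this each time you find a match in the text, so you also need to keep track of the tail part after each insertion: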
def inject_tag(text, start, end, tagname, **attrs):
    # find the document root
    root = text
    while root.parent:
        root = root.parent
    before = root.new_string(text[:start])
    new_tag = root.new_tag(tagname, **attrs)
    new_tag.string = text[start:end]
    after = root.new_string(text[end:])
    text.replace_with(before)
    before.insert_after(new_tag)
    new_tag.insert_after(after)
    return after
Then use the above function to replace specific indices:
>>> import re
>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup('''\
... <p>The quick brown fox jumps over the lazy dog</p>
... ''')
>>> the = re.compile(r'the', flags=re.I)
>>> text = soup.p.string
>>> while True:
...     match = the.search(str(text))
...     if not match: break
...     start, stop = match.span()
...     text = inject_tag(text, start, stop, 'b')
...
>>> print(soup.prettify())
<html>
 <head>
 </head>
 <body>
  <p>
   <b>
    The
   </b>
   quick brown fox jumps over
   <b>
    the
   </b>
   lazy dog
  </p>
 </body>
</html>
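The same before/match/after split can be applied to every text node under an element, which covers the original "bold all the j's" case even when the paragraph already contains other tags. The following is only a sketch under a few assumptions: wrap_matches is a hypothetical helper name, the document is parsed with the built-in html.parser, and find_all(string=True) requires a reasonably recent bs4 (older releases spell the argument text=True).

```python
import re
from bs4 import BeautifulSoup, NavigableString


def wrap_matches(soup, container, pattern, tagname="b"):
    """Wrap each regex match in container's text nodes in a new tag.

    Hypothetical helper, not part of bs4.
    """
    # Snapshot the text nodes first: the tree is modified inside the loop.
    for node in list(container.find_all(string=True)):
        text = str(node)
        pieces = []
        pos = 0
        for match in pattern.finditer(text):
            if match.start() == match.end():
                continue  # ignore empty matches
            if match.start() > pos:
                # text before the match
                pieces.append(NavigableString(text[pos:match.start()]))
            tag = soup.new_tag(tagname)
            tag.string = match.group(0)  # the text that goes into the tag
            pieces.append(tag)
            pos = match.end()
        if not pieces:
            continue  # nothing matched in this text node
        if pos < len(text):
            pieces.append(NavigableString(text[pos:]))  # text after the last match
        # Put the pieces in front of the old node in order, then drop the old node.
        for piece in pieces:
            node.insert_before(piece)
        node.extract()


soup = BeautifulSoup("<p>A jolly <i>jumping</i> jaguar</p>", "html.parser")
wrap_matches(soup, soup.p, re.compile(r"j"))
result = str(soup)
print(result)  # <p>A <b>j</b>olly <i><b>j</b>umping</i> <b>j</b>aguar</p>
```

Because real tags are created here, the default formatter can be used for output and nothing has to be left unescaped.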