Remove empty spaces inside <p> tags using BeautifulSoup
My string html contains some paragraphs that look like this:
<p>
Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua.
</p>
I want to remove the whitespace inside the p tag and turn it into:
<p>Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua</p>
Note that p tags like this one should remain unchanged:
<p class="has-media media-640"><img alt="Lorem ipsum dolor sit amet" height="357" src="http://www.example.com/img/lorem.jpg" width="636"/></p>
What I want is something like:
for p in soup.findAll('p'):
    replace p.string with trimmed text
You can replace the text with the element.string.replace_with() method:
for p in soup.find_all('p'):
    if p.string:
        p.string.replace_with(p.string.strip())
Demo:
>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup('''\
... <p>
... Text with whitespace
... </p>
... <p>No whitespace</p>
... <p><span><img /></span></p>
... ''')
>>> for p in soup.find_all('p'):
...     if p.string:
...         p.string.replace_with(p.string.strip())
...
u'\n Text with whitespace\n'
u'No whitespace'
>>> print str(soup)
<html><head></head><body><p>Text with whitespace</p>
<p>No whitespace</p>
<p><span><img/></span></p>
</body></html>
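The session above is a Python 2 transcript. As a rough Python 3 sketch of the same idea, the loop can be wrapped in a small function. The function name strip_paragraph_text and the sample markup are illustrative assumptions, not part of the original answer, and the parser is set explicitly to "html.parser" so no <html>/<body> wrapper is added:

```python
from bs4 import BeautifulSoup


def strip_paragraph_text(html):
    """Strip whitespace in <p> tags that hold a single string (hypothetical helper)."""
    # Assumption: the stdlib-backed "html.parser" is good enough for this markup.
    soup = BeautifulSoup(html, "html.parser")
    for p in soup.find_all("p"):
        # p.string is None for paragraphs such as <p><img/></p>, so those are skipped.
        if p.string:
            p.string.replace_with(p.string.strip())
    return str(soup)


html = (
    "<p>\n  Text with whitespace\n</p>"
    '<p class="has-media"><img alt="x" src="a.jpg"/></p>'
)
result = strip_paragraph_text(html)
print(result)
```

The image-only paragraph passes through untouched because it contains no string to strip.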
This only strips whitespace contained directly in the tag. If the <p> mixes text with other tags, p.string is None and nothing is stripped.
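A small check makes this limitation concrete. The snippet below is only a sketch (the markup and variable names are made up for illustration); it shows that .string is None for a paragraph with mixed content, while .strings still yields each text piece:

```python
from bs4 import BeautifulSoup

# Illustrative markup: text, a nested <span>, then more text.
soup = BeautifulSoup("<p>\n A nested <span>tag</span> here\n</p>", "html.parser")
p = soup.p

nested_string = p.string   # None, because <p> has more than one child
pieces = list(p.strings)   # every string inside <p>, including the one in <span>

print(nested_string)
print(pieces)
```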
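You can use the element.strings sequence to process <p> tags with nested tags. I would not strip all the whitespace; instead, leave one space around each string wherever whitespace was present: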
whitespace = u' \t\n\r\x0a'  # extend as needed
for p in soup.find_all('p'):
    for string in list(p.strings):  # copy so we can replace some
        left = string[:1] in whitespace
        right = string[-1:] in whitespace
        if not left and not right:
            continue  # leave be
        new = string
        if left:
            new = ' ' + new.lstrip()
        if right:
            new = new.rstrip() + ' '
        string.replace_with(new)
Demo:
>>> soup = BeautifulSoup('''\
... <p>
... Text with whitespace
... </p>
... <p>No whitespace</p>
... <p>
... A nested
... <span>tag</span>
... is not a problem
... </p>
... ''')
>>> whitespace = u' \t\n\r\x0a' # extend as needed
>>> for p in soup.find_all('p'):
...     for string in list(p.strings):  # copy so we can replace some
...         left = string[:1] in whitespace
...         right = string[-1:] in whitespace
...         if not left and not right:
...             continue  # leave be
...         new = string
...         if left:
...             new = ' ' + new.lstrip()
...         if right:
...             new = new.rstrip() + ' '
...         string.replace_with(new)
...
u'\n Text with whitespace\n'
u'\n A nested \n '
u'\n is not a problem\n'
>>> print str(soup)
<html><head></head><body><p> Text with whitespace </p>
<p>No whitespace</p>
<p> A nested <span>tag</span> is not a problem </p>
</body></html>
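For Python 3, the same approach could be packaged roughly as follows. This is a sketch under assumptions rather than the answer's own code: the function name collapse_edge_whitespace is hypothetical, "html.parser" is chosen explicitly, and str.isspace() stands in for the hand-written whitespace string, which changes which characters count as whitespace.

```python
from bs4 import BeautifulSoup


def collapse_edge_whitespace(html):
    """Reduce leading/trailing whitespace of each string in <p> to one space."""
    soup = BeautifulSoup(html, "html.parser")
    for p in soup.find_all("p"):
        for string in list(p.strings):  # copy, since strings are replaced in the loop
            text = str(string)
            if not text:
                continue
            # Assumption: str.isspace() replaces the explicit whitespace set.
            left = text[0].isspace()
            right = text[-1].isspace()
            if not left and not right:
                continue  # nothing to normalize
            new = text
            if left:
                new = " " + new.lstrip()
            if right:
                new = new.rstrip() + " "
            string.replace_with(new)
    return str(soup)


html = "<p>\n  A nested\n  <span>tag</span>\n  is not a problem\n</p>"
result = collapse_edge_whitespace(html)
print(result)
```

Keeping a single space at each boundary preserves the word separation around inline tags such as <span>, which a plain strip() would remove.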