Remove whitespace inside <p> tags using BeautifulSoup

Question

My HTML string contains some paragraphs that look like this:

<p>
    Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua.
</p>

I want to remove the whitespace inside the p tag and turn it into:

<p>Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua</p>

Note that p tags like this one should remain unchanged:

<p class="has-media media-640"><img alt="Lorem ipsum dolor sit amet" height="357" src="http://www.example.com/img/lorem.jpg" width="636"/></p>

What I want is something like:

for p in soup.findAll('p'):
    replace p.string with trimmed text

Answer 1

You can replace the text with the element.string.replace_with() method:

for p in soup.find_all('p'):
    if p.string:
        p.string.replace_with(p.string.strip())

Demo:

>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup('''\
... <p>
...     Text with whitespace
... </p>
... <p>No whitespace</p>
... <p><span><img /></span></p>
... ''')
>>> for p in soup.find_all('p'):
...     if p.string:
...         p.string.replace_with(p.string.strip())
... 
u'\n    Text with whitespace\n'
u'No whitespace'
>>> print str(soup)
<html><head></head><body><p>Text with whitespace</p>
<p>No whitespace</p>
<p><span><img/></span></p>
</body></html>

This only strips whitespace that is directly contained in the tag. If the tag contains other tags, no stripping takes place.
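The reason nested tags are skipped is that .string is None whenever a tag has more than one child, so the `if p.string:` guard filters those paragraphs out. Here is a minimal sketch of that behaviour, assuming Python 3 and the built-in "html.parser" backend (the sample markup and variable names are made up for illustration):

```python
from bs4 import BeautifulSoup

# Illustrative markup: one text-only paragraph, one with a nested tag.
html = "<p>\n    plain text\n</p><p>\n    text with <b>bold</b> child\n</p>"
soup = BeautifulSoup(html, "html.parser")

plain, mixed = soup.find_all("p")

# A tag whose only child is a string exposes that string as .string.
print(repr(plain.string))
# A tag with several children has .string set to None.
print(mixed.string)

for p in soup.find_all("p"):
    if p.string:  # skips the paragraph with the nested <b>
        p.string.replace_with(p.string.strip())

result = str(soup)
print(result)
```

Only the first paragraph is trimmed; the second one keeps its original whitespace because the loop never touches it.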

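You can use the element.strings sequence to handle <p> tags that contain nested tags. I would not strip all the whitespace; instead, leave a single space around each string wherever whitespace was present: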

whitespace = u' \t\n\r\x0a'  # extend as needed

for p in soup.find_all('p'):
    for string in list(p.strings):  # copy so we can replace some
        left = string[:1] in whitespace
        right = string[-1:] in whitespace
        if not left and not right:
            continue  # leave be
        new = string
        if left:
            new = ' ' + new.lstrip()
        if right:
            new = new.rstrip() + ' '
        string.replace_with(new)

Demo:

>>> soup = BeautifulSoup('''\
... <p>
...     Text with whitespace
... </p>
... <p>No whitespace</p>
... <p>
...     A nested 
...     <span>tag</span>
...     is not a problem
... </p>
... ''')
>>> whitespace = u' \t\n\r\x0a'  # extend as needed
>>> for p in soup.find_all('p'):
...     for string in list(p.strings):  # copy so we can replace some
...         left = string[:1] in whitespace
...         right = string[-1:] in whitespace
...         if not left and not right:
...             continue  # leave be
...         new = string
...         if left:
...             new = ' ' + new.lstrip()
...         if right:
...             new = new.rstrip() + ' '
...         string.replace_with(new)
... 
u'\n    Text with whitespace\n'
u'\n    A nested \n    '
u'\n    is not a problem\n'
>>> print str(soup)
<html><head></head><body><p> Text with whitespace </p>
<p>No whitespace</p>
<p> A nested <span>tag</span> is not a problem </p>
</body></html>
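For readers on Python 3, the same idea can be wrapped in a small function. This is only a sketch under a few assumptions: the helper name `squeeze_edge_whitespace` is hypothetical, the parser is the built-in "html.parser", and `str.isspace()` is swapped in for the explicit whitespace list above (it also treats empty strings as "no whitespace"). The sample markup is illustrative.

```python
from bs4 import BeautifulSoup


def squeeze_edge_whitespace(soup):
    """Reduce leading/trailing whitespace of every string in <p> tags to one space."""
    for p in soup.find_all("p"):
        for string in list(p.strings):  # copy, because nodes are replaced in the loop
            left = string[:1].isspace()
            right = string[-1:].isspace()
            if not left and not right:
                continue  # nothing to squeeze on either edge
            new = str(string)
            if left:
                new = " " + new.lstrip()
            if right:
                new = new.rstrip() + " "
            string.replace_with(new)


# Illustrative markup: a paragraph with a nested tag and a media-only paragraph.
html = (
    "<p>\n    A nested\n    <span>tag</span>\n    is not a problem\n</p>"
    '<p class="has-media"><img alt="x" src="lorem.jpg"/></p>'
)
soup = BeautifulSoup(html, "html.parser")
text_p, media_p = soup.find_all("p")
media_before = str(media_p)

squeeze_edge_whitespace(soup)

print(str(text_p))
print(str(media_p) == media_before)  # the media paragraph has no strings to change
```

A paragraph that contains only an image has no strings at all, so it passes through untouched, which matches the requirement in the question.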
