How to remove redundant spaces in BeautifulSoup output
I am trying to scrape a website using BeautifulSoup. I am working with the following HTML:
html = """
<div id="article-body" itemprop="articleBody">
<p>
<span class="quote down bgQuote" data-channel="/quotes/zigman/170324/composite" data-bgformat="">
<a class="qt-chip trackable" data-fancyid="XNYSStockSLB" href="/investing/stock/slb?mod=MW_story_quote" data-track-mod="MW_story_quote">
SLB,
<span class="bgPercentChange">-3.04%</span>
</a>
</span>
reported late Thursday
<a href="/story/schlumberger-profit-falls-sharply-2016-10-20-174854654" class="icon none">higher third-quarter profit that beat targets and sales only slightly below estimates</a>
. Schlumberger’s results came a day after rival Halliburton Co.
<span class="quote down bgQuote" data-channel="/quotes/zigman/228631/composite" data-bgformat="">
<a class="qt-chip trackable" data-fancyid="XNYSStockHAL" href="/investing/stock/hal?mod=MW_story_quote" data-track-mod="MW_story_quote">
HAL,
<span class="bgPercentChange">-0.66%</span>
</a> """
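I want to get plain text without any redundant spaces. I followed Twig's answer, but SLB and -3.04%, as well as HAL and -0.66%, still end up on different lines. My desired output looks like this: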
SLB, -3.04% reported late Thursday higher third-quarter profit that beat targets and sales only slightly below estimates. Schlumberger’s results came a day after rival Halliburton Co. HAL, -0.66% also posted higher-than-expected profit.
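Here is my code:
import re
from bs4 import BeautifulSoup

newsText = BeautifulSoup(html, 'lxml')
text = list(newsText.stripped_strings)
# Joining with "\n\n" puts every text fragment on its own line
finalText = "\n\n".join(text) if text else ""
# Note: the return value of re.sub is discarded here, so finalText is unchanged
re.sub(r'[ \n]{2,}', '', finalText)
print(finalText)
Thanks a lot.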
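Use get_text() with strip=True and a single space as the separator:
from bs4 import BeautifulSoup

# The 'lxml' parser requires the lxml package to be installed
soup = BeautifulSoup(html, 'lxml')
# strip=True trims each text fragment; separator=' ' joins them with one space
text = soup.get_text(strip=True, separator=' ')
print(text)
Output: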
SLB, -3.04% reported late Thursday higher third-quarter profit that beat targets and sales only slightly below estimates . Schlumberger’s results came a day after rival Halliburton Co. HAL, -0.66%
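The output above still contains a space before the period ("estimates ."), because the separator is inserted between every pair of text fragments. Below is a minimal sketch of one possible way to tidy that up with a regular expression after calling get_text(). The helper name clean_text and the shortened sample markup are made up for illustration, and the built-in 'html.parser' is used only so the example does not depend on lxml.

```python
import re
from bs4 import BeautifulSoup


def clean_text(html):
    """Return the text of `html` with whitespace normalized (hypothetical helper)."""
    soup = BeautifulSoup(html, "html.parser")
    # Strip each text fragment and join the fragments with a single space
    text = soup.get_text(separator=" ", strip=True)
    # Collapse any remaining runs of whitespace inside a fragment
    text = re.sub(r"\s+", " ", text)
    # Drop the space that the separator leaves before punctuation
    text = re.sub(r"\s+([.,;:!?])", r"\1", text)
    return text


# Shortened, made-up sample with the same shape as the HTML in the question
sample = (
    "<p><a>SLB, <span>-3.04%</span></a> reported "
    "<a>higher profit</a>\n. Next   sentence</p>"
)
print(clean_text(sample))
```

The punctuation rule is deliberately simple. It would also remove a space that was intended before a punctuation mark, so it may need adjusting for other pages.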