Preserve multi-line addresses separated with `<br/>`
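How do I remove the extra blank lines between address lines? I am scraping a web page with BeautifulSoup.
I know that `<br/>` produces a new line. However, if I replace it with a space or call strip(), the address lines collapse into a single line. How can I keep the address on separate lines, as in the expected output below?
Input from the HTML: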
<span class="c2">1233/B, LAC II, St. 37/B, Mehmoodabad # 6, (Behind United Bakery),<br />Karachi - 75640<br />Pakistan</span><br />
My code is as follows:
if not (item.find('span', class_ = 'c2') is None):
    address = item.find_all('span', class_ = 'c2')
    for a in item.find_all('span', {"class":"c2"}):
        for addr in address:
            print('Before',addr)
            if addr.find_all("br"):
                for a in addr:
                    print('a',a)
                    if '<br/>' in a:
                        print('a loop',a)
My output for the span with class c2 is as follows:
<span class="c2">1233/B, LAC II, St. 37/B, Mehmoodabad # 6, (Behind United Bakery),<br />Karachi - 75640<br />Pakistan</span><br />
The test output from the loop over the span is as follows:
Before <span class="c2">1233/B, LAC II, St. 37/B, Mehmoodabad # 6, (Behind United Bakery),<br/>Karachi - 75640<br/>Pakistan</span>
a 1233/B, LAC II, St. 37/B, Mehmoodabad # 6, (Behind United Bakery),
a <br/>
a Karachi - 75640
a <br/>
a Pakistan
This gives me output that I currently do not want:
1233/B, LAC II, St. 37/B, Mehmoodabad # 6, (Behind United Bakery),
Karachi - 75640
Pakistan
Expected output:
1233/B, LAC II, St. 37/B, Mehmoodabad # 6, (Behind United Bakery),
Karachi - 75640
Pakistan
You can use the replace_with() method of the tag object:
from bs4 import BeautifulSoup
data = '''<span class="c2">1233/B, LAC II, St. 37/B, Mehmoodabad # 6, (Behind United Bakery),<br />Karachi - 75640<br />Pakistan</span><br />'''
soup = BeautifulSoup(data, 'lxml')
# Replace every <br> tag with a newline character
for br in soup.select('br'):
    br.replace_with('\n')
# strip() removes the trailing newline left by the final <br>
print(soup.text.strip())
Prints:
1233/B, LAC II, St. 37/B, Mehmoodabad # 6, (Behind United Bakery),
Karachi - 75640
Pakistan
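If several address spans need the same treatment, the replace_with() idea can be wrapped in a small helper. This is only a sketch under a few assumptions: the function name `address_lines` and its `css_class` parameter are made up for illustration, and it uses the built-in `html.parser` instead of `lxml` so that it runs without the extra dependency.

```python
from bs4 import BeautifulSoup


def address_lines(html, css_class="c2"):
    """Return one list of address lines per matching span (hypothetical helper)."""
    soup = BeautifulSoup(html, "html.parser")
    results = []
    for span in soup.find_all("span", class_=css_class):
        # Turn each <br> inside the span into a newline character.
        for br in span.find_all("br"):
            br.replace_with("\n")
        # Split on newlines and drop empty or whitespace-only pieces.
        lines = [line.strip() for line in span.get_text().split("\n") if line.strip()]
        results.append(lines)
    return results


data = '''<span class="c2">1233/B, LAC II, St. 37/B, Mehmoodabad # 6, (Behind United Bakery),<br />Karachi - 75640<br />Pakistan</span><br />'''
for lines in address_lines(data):
    print("\n".join(lines))
```

Returning a list of lines rather than a single string leaves it up to the caller whether to join them with newlines, commas, or something else.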
You can use stripped_strings and join them:
from bs4 import BeautifulSoup as bs
html = '''
<span class="c2">1233/B, LAC II, St. 37/B, Mehmoodabad # 6, (Behind United Bakery),<br />Karachi - 75640<br />Pakistan</span><br />
'''
soup = bs(html, 'lxml')
for item in soup.select('.c2'):
    # stripped_strings yields each text piece with surrounding whitespace removed
    strings = '\n'.join([string for string in item.stripped_strings])
    print(strings)
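A closely related variant, offered here as a sketch rather than as part of the answer above, is to let `get_text()` do the joining: its `separator` argument is placed between the text pieces, and `strip=True` trims each piece and skips empty ones. The example assumes the built-in `html.parser` so that `lxml` is not required.

```python
from bs4 import BeautifulSoup

html = '''
<span class="c2">1233/B, LAC II, St. 37/B, Mehmoodabad # 6, (Behind United Bakery),<br />Karachi - 75640<br />Pakistan</span><br />
'''
soup = BeautifulSoup(html, "html.parser")

for item in soup.select(".c2"):
    # Join the span's text pieces with newlines, trimming whitespace around each piece.
    address = item.get_text(separator="\n", strip=True)
    print(address)
```

Neither this nor the stripped_strings version modifies the parsed tree, which may matter if the `<br>` tags are still needed later.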