使用 beautifulsoup 访问未标记的文本
Accessing untagged text using beautifulsoup
我正在使用 python 和 beautifulsoup4 来提取一些地址信息。
更具体地说,我在检索非美国邮政编码时需要帮助。
考虑一家美国公司的以下 html 数据:(已经是汤对象)
<div class="compContent curvedBottom" id="companyDescription">
<div class="vcard clearfix">
<p id="adr">
<span class="street-address">999 State St Ste 100</span><br/>
<span class="locality">Salt Lake City,</span>
<span class="region">UT</span>
<span class="zip">84114-0002,</span>
<br/><span class="country-name">United States</span>
</p>
<p>
<span class="tel">
<strong class="type">Phone: </strong>+1-000-000-000
</span><br/>
</p>
<p class="companyURL"><a class="url ext" href="http://www.website.com" target="_blank">http://www.website.com</a></p>
</div>
</ul>
</div>
我可以使用以下 python 代码提取邮政编码 (84114-0002):
class CompanyDescription:
def __init__(self, page):
self.data = page.find('div', attrs={'id': 'companyDescription'})
def address(self):
#TODO: Also retrieve the Zipcode for UK and German based addresses - tricky!
address = {'street-address': '', 'locality': '', 'region': '', 'zip': '', 'country-name': ''}
for key in address:
try:
adr = self.data.find('p', attrs={'id': 'adr'})
if adr.find('span', attrs={'class': key}) is None:
address[key] = ''
else:
address[key] = adr.find('span', attrs={'class': key}).text.split(',')[0]
# Attempting to grab another zip code value
if address['zip'] == '':
pass
except:
# We should return a dictionary with "" as key adr
return address
return address
你可以看到我需要一些建议 if address['zip'] == '':
这两个汤对象示例给我带来了麻烦。在下面我想检索 EC4N 4SA
<div class="compContent curvedBottom" id="companyDescription">
<div class="vcard clearfix">
<p id="adr">
<span class="street-address">Albert Buildings</span><br/>
<span class="extended-address">00 Queen Victoria Street</span>
<span class="locality">London</span>
EC4N 4SA
<span class="region">London</span>
<br/><span class="country-name">England</span>
</p>
<p>
</p>
<p class="companyURL"><a class="url ext" href="http://www.website.com.com" target="_blank">http://www.website.com.com</a></p>
</div>
<p><strong>Line of Business</strong> <br/>Management services, nsk</p>
</div>
以及下面,我有兴趣获得 71364
<div class="compContent curvedBottom" id="companyDescription">
<div class="vcard clearfix">
<p id="adr">
<span class="street-address">Alfred-Kärcher-Str. 100</span><br/>
71364
<span class="locality">Winnenden</span>
<span class="region">Baden-Württemberg</span>
<br/><span class="country-name">Germany</span>
</p>
<p>
<span class="tel">
<strong class="type">Phone: </strong>+00-1234567
</span><br/>
<span class="tel"><strong class="type">Fax: </strong>+00-1234567</span>
</p>
</div>
</div>
现在,我 运行 这个程序超过了大约 68,000 个帐户,其中 28,000 个不是美国帐户。我只举出两个例子,我知道目前的方法不是防弹的。可能存在其他地址格式,此脚本无法按预期工作,但我相信弄清楚基于英国和德国的帐户会有很大帮助。
提前致谢
因为里面只有没有标签的文本 <p>
所以你可以使用
find_all(text=True, recursive=False)
只获取文本(没有标签)而不是来自嵌套标签(<span>
)。这给出了包含您的文本和一些 \n
和空格的列表,因此您可以使用 join()
创建一个字符串,并使用 strip()
删除所有 \n
和空格。
data = '''<p id="adr">
<span class="street-address">Albert Buildings</span><br/>
<span class="extended-address">00 Queen Victoria Street</span>
<span class="locality">London</span>
EC4N 4SA
<span class="region">London</span>
<br/><span class="country-name">England</span>
</p>'''
from bs4 import BeautifulSoup as BS
soup = BS(data, 'html.parser').find('p')
print(''.join(soup.find_all(text=True, recursive=False)).strip())
结果:EC4N 4SA
同第二个HTML
data = '''<p id="adr">
<span class="street-address">Alfred-Kärcher-Str. 100</span><br/>
71364
<span class="locality">Winnenden</span>
<span class="region">Baden-Württemberg</span>
<br/><span class="country-name">Germany</span>
</p>'''
from bs4 import BeautifulSoup as BS
soup = BS(data, 'html.parser').find('p')
print(''.join(soup.find_all(text=True, recursive=False)).strip())
结果:71364
我正在使用 python 和 beautifulsoup4 来提取一些地址信息。 更具体地说,我在检索非美国邮政编码时需要帮助。
考虑一家美国公司的以下 html 数据:(已经是汤对象)
<div class="compContent curvedBottom" id="companyDescription">
<div class="vcard clearfix">
<p id="adr">
<span class="street-address">999 State St Ste 100</span><br/>
<span class="locality">Salt Lake City,</span>
<span class="region">UT</span>
<span class="zip">84114-0002,</span>
<br/><span class="country-name">United States</span>
</p>
<p>
<span class="tel">
<strong class="type">Phone: </strong>+1-000-000-000
</span><br/>
</p>
<p class="companyURL"><a class="url ext" href="http://www.website.com" target="_blank">http://www.website.com</a></p>
</div>
</ul>
</div>
我可以使用以下 python 代码提取邮政编码 (84114-0002):
class CompanyDescription:
def __init__(self, page):
self.data = page.find('div', attrs={'id': 'companyDescription'})
def address(self):
#TODO: Also retrieve the Zipcode for UK and German based addresses - tricky!
address = {'street-address': '', 'locality': '', 'region': '', 'zip': '', 'country-name': ''}
for key in address:
try:
adr = self.data.find('p', attrs={'id': 'adr'})
if adr.find('span', attrs={'class': key}) is None:
address[key] = ''
else:
address[key] = adr.find('span', attrs={'class': key}).text.split(',')[0]
# Attempting to grab another zip code value
if address['zip'] == '':
pass
except:
# We should return a dictionary with "" as key adr
return address
return address
你可以看到我需要一些建议 if address['zip'] == '':
这两个汤对象示例给我带来了麻烦。在下面我想检索 EC4N 4SA
<div class="compContent curvedBottom" id="companyDescription">
<div class="vcard clearfix">
<p id="adr">
<span class="street-address">Albert Buildings</span><br/>
<span class="extended-address">00 Queen Victoria Street</span>
<span class="locality">London</span>
EC4N 4SA
<span class="region">London</span>
<br/><span class="country-name">England</span>
</p>
<p>
</p>
<p class="companyURL"><a class="url ext" href="http://www.website.com.com" target="_blank">http://www.website.com.com</a></p>
</div>
<p><strong>Line of Business</strong> <br/>Management services, nsk</p>
</div>
以及下面,我有兴趣获得 71364
<div class="compContent curvedBottom" id="companyDescription">
<div class="vcard clearfix">
<p id="adr">
<span class="street-address">Alfred-Kärcher-Str. 100</span><br/>
71364
<span class="locality">Winnenden</span>
<span class="region">Baden-Württemberg</span>
<br/><span class="country-name">Germany</span>
</p>
<p>
<span class="tel">
<strong class="type">Phone: </strong>+00-1234567
</span><br/>
<span class="tel"><strong class="type">Fax: </strong>+00-1234567</span>
</p>
</div>
</div>
现在,我 运行 这个程序超过了大约 68,000 个帐户,其中 28,000 个不是美国帐户。我只举出两个例子,我知道目前的方法不是防弹的。可能存在其他地址格式,此脚本无法按预期工作,但我相信弄清楚基于英国和德国的帐户会有很大帮助。
提前致谢
因为里面只有没有标签的文本 <p>
所以你可以使用
find_all(text=True, recursive=False)
只获取文本(没有标签)而不是来自嵌套标签(<span>
)。这给出了包含您的文本和一些 \n
和空格的列表,因此您可以使用 join()
创建一个字符串,并使用 strip()
删除所有 \n
和空格。
data = '''<p id="adr">
<span class="street-address">Albert Buildings</span><br/>
<span class="extended-address">00 Queen Victoria Street</span>
<span class="locality">London</span>
EC4N 4SA
<span class="region">London</span>
<br/><span class="country-name">England</span>
</p>'''
from bs4 import BeautifulSoup as BS
soup = BS(data, 'html.parser').find('p')
print(''.join(soup.find_all(text=True, recursive=False)).strip())
结果:EC4N 4SA
同第二个HTML
data = '''<p id="adr">
<span class="street-address">Alfred-Kärcher-Str. 100</span><br/>
71364
<span class="locality">Winnenden</span>
<span class="region">Baden-Württemberg</span>
<br/><span class="country-name">Germany</span>
</p>'''
from bs4 import BeautifulSoup as BS
soup = BS(data, 'html.parser').find('p')
print(''.join(soup.find_all(text=True, recursive=False)).strip())
结果:71364