How to scrape an address (comma-separated text) using BeautifulSoup in Python

Question

I am trying to scrape the address from the link below:

https://www.yelp.com/biz/rollin-phatties-houston

But I only get the first part of the address (i.e., 1731 Westheimer Rd). The full address is comma-separated:

1731 Westheimer Rd, Houston, TX 77098

Can anyone help me solve this? Please find my code below:

import bs4 as bs
import urllib.request as url

source = url.urlopen('https://www.yelp.com/biz/rollin-phatties-houston')
soup = bs.BeautifulSoup(source, 'html.parser')

mains = soup.find_all("div", {"class": "secondaryAttributes__09f24__3db5x arrange-unit__09f24__1gZC1 border-color--default__09f24__R1nRO"})
main = mains[0] #First item of mains

address = []
for main in mains:
    try:       
        address.append(main.address.find("p").text)
    except:
        address.append("")

print(address)
# 1731 Westheimer Rd

Answer 1

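There is no need to inspect the page elements to find the address. The data is already delivered to the page inside a JavaScript script tag, and you can read it with the following code: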

import chompjs
import bs4 as bs
import urllib.request as url

source = url.urlopen('https://www.yelp.com/biz/rollin-phatties-houston')
soup = bs.BeautifulSoup(source, 'html.parser')

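# The script index (16) depends on the page layout and may change
javascript = soup.select("script")[16].string
# chompjs converts the JavaScript object literal into a Python dict
data = chompjs.parse_js_object(javascript)
print(data['bizDetailsPageProps']['bizContactInfoProps']['businessAddress'])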
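
Because the position of the script tag can shift when the page changes, one option is to pick the script by its content instead of by index. The following is only a minimal sketch: `SAMPLE_HTML` is a hypothetical, simplified stand-in for the real page, and the helper name `find_script_containing` is an assumption made for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified stand-in for the real page; the actual markup may differ.
SAMPLE_HTML = """
<html><body>
<script>var unrelated = 1;</script>
<script>{"bizDetailsPageProps": {"bizContactInfoProps": {"businessAddress": "1731 Westheimer Rd, Houston, TX 77098"}}}</script>
</body></html>
"""


def find_script_containing(soup, marker):
    """Return the text of the first <script> whose content mentions marker, or None."""
    for script in soup.find_all("script"):
        text = script.string or ""  # .string is None for empty script tags
        if marker in text:
            return text
    return None


soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
javascript = find_script_containing(soup, "businessAddress")
print(javascript is not None)
```

The returned string could then be passed to `chompjs.parse_js_object` as in the answer above, assuming the matching script really holds the object you are after.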

Answer 2

import requests
import re
from ast import literal_eval


def main(url):
    r = requests.get(url)
    match = literal_eval(
        re.search(r'addressLines.+?(\[.+?])', r.text).group(1))
    print(*match)


main('https://www.yelp.com/biz/rollin-phatties-houston')

Output:

1731 Westheimer Rd Houston, TX 77098
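
Note that `print(*match)` separates the address lines with spaces. If you want the comma-separated form from the question, joining the list with `", "` should do it. The sketch below runs the same regular expression against a hypothetical text fragment (`SAMPLE_TEXT` is made up for illustration; the real page source may look different), and the helper name `extract_address_lines` is an assumption.

```python
import re
from ast import literal_eval

# Hypothetical fragment imitating the page source; the real markup may differ.
SAMPLE_TEXT = '{"addressLines": ["1731 Westheimer Rd", "Houston, TX 77098"], "other": [1, 2]}'


def extract_address_lines(text):
    """Return the list that follows 'addressLines', or an empty list if absent."""
    match = re.search(r'addressLines.+?(\[.+?])', text)
    if match is None:
        return []
    # The captured group looks like a Python list literal, so literal_eval can parse it.
    return literal_eval(match.group(1))


lines = extract_address_lines(SAMPLE_TEXT)
print(", ".join(lines))
# 1731 Westheimer Rd, Houston, TX 77098
```

Returning an empty list when the pattern is missing avoids the `AttributeError` that `.group(1)` would raise on a failed match.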

Answer 3

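The business address shown on the page is generated dynamically. If you view the page source of the URL, you will find that the restaurant's address is stored in a script element, so you need to extract it from there.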

from bs4 import BeautifulSoup
import requests
import json
page = requests.get('https://www.yelp.com/biz/rollin-phatties-houston')
htmlpage = BeautifulSoup(page.text, 'html.parser')
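# The data is embedded in <script type="application/json"> elements
scriptelements = htmlpage.find_all('script', attrs={'type':'application/json'})
# Index 2 depends on the page layout and may change
scriptcontent = scriptelements[2].text
# The JSON is wrapped in an HTML comment; strip the markers before parsing
scriptcontent = scriptcontent.replace('<!--', '')
scriptcontent = scriptcontent.replace('-->', '')
jsondata = json.loads(scriptcontent)
print(jsondata['bizDetailsPageProps']['bizContactInfoProps']['businessAddress'])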

With the code above, you should be able to extract the address of any business whose page follows the same structure.
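
The comment-stripping and key lookup can also be tried without hitting the site. This is a minimal sketch under stated assumptions: `SAMPLE_SCRIPT` is a hypothetical string imitating the comment-wrapped JSON, and the helpers `parse_commented_json` and `get_nested` are illustrative names, not part of any library.

```python
import json

# Hypothetical script content imitating the comment-wrapped JSON described above.
SAMPLE_SCRIPT = (
    '<!--{"bizDetailsPageProps": {"bizContactInfoProps": '
    '{"businessAddress": "1731 Westheimer Rd, Houston, TX 77098"}}}-->'
)


def parse_commented_json(script_text):
    """Strip an HTML comment wrapper, if present, and parse the rest as JSON."""
    cleaned = script_text.strip()
    if cleaned.startswith("<!--"):
        cleaned = cleaned[len("<!--"):]
    if cleaned.endswith("-->"):
        cleaned = cleaned[:-len("-->")]
    return json.loads(cleaned)


def get_nested(data, keys, default=None):
    """Walk nested dicts key by key, returning default if any key is missing."""
    for key in keys:
        if not isinstance(data, dict) or key not in data:
            return default
        data = data[key]
    return data


data = parse_commented_json(SAMPLE_SCRIPT)
address = get_nested(
    data,
    ["bizDetailsPageProps", "bizContactInfoProps", "businessAddress"],
    default="",
)
print(address)
```

Removing only the leading and trailing markers, rather than every occurrence, leaves any `<!--` or `-->` inside the JSON values untouched, and the guarded lookup returns a default instead of raising `KeyError` if the page structure changes.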
