How to scrape an address (comma-separated text) using BeautifulSoup in Python
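I am trying to scrape the address from the link below:
https://www.yelp.com/biz/rollin-phatties-houston
But I only get the first part of the address (i.e. 1731 Westheimer Rd), whereas the complete address is comma-separated:
1731 Westheimer Rd, Houston, TX 77098
Can anyone help me solve this? Please find my code below: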
import bs4 as bs
import urllib.request as url

source = url.urlopen('https://www.yelp.com/biz/rollin-phatties-houston')
soup = bs.BeautifulSoup(source, 'html.parser')
mains = soup.find_all("div", {"class": "secondaryAttributes__09f24__3db5x arrange-unit__09f24__1gZC1 border-color--default__09f24__R1nRO"})

address = []
for main in mains:
    try:
        # find("p") returns only the first <p> inside <address>
        address.append(main.address.find("p").text)
    except AttributeError:
        # This div has no <address> or no <p> inside it
        address.append("")
print(address)
# 1731 Westheimer Rd
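Staying with the BeautifulSoup approach from the question, a likely cause is that `find("p")` returns only the first `<p>` inside `<address>`. The sketch below collects every `<p>` and joins them with commas. The sample markup and the helper name `full_address` are assumptions made for illustration; the live Yelp page may nest the address differently or render it from script data.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the real Yelp page may nest the address differently.
SAMPLE_HTML = """
<div class="secondaryAttributes">
  <address>
    <p><span>1731 Westheimer Rd</span></p>
    <p><span>Houston, TX 77098</span></p>
  </address>
</div>
"""


def full_address(html):
    """Join the text of every <p> inside the first <address> tag."""
    soup = BeautifulSoup(html, "html.parser")
    tag = soup.find("address")
    if tag is None:
        return ""
    # find_all returns all matching <p> tags, not just the first one
    lines = [p.get_text(strip=True) for p in tag.find_all("p")]
    return ", ".join(line for line in lines if line)


print(full_address(SAMPLE_HTML))
```

If the `<address>` tag is missing from the fetched HTML, the function returns an empty string, which is a hint that the address has to be read from the script data instead.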
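You don't need to inspect the rendered elements to find the address. The data is already delivered with the page inside a script element, and you can get it with the following code: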
import chompjs
import bs4 as bs
import urllib.request as url

source = url.urlopen('https://www.yelp.com/biz/rollin-phatties-houston')
soup = bs.BeautifulSoup(source, 'html.parser')
# Index 16 depends on the page layout at the time of writing and may change
javascript = soup.select("script")[16].string
data = chompjs.parse_js_object(javascript)
print(data['bizDetailsPageProps']['bizContactInfoProps']['businessAddress'])
import requests
import re
from ast import literal_eval


def main(url):
    r = requests.get(url)
    # Capture the list literal that follows "addressLines" in the page source
    match = literal_eval(
        re.search(r'addressLines.+?(\[.+?])', r.text).group(1))
    print(*match)


main('https://www.yelp.com/biz/rollin-phatties-houston')
Output:
1731 Westheimer Rd Houston, TX 77098
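The output above is space-separated, while the question asks for the comma-separated form. A minimal variant of the same regex idea that joins the lines with `", "` could look like the sketch below. It swaps `literal_eval` for `json.loads`, on the assumption that `addressLines` is a JSON array of double-quoted strings with no `]` inside any entry. The sample source text and the name `address_from_source` are hypothetical.

```python
import json
import re

# Hypothetical fragment of page source; the real text around addressLines may differ.
SAMPLE_SOURCE = (
    '{"addressLines": ["1731 Westheimer Rd", "Houston, TX 77098"], '
    '"phoneNumber": null}'
)


def address_from_source(text):
    """Return the addressLines entries joined by ', ', or None if not found."""
    # Non-greedy match up to the first closing bracket after "addressLines"
    found = re.search(r'"addressLines"\s*:\s*(\[.*?\])', text, re.S)
    if found is None:
        return None
    return ", ".join(json.loads(found.group(1)))


print(address_from_source(SAMPLE_SOURCE))
```

With a live page, `r.text` from `requests.get(url)` would take the place of `SAMPLE_SOURCE`. Returning `None` when nothing matches avoids the `AttributeError` that calling `.group(1)` on a failed search would raise.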
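The business address shown on the page is generated dynamically. If you view the page source of the URL, you will find that the restaurant's address is stored in a script element, so you need to extract it from there.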
from bs4 import BeautifulSoup
import requests
import json
page = requests.get('https://www.yelp.com/biz/rollin-phatties-houston')
htmlpage = BeautifulSoup(page.text, 'html.parser')
scriptelements = htmlpage.find_all('script', attrs={'type':'application/json'})
scriptcontent = scriptelements[2].text
scriptcontent = scriptcontent.replace('<!--', '')
scriptcontent = scriptcontent.replace('-->', '')
jsondata = json.loads(scriptcontent)
print(jsondata['bizDetailsPageProps']['bizContactInfoProps']['businessAddress'])
With the code above, you should be able to extract the address of any business whose page uses the same layout.
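The fixed script index and key path used above can break when the page layout changes. One way to make the lookup less brittle is to search the parsed JSON for the key wherever it sits. This is only a sketch: the helper names `find_key` and `address_from_script` are hypothetical, and the sample payload is made up to mirror the key path used above.

```python
import json

# Hypothetical payload shaped like the key path used above; the value is illustrative.
SAMPLE_SCRIPT = (
    '<!--{"bizDetailsPageProps": {"bizContactInfoProps": '
    '{"businessAddress": "1731 Westheimer Rd Houston, TX 77098"}}}-->'
)


def find_key(node, key):
    """Depth-first search of nested dicts/lists; first value for key, else None."""
    if isinstance(node, dict):
        if key in node:
            return node[key]
        children = list(node.values())
    elif isinstance(node, list):
        children = node
    else:
        return None
    for child in children:
        result = find_key(child, key)
        if result is not None:
            return result
    return None


def address_from_script(script_text):
    """Strip the HTML comment wrapper, parse the JSON, and look up businessAddress."""
    cleaned = script_text.replace("<!--", "").replace("-->", "")
    try:
        data = json.loads(cleaned)
    except json.JSONDecodeError:
        return None  # this script element does not hold valid JSON
    return find_key(data, "businessAddress")


print(address_from_script(SAMPLE_SCRIPT))
```

In practice you could call `address_from_script(element.text)` for each entry in `scriptelements` and stop at the first result that is not `None`, instead of relying on `scriptelements[2]`.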