BeautifulSoup find_all skips a class index if the data is not inside a div
I am trying to scrape data from a website.
To illustrate my problem, here are some examples.
First iteration:
<span class="lot-details-desc right">,344 USD
</span>
<span class="lot-details-desc right">Automatic
</span>
<span class="lot-details-desc right">Mercedes
</span>
Second iteration:
<span class="lot-details-desc right">00 USD
</span>
<span class="lot-details-desc right"> #NO DATA HERE
</span>
<span class="lot-details-desc right">Mercedes
</span>
In a loop, while retrieving the values with Beautiful Soup:
price = soup.find_all("span", {"class": "lot-details-desc right"})[0].get_text()
print(price)
trans = soup.find_all("span", {"class": "lot-details-desc right"})[1].get_text()
print(trans)
name = soup.find_all("span", {"class": "lot-details-desc right"})[2].get_text()
print(name)
I get this result:
1st iteration
price=,344 USD
trans=Automatic
name=Mercedes
2nd iteration
price=00 USD
trans=Mercedes
name=ERROR (index out of range, because this time find_all returns only indexes 0 and 1 instead of 0, 1 and 2)
Any suggestions would be appreciated.
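A side note on the symptom described above: when all three spans are present in the markup that Beautiful Soup parses, find_all returns all three, including the empty one. So a missing index means the element was absent from the parsed HTML, not that find_all skipped it. Here is a minimal sketch that reuses the second-iteration markup from the question as an inline string, calls find_all once, and pads the result so indexing cannot fail. The helper name `extract_fields` and its `expected` parameter are hypothetical, for illustration only.

```python
from bs4 import BeautifulSoup

# Markup copied from the question's second iteration (the middle span is empty).
html = """
<span class="lot-details-desc right">00 USD
</span>
<span class="lot-details-desc right">
</span>
<span class="lot-details-desc right">Mercedes
</span>
"""


def extract_fields(markup, expected=3):
    """Hypothetical helper: stripped text of each matching span, padded with None."""
    soup = BeautifulSoup(markup, "html.parser")
    # Call find_all once instead of once per field.
    spans = soup.find_all("span", {"class": "lot-details-desc right"})
    # Empty spans become None instead of an empty string.
    texts = [span.get_text(strip=True) or None for span in spans]
    # Pad the list so unpacking never raises when spans are missing.
    texts += [None] * (expected - len(texts))
    return texts[:expected]


price, trans, name = extract_fields(html)
print("price={} trans={} name={}".format(price, trans, name))
```

With padding, a page that really lacks a span yields None for the trailing fields instead of raising IndexError. It cannot tell which field is missing, so positional extraction remains fragile.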
The data on this site is loaded dynamically via JavaScript. You can use the requests
module to fetch the data directly from their API:
import re
import json
import requests
url = 'https://www.copart.com/lot/25831510/'
data_url = 'https://www.copart.com/public/data/lotdetails/solr/{lot_id}'
headers = {'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:77.0) Gecko/20100101 Firefox/77.0'}
lot_id = re.search(r'lot/(\d+)', url).group(1)
with requests.session() as s:
    s.get(url, headers=headers)  # first request loads the cookies
    data = s.get(data_url.format(lot_id=lot_id), headers=headers).json()
# uncomment this to see all data:
# print(json.dumps(data, indent=4))
name = data['data']['lotDetails']['mkn']
trans = data['data']['lotDetails']['tsmn']
price = data['data']['lotDetails']['la']
print('Name={} Trans={} Price={}'.format(name, trans, price))
Prints:
Name=TOYOTA Trans=AUTOMATIC Price=7344.0
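Since the question involves a lot where one field has no data, looking fields up by key may be more robust than relying on positions. The sketch below assumes the JSON has the same `data` / `lotDetails` shape used above. The helper `lot_field` is hypothetical, and both sample payloads are made-up stand-ins rather than real API responses.

```python
def lot_field(payload, key, default=None):
    """Hypothetical helper: read one lotDetails field without raising on gaps."""
    details = (payload.get("data") or {}).get("lotDetails") or {}
    value = details.get(key)
    # Treat a missing key, None, or an empty string as "no data".
    return default if value in (None, "") else value


# Made-up sample payloads with the structure assumed above;
# the second one has no transmission field at all.
first = {"data": {"lotDetails": {"mkn": "TOYOTA", "tsmn": "AUTOMATIC", "la": 7344.0}}}
second = {"data": {"lotDetails": {"mkn": "MERCEDES", "la": 100.0}}}

for payload in (first, second):
    name = lot_field(payload, "mkn", "N/A")
    trans = lot_field(payload, "tsmn", "N/A")
    price = lot_field(payload, "la", "N/A")
    print("Name={} Trans={} Price={}".format(name, trans, price))
```

Because each value is tied to its key, a missing transmission shows up as N/A and the name cannot slide into the wrong variable.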