Scraping tabbed table from AWS pricing
I'm trying to build a scraper for the tabbed tables on this page (https://aws.amazon.com/sagemaker/pricing/). I'm only interested in training, processing, and a few others.
import bs4
import requests

url = "https://aws.amazon.com/sagemaker/pricing/"
req = requests.get(url)
soup = bs4.BeautifulSoup(req.content, "html.parser")
tables = soup.find_all("table")
inst_table = str(tables[0])
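But it looks like I have to use some kind of dynamic mechanism to get at the tabbed switches.
Say we clicked on the training tab; my goal is to build a file that stores the scraped data like this: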
"ml.t2.medium": {
    "vCPU": 2.0,
    "mem_GiB": 4.0,
    "price": 0.15,
    "category": "Standard",
    "task": "training",
}
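The good news is that you don't need the heavy guns of selenium.
As is usually the case with AWS, there's almost always an API you can query that returns the data you want.
Here's what you need and how to get it: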
import json
import time

import requests

headers = {
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:94.0) Gecko/20100101 Firefox/94.0",
}
# Append the current Unix time as the timestamp query parameter.
endpoint = (
    "https://b0.p.awsstatic.com/pricing/2.0/meteredUnitMaps/"
    "sagemaker/USD/current/sagemaker-instances.json?"
    f"timestamp={int(time.time())}"
)
response = requests.get(endpoint, headers=headers).json()

# Print every pricing record for a single region.
for region, region_data in response["regions"].items():
    if region == "EU (Frankfurt)":
        for instance_type, instance_data in region_data.items():
            print(json.dumps(instance_data, indent=2))
Sample output for EU (Frankfurt) (shortened for brevity):
{
  "rateCode": "X7Z5CZBN2ZY5QED6.JRTCKXETXF.6YS6EN2CT7",
  "price": "6.1120000000",
  "Instance": "ml.g4dn.12xlarge",
  "Clock Speed": "2.5 GHz",
  "Instance Type": "ml.g4dn.12xlarge-AsyncInf",
  "Component": "AsyncInf",
  "VCPU": "48",
  "Memory": "192 GiB"
}
{
  "rateCode": "F926HEYB3SV5TQ3Y.JRTCKXETXF.6YS6EN2CT7",
  "price": "6.8000000000",
  "Instance": "ml.g4dn.16xlarge",
  "Clock Speed": "2.5 GHz",
  "Instance Type": "ml.g4dn.16xlarge-AsyncInf",
  "Component": "AsyncInf",
  "VCPU": "64",
  "Memory": "256 GiB"
}
{
  "rateCode": "7SMSS7DTJHR8UWN7.JRTCKXETXF.6YS6EN2CT7",
  "price": "1.8810000000",
  "Instance": "ml.g4dn.4xlarge",
  "Clock Speed": "2.5 GHz",
  "Instance Type": "ml.g4dn.4xlarge-AsyncInf",
  "Component": "AsyncInf",
  "VCPU": "16",
  "Memory": "64 GiB"
}
and much more ...
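
To get from records like these to the structure asked for in the question, something along the following lines might work. Treat it as a sketch under assumptions: `to_target_format` and its `component` and `task` parameters are hypothetical names introduced here, the field names ("Instance", "VCPU", "Memory", "price", "Component") are taken from the sample output above, and the only component value known from that output is "AsyncInf". The values used for the training and processing tabs would have to be checked in the actual response. The "category" field from the question is left out because nothing in the sample output maps to it.

```python
import json
import re


def to_target_format(records, component, task):
    """Map raw pricing records for one component to {instance: {...}}."""
    result = {}
    for record in records:
        if record.get("Component") != component:
            continue
        # "192 GiB" -> "192"; skip records without a numeric memory value.
        memory = re.match(r"\s*(\d+(?:\.\d+)?)", record.get("Memory", ""))
        if memory is None:
            continue
        try:
            result[record["Instance"]] = {
                "vCPU": float(record["VCPU"]),
                "mem_GiB": float(memory.group(1)),
                "price": float(record["price"]),
                "task": task,  # caller-chosen label, not part of the API data
            }
        except (KeyError, ValueError):
            # Record lacks a field or holds a non-numeric value; skip it.
            continue
    return result


# Two records copied from the sample output above (some fields omitted).
sample_records = [
    {
        "price": "6.1120000000",
        "Instance": "ml.g4dn.12xlarge",
        "Component": "AsyncInf",
        "VCPU": "48",
        "Memory": "192 GiB",
    },
    {
        "price": "1.8810000000",
        "Instance": "ml.g4dn.4xlarge",
        "Component": "AsyncInf",
        "VCPU": "16",
        "Memory": "64 GiB",
    },
]

converted = to_target_format(sample_records, "AsyncInf", "async_inference")
print(json.dumps(converted, indent=2))
```

With the live data you would pass `region_data.values()` from the loop above instead of `sample_records`, and write the result to a file with `json.dump`.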