How can I use Python to web scrape a CSV download button that uses a JavaScript function?
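I'm trying to scrape this web page with Python every day for a school project: https://thereserve2.apx.com/myModule/rpt/myrpt.asp?r=206
I'd like the Python script to do this without opening a browser window. I'm trying to get better at Python and want to learn a scraping approach that doesn't go the Selenium route.
In the top right corner there is a "download as TXT" button, which calls a JavaScript function to retrieve the full data from the report. I'd like to mimic that request from Python, retrieve the resulting txt file, and save it to a specific path.
So far, I've opened the Network tab in Chrome Developer Tools and recorded clicking the button. It appears to send a POST request to the URL https://thereserve2.apx.com/myModule/include/rptdownload.asp with the data below.
I'm trying to mimic the same POST request in Python so that I can get the txt file it generates.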
from urllib import request, parse
data_dict = {
    'Data':'Stamp_1',
    'Title':'Retired Offset Credits',
    'Exclude':',rhid,ftType,Other Attributes here,Make Public,ahid,',
    'Columns':'all,Account Holder,Quantity of Offset Credits,FacilityName,Email,Status Effective',
    'Masks':'|||||MM/DD/YYYY',
    'ClassMasks':',,#.0,,,',
    'Headings':',,,Project Name,,',
    'FormatType':'txt'
}
data = parse.urlencode(data_dict).encode()
req = request.Request('https://thereserve2.apx.com/myModule/include/rptdownload.asp', data=data_dict)
resp = request.urlretrieve(req, 'download.txt')
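This doesn't work: I get "TypeError: expected string or bytes-like object". I feel like I'm getting close, but I can't seem to turn the POST request into the file download or table pull I want. Any help would be appreciated.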
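For reference, the TypeError most likely comes from `request.urlretrieve`, which expects a URL string rather than a `Request` object. The snippet also passes the raw `data_dict` to `Request` instead of the encoded `data`. A minimal sketch of the same POST using only `urllib` might look like the following. The helper names `build_post_request` and `download` are hypothetical, and this covers only how the request is built, not any session state the server may expect.

```python
from urllib import parse, request

DOWNLOAD_URL = 'https://thereserve2.apx.com/myModule/include/rptdownload.asp'


def build_post_request(url, fields):
    """Return a Request that POSTs `fields` as URL-encoded form data."""
    # Request needs bytes for `data`, so pass the encoded body, not the dict.
    body = parse.urlencode(fields).encode('ascii')
    return request.Request(url, data=body, method='POST')


def download(req, path, timeout=30):
    """Send the request with urlopen (which accepts Request objects) and save the reply."""
    with request.urlopen(req, timeout=timeout) as resp, open(path, 'wb') as f:
        f.write(resp.read())
```

With the dictionary from the question, this would be called as `download(build_post_request(DOWNLOAD_URL, data_dict), 'download.txt')`.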
You also need a cookie for this to work:
import requests
from io import StringIO
import pandas as pd

# Form fields sent with the POST request
data = {
    'myFilter': '',
    'Data': 'Stamp_0',
    'Title': 'Retired Offset Credits',
    'Exclude': ',rhid,ftType,Other Attributes here,Make Public,ahid,',
    'Columns': 'all,Account Holder,Quantity of Offset Credits,FacilityName,Email,Status Effective',
    'Masks': '|||||MM/DD/YYYY',
    'ClassMasks': ',,#.0,,,',
    'Headings': ',,,Project Name,,',
    'Parameters': '',
    'ParametersOriginal': '',
    'SortORder': '',
    'FormatType': 'txt',
    'ReplaceExpression': '',
    'ReplaceValue': '',
}

# Session cookie copied from the browser; the value belongs to one session,
# so it will probably need to be replaced with a current one
cookies = {
    'ASPSESSIONIDCGTRQSDS': 'DFDMDAFDFEPACLKJAAPHHBDH',
}

# Get the file
response = requests.post('https://thereserve2.apx.com/myModule/include/rptdownload.asp', cookies=cookies, data=data)

# Look at the file
df = pd.read_table(StringIO(response.text), sep=',', on_bad_lines='warn')
print(df.head())

# Write the file
with open('download.txt', 'wb') as f:
    f.write(response.content)
Output:
   Vintage  Offset Credit Serial Numbers  Quantity of Offset Credits  Status Effective  Project ID  Project Name  Project Type  Protocol Version  Project Site Location  Project Site State  Project Site Country  Additional Certification(s)  CORSIA Eligible  Account Holder  Retirement Reason  Retirement Reason Details  Unnamed: 16
0  2021  CAR-1-US-888-4-666-TX-2021-6665-1 to 17444  17444  12/09/2021  CAR888  Angelina County Landfill  Landfill Gas Capture/Combustion  Version 3.0  Lufkin  TEXAS  US  NaN  No  Element Markets Emissions, LLC  On Behalf of Third Party  NaN  NaN
1  2021  CAR-1-US-1247-37-234-MT-2021-6653-1 to 110  110  04/20/2022  CAR1247  Bluesource - Carroll Avoided Grassland Convers...  Avoided Grassland Conversion  Version 1.0  Valley County, MT  MONTANA  US  NaN  No  Cool Effect  Environmental Benefit  NaN  NaN
2  2021  CAR-1-MX-1282-42-938-PU-2021-6736-1 to 1604  1604  02/17/2022  CAR1282  Captura de carbono en San Rafael Ixtapalucan  Forestry - MX  Version 1.5  San Rafael Ixtapalucan  PUEBLA  MX  NaN  No  Cultivo Land PBC  On Behalf of Third Party  Meta / Facebook Sustainability Goals  NaN
3  2021  CAR-1-MX-1282-42-938-PU-2021-6734-1 to 5  5  02/17/2022  CAR1282  Captura de carbono en San Rafael Ixtapalucan  Forestry - MX  Version 1.5  San Rafael Ixtapalucan  PUEBLA  MX  NaN  No  Cultivo Land PBC  On Behalf of Third Party  Meta / Facebook Sustainability Goals  NaN
4  2021  CAR-1-MX-1415-42-938-OA-2021-6719-1 to 213  213  12/06/2021  CAR1415  Carbono, Agua y Biodiversidad Indígena Capulálpam  Forestry - MX  Version 2.0  Capulálpam de Méndez, Oaxaca  OAXACA  MX  NaN  No  Cool Effect  Environmental Benefit  NaN  NaN
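Hard-coding the cookie means the script will probably stop working once that session expires, which matters for a daily job. One possible way around this is to let a `requests.Session` pick up the cookie itself and reuse it for the POST. This is only a sketch: it assumes the server issues the session cookie when the report page is loaded, which has not been verified here. `fetch_report` is a hypothetical helper name, and `form_data` is the same `data` dictionary as above.

```python
import requests

REPORT_URL = 'https://thereserve2.apx.com/myModule/rpt/myrpt.asp?r=206'
DOWNLOAD_URL = 'https://thereserve2.apx.com/myModule/include/rptdownload.asp'


def fetch_report(form_data, out_path, session=None):
    """POST the download form through a cookie-aware session and save the reply."""
    if session is None:
        session = requests.Session()
    # Assumption: loading the report page makes the server set its
    # ASPSESSIONID... cookie, which the session stores and sends back.
    session.get(REPORT_URL, timeout=30).raise_for_status()
    response = session.post(DOWNLOAD_URL, data=form_data, timeout=60)
    response.raise_for_status()
    # Write the raw bytes so the file matches what the browser would download.
    with open(out_path, 'wb') as f:
        f.write(response.content)
    return response
```

Calling `fetch_report(data, 'download.txt')` would then replace both the `cookies` dictionary and the `requests.post` call. If the saved file turns out to be an error page rather than the report, the cookie assumption does not hold for this site and the cookie would still have to be supplied by hand.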