How can I use Python to scrape a CSV download button that calls a JavaScript function?

I'm trying to scrape this web page daily with Python for a school project: https://thereserve2.apx.com/myModule/rpt/myrpt.asp?r=206

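I'd like the Python script to do this without opening a browser window. I'm trying to get better at Python and want to learn a scraping approach that doesn't go the Selenium route.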

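In the top right corner there is a button to download as TXT, which calls a JavaScript function to retrieve the full data from the report. I'd like to mimic that request from Python, retrieve the resulting txt file, and save it to a specific path.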

So far, I have opened the Network tab in Chrome Developer Tools and recorded clicking the button. It appears to send a POST request to the URL https://thereserve2.apx.com/myModule/include/rptdownload.asp with the data below.

I'm trying to mimic the same POST request in Python so that I can get the txt file it generates.

from urllib import request, parse

data_dict = {
        'Data':'Stamp_1',
        'Title':'Retired Offset Credits',
        'Exclude':',rhid,ftType,Other Attributes here,Make Public,ahid,',
        'Columns':'all,Account Holder,Quantity of Offset Credits,FacilityName,Email,Status Effective',
        'Masks':'|||||MM/DD/YYYY',
        'ClassMasks':',,#.0,,,',
        'Headings':',,,Project Name,,',
        'FormatType':'txt'
        }

data = parse.urlencode(data_dict).encode()
req =  request.Request('https://thereserve2.apx.com/myModule/include/rptdownload.asp', data=data_dict)
resp = request.urlretrieve(req, 'download.txt')

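This doesn't work: I get "TypeError: expected string or bytes-like object". I feel like I'm getting close, but I can't seem to turn the POST request into the file download or table pull I want. Any help would be greatly appreciated.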
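On the TypeError itself: `request.urlretrieve` expects a URL string as its first argument, not a `Request` object, and the `Request` body should be the URL-encoded bytes (`data`) rather than the raw dict (`data_dict`). A minimal urllib-only sketch of the corrected call follows. The helper names `build_request` and `save_response` are made up for illustration, and this fixes only the TypeError; it does not by itself guarantee the server returns the report.

```python
from urllib import parse, request

URL = "https://thereserve2.apx.com/myModule/include/rptdownload.asp"


def build_request(form_fields, url=URL):
    """Return a POST Request whose body is the URL-encoded form, as bytes."""
    body = parse.urlencode(form_fields).encode("ascii")
    return request.Request(url, data=body, method="POST")


def save_response(req, path, opener=request.urlopen):
    """Send the request with urlopen (which accepts Request objects)
    and write the raw response bytes to `path`."""
    with opener(req) as resp, open(path, "wb") as f:
        f.write(resp.read())
    return path


# Usage (performs a real network call), with data_dict as defined above:
# save_response(build_request(data_dict), "download.txt")
```

`urlopen` is used instead of `urlretrieve` because it accepts a `Request`, so the POST body travels with it.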

You also need to send a cookie for it to work:

import requests
from io import StringIO
import pandas as pd

data = {
    'myFilter': '',
    'Data': 'Stamp_0',
    'Title': 'Retired Offset Credits',
    'Exclude': ',rhid,ftType,Other Attributes here,Make Public,ahid,',
    'Columns': 'all,Account Holder,Quantity of Offset Credits,FacilityName,Email,Status Effective',
    'Masks': '|||||MM/DD/YYYY',
    'ClassMasks': ',,#.0,,,',
    'Headings': ',,,Project Name,,',
    'Parameters': '',
    'ParametersOriginal': '',
    'SortORder': '',
    'FormatType': 'txt',
    'ReplaceExpression': '',
    'ReplaceValue': '',
}

cookies = {
    'ASPSESSIONIDCGTRQSDS': 'DFDMDAFDFEPACLKJAAPHHBDH',
}

# Get the file
response = requests.post('https://thereserve2.apx.com/myModule/include/rptdownload.asp', cookies=cookies, data=data)

# Look at the file
df = pd.read_table(StringIO(response.text), sep=',', on_bad_lines='warn')
print(df.head())

# Write the file
with open('download.txt', 'wb') as f:
    f.write(response.content)

Output:

   Vintage                 Offset Credit Serial Numbers  Quantity of Offset Credits Status Effective Project ID                                       Project Name                     Project Type Protocol Version         Project Site Location Project Site State Project Site Country  Additional Certification(s) CORSIA Eligible                  Account Holder         Retirement Reason             Retirement Reason Details  Unnamed: 16
0     2021   CAR-1-US-888-4-666-TX-2021-6665-1 to 17444                       17444       12/09/2021     CAR888                           Angelina County Landfill  Landfill Gas Capture/Combustion      Version 3.0                        Lufkin              TEXAS                   US                          NaN              No  Element Markets Emissions, LLC  On Behalf of Third Party                                   NaN          NaN
1     2021   CAR-1-US-1247-37-234-MT-2021-6653-1 to 110                         110       04/20/2022    CAR1247  Bluesource - Carroll Avoided Grassland Convers...     Avoided Grassland Conversion      Version 1.0             Valley County, MT            MONTANA                   US                          NaN              No                     Cool Effect     Environmental Benefit                                   NaN          NaN
2     2021  CAR-1-MX-1282-42-938-PU-2021-6736-1 to 1604                        1604       02/17/2022    CAR1282       Captura de carbono en San Rafael Ixtapalucan                    Forestry - MX      Version 1.5        San Rafael Ixtapalucan             PUEBLA                   MX                          NaN              No                Cultivo Land PBC  On Behalf of Third Party  Meta / Facebook Sustainability Goals          NaN
3     2021     CAR-1-MX-1282-42-938-PU-2021-6734-1 to 5                           5       02/17/2022    CAR1282       Captura de carbono en San Rafael Ixtapalucan                    Forestry - MX      Version 1.5        San Rafael Ixtapalucan             PUEBLA                   MX                          NaN              No                Cultivo Land PBC  On Behalf of Third Party  Meta / Facebook Sustainability Goals          NaN
4     2021   CAR-1-MX-1415-42-938-OA-2021-6719-1 to 213                         213       12/06/2021    CAR1415  Carbono, Agua y Biodiversidad Indígena Capulálpam                    Forestry - MX      Version 2.0  Capulálpam de Méndez, Oaxaca             OAXACA                   MX                          NaN              No                     Cool Effect     Environmental Benefit                                   NaN          NaN
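Since the script is meant to run daily, a hard-coded `ASPSESSIONID...` value will presumably stop working once that session expires. One possible variation, sketched below, lets a `requests.Session` pick up a fresh cookie by loading the report page before posting. This rests on an assumption that was not verified here: that a plain GET of the report page is enough for the server to set the session cookie and prepare the report data. The function name `fetch_report` is hypothetical.

```python
import requests

REPORT_URL = "https://thereserve2.apx.com/myModule/rpt/myrpt.asp?r=206"
DOWNLOAD_URL = "https://thereserve2.apx.com/myModule/include/rptdownload.asp"


def fetch_report(form_fields, path, session=None):
    """Load the report page, then POST the download form with the same
    session so any cookies set by the first request are sent back."""
    if session is None:
        session = requests.Session()

    # Assumption: this GET makes the server issue the session cookie.
    session.get(REPORT_URL, timeout=30).raise_for_status()

    response = session.post(DOWNLOAD_URL, data=form_fields, timeout=30)
    response.raise_for_status()

    # Save the raw bytes exactly as the server sent them.
    with open(path, "wb") as f:
        f.write(response.content)
    return path


# Usage (performs real network calls), with `data` as defined above:
# fetch_report(data, "download.txt")
```

If the downloaded file comes back empty or as an error page, the assumption above does not hold for this site, and copying the cookie from the browser as shown earlier remains the fallback.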