Scraping .aspx page with Python yields 404

I'm a web-scraping beginner and am trying to scrape this page: https://profiles.doe.mass.edu/statereport/ap.aspx

I'd like to be able to select the options at the top (e.g. District, 2020-2021, Computer Science A, Female) and then download the resulting data for those selections.

Here's the code I'm currently using:

import requests
from bs4 import BeautifulSoup

url = 'https://profiles.doe.mass.edu/statereport/ap.aspx'
with requests.Session() as s:
    s.headers['User-Agent'] = "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:100.0) Gecko/20100101 Firefox/100.0"
    r = s.get('https://profiles.doe.mass.edu/statereport/ap.aspx')
    soup = BeautifulSoup(r.text,"lxml")
    data = {i['name']:i.get('value','') for i in soup.select('input[name]')}
    
    
    data["ctl00$ContentPlaceHolder1$ddReportType"]="DISTRICT",
    data["ctl00$ContentPlaceHolder1$ddYear"]="2021",
    data["ctl00$ContentPlaceHolder1$ddSubject"]="COMSCA",
    data["ctl00$ContentPlaceHolder1$ddStudentGroup"]="F",
    
    p = s.post(url,data=data)

When I print p.text, I get back a page whose title is '\t404 - Page Not Found\r\n' and whose message is:

<h2>We are unable to locate information at: <br /><br />http://profiles.doe.mass.edu:80/statereport/ap.aspxp?ASP.NET_SessionId=bxfgao54wru50zl5tkmfml00</h2>\r\n

Here's what data looks like before I edit it:

{'__EVENTVALIDATION': '/wEdAFXz4796FFICjJ1Xc5ZOd9SwSHUlrrW+2y3gXxnnQf/b23Vhtt4oQyaVxTPpLLu5SKjKYgCipfSrKpW6jkHllWSEpW6/zTHqyc3IGH3Y0p/oA6xdsl0Dt4O8D2I0RxEvXEWFWVOnvCipZArmSoAj/6Nog6zUh+Jhjqd1LNep6GtJczTu236xw2xaJFSzyG+xo1ygDunu7BCYVmh+LuKcW56TG5L0jGOqySgRaEMolHMgR0Wo68k/uWImXPWE+YrUtgDXkgqzsktuw0QHVZv7mSDJ31NaBb64Fs9ARJ5Argo+FxJW/LIaGGeAYoDphL88oao07IP77wrmH6t1R4d88C8ImDHG9DY3sCDemvzhV+wJcnU4a5qVvRziPyzqDWnj3tqRclGoSw0VvVK9w+C3/577Gx5gqF21UsZuYzfP4emcqvJ7ckTiBk7CpZkjUjM6Z9XchlxNjWi1LkzyZ8QMP0MaNCP4CVYJfndopwFzJC7kI3W106YIA/xglzXrSdmq6/MDUCczeqIsmRQGyTOkQFH724RllsbZyHoPHYvoSAJilrMQf6BUERVN4ojysx3fz5qZhZE7DWaJAC882mXz4mEtcevFrLwuVPD7iB2v2mlWoK0S5Chw4WavlmHC+9BRhT36jtBzSPRROlXuc6P9YehFJOmpQXqlVil7C9OylT4Kz5tYzrX9JVWEpeWULgo9Evm+ipJZOKY2YnC41xTK/MbZFxsIxqwHA3IuS10Q5laFojoB+e+FDCqazV9MvcHllsPv2TK3N1oNHA8ODKnEABoLdRgumrTLDF8Lh+k+Y4EROoHhBaO3aMppAI52v3ajRcCFET22jbEm/5+P2TG2dhPhYgtZ8M/e/AoXht29ixVQ1ReO/6bhLIM+i48RTmcl76n1mNjfimB8r3irXQGYIEqCkXlUHZ/SNlRYyx3obJ6E/eljlPveWNidFHOaj+FznOh264qDkMm7fF78WBO2v0x+or1WGijWDdQtRy9WRKXchYxUchmBlYm15YbBfMrIB7+77NJV+M6uIVVnCyiDRGj+oPXcTYxqSUCLrOMQyzYKJeu8/hWD0gOdKeoYUdUUJq4idIk+bLYy76sI/N2aK+aXZo/JPQ+23gTHzIlyi4Io7O6kXaULPs8rfo8hpkH1qXyKb/rP2VJBNWgyp8jOMx9px+m4/e2Iecd86E4eN4Rk6OIiwqGp+dMdgntXu5ruRHb1awPlVmDw92dL1P0b0XxJW7EGfMzyssMDhs1VT6K6iMUTHbuXkNGaEG1dP1h4ktnCwGqDLVutU6UuzT6i4nfqnvFjGK9+7Ze8qWIl8SYyhmvzmgpLjdMuF9CYMQ2Aa79HXLKFACsSSm0dyiU1/ZGyII2Fvga9o+nVV1jZam3LkcAPaXEKwEyJXfN/DA7P4nFAaQ+QP+2bSgrcw+/dw+86OhPyG88qyJwqZODEXE1WB5zSOUywGb1/Xed7wq9WoRs6v8rAK5c/2iH7YLiJ4mUVDo+7WCKrzO5+Hsyah3frMKbheY1acRmSVUzRgCnTx7jvcLGR9Jbt6TredqZaWZBrDFcntdg7EHd7imK5PqjUld3iCVjdyO+yLKUkMKiFD85G3vEferg/Q/TtfVBqeTU0ohP9d+CsKOmV/dxVYWEtBcfa9KiN6j4N8pP7+3iUOhajojZ8jV98kxT0zPZlzkpqI4SwR6Ys8d2RjIi5K+oQul4pL5u+zZvX0lsLP9Jl7FeVTfBvST67T6ohz8dl9gBfmmbwnT23SyuFSUGd6ZGaKE+9kKYmuImW7w3ePs7C70yDWHpIpxP/IJ4GHb36LWto2g3Ld3goCQ4fXPu7C4iTiN6b5WUSlJJsWGF4eQkJue8=',
 '__VIEWSTATE': '/wEPDwUKLTM0NzY4OTQ4NmRkDwwPzTpuna+yxVhQxpRF4n2+zYKQtotwRPqzuCkRvyU=',
 '__VIEWSTATEGENERATOR': '2B6F8D71',
 'ctl00$ContentPlaceHolder1$btnViewReport': 'View Report',
 'ctl00$ContentPlaceHolder1$hfExport': 'ViewReport',
 'leftNavId': '11241',
 'quickSearchValue': '',
 'runQuickSearch': 'Y',
 'searchType': 'QUICK',
 'searchtext': ''}

Following suggestions from similar questions, I've tried using params, editing data in various ways (to mimic the POST request I see in my browser when I navigate the site myself), and specifying the ASP.NET_SessionId, all to no avail.
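
For reference, here's a sketch of how that last attempt might look inside the same Session block as above (the cookie pinning is my reconstruction of "specifying the ASP.NET_SessionId", with the id taken from the 404 message; it still came back 404):

    # Hypothetical reconstruction of one failed attempt: pin the session
    # cookie before posting, using the id echoed in the 404 message
    s.cookies.set('ASP.NET_SessionId', 'bxfgao54wru50zl5tkmfml00',
                  domain='profiles.doe.mass.edu')
    p = s.post(url, data=data)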

How can I access the information on this site?

This should be what you're looking for. What I did was parse the HTML with bs4 and find the table. Then I grabbed the rows and, to make the data easier to work with, put them into a dictionary.

import requests
from bs4 import BeautifulSoup


url = 'https://profiles.doe.mass.edu/statereport/ap.aspx'
with requests.Session() as s:
    s.headers['User-Agent'] = "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:100.0) Gecko/20100101 Firefox/100.0"
    r = s.get(url)
    soup = BeautifulSoup(r.text, 'html.parser')
    table = soup.find_all('table')
    rows = table[0].find_all('tr')
    data = {}
    for row in rows:
        if row.find_all('th'):
            # Header row: create one dictionary key per column
            keys = row.find_all('th')
            for key in keys:
                data[key.text] = []
        else:
            # Data row: pair each cell with its header by position
            # (values.index(value) would mis-assign duplicate cell values)
            values = row.find_all('td')
            for i, value in enumerate(values):
                data[keys[i].text].append(value.text)

for key in data:
    print(key, data[key][:10])
    print('\n')

Output:

District Name ['Abington', 'Academy Of the Pacific Rim Charter Public (District)', 'Acton-Boxborough', 'Advanced Math and Science Academy Charter (District)', 'Agawam', 'Amesbury', 'Amherst-Pelham', 'Andover', 'Arlington', 'Ashburnham-Westminster']


District Code ['00010000', '04120000', '06000000', '04300000', '00050000', '00070000', '06050000', '00090000', '00100000', '06100000']


Tests Taken ['     100', '     109', '   1,070', '     504', '     209', '     126', '     178', '     986', '     893', '      97']


Score=1 ['      16', '      81', '      12', '      29', '      27', '      18', '       5', '      70', '      72', '       4']


Score=2 ['      31', '      20', '      55', '      74', '      65', '      34', '      22', '     182', '     149', '      23']


Score=3 ['      37', '       4', '     158', '     142', '      55', '      46', '      37', '     272', '     242', '      32']


Score=4 ['      15', '       3', '     344', '     127', '      39', '      19', '      65', '     289', '     270', '      22']


Score=5 ['       1', '       1', '     501', '     132', '      23', '       9', '      49', '     173', '     160', '      16']


% Score 1-2 ['  47.0', '  92.7', '   6.3', '  20.4', '  44.0', '  41.3', '  15.2', '  25.6', '  24.7', '  27.8']


% Score 3-5 ['  53.0', '   7.3', '  93.7', '  79.6', '  56.0', '  58.7', '  84.8', '  74.4', '  75.3', '  72.2']



Process finished with exit code 0
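
One note on that output: the cell text keeps the table's padding and thousands separators, so a little cleanup is needed before treating the values as numbers. A minimal sketch over the data dict built above:

    # Strip padding and thousands separators from the scraped cell text
    clean = {key: [v.strip().replace(',', '') for v in vals]
             for key, vals in data.items()}
    print(int(clean['Tests Taken'][2]))    # 1070
    print(float(clean['% Score 3-5'][0]))  # 53.0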

I was able to get this working by modifying the code from here. I'm not sure why editing the payload in this way makes a difference, so I'd appreciate any insight!
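
My best guess, from diffing the two payloads: the failing request also posts the site-wide quick-search fields (leftNavId, quickSearchValue, runQuickSearch, searchType, searchtext) that live outside the report form, and including them seems to route the request to a different handler, which would be consistent with the 404 pointing at a mangled URL with the session id appended. A quick, unverified way to see exactly what the filtering below drops:

    # Sketch: diff the question's grab-every-input payload against the
    # filtered payload built below (run after both dicts exist)
    all_inputs = {i['name']: i.get('value', '') for i in soup.select('input[name]')}
    print(set(all_inputs) - set(payload))
    # per the dump above, expect the quick-search fields:
    # {'leftNavId', 'quickSearchValue', 'runQuickSearch', 'searchType', 'searchtext'}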

Here's my working code, which uses Pandas to parse the tables:

import requests
from bs4 import BeautifulSoup
import pandas as pd

url = 'https://profiles.doe.mass.edu/statereport/ap.aspx'
with requests.Session() as s:
    s.headers['User-Agent'] = "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:100.0) Gecko/20100101 Firefox/100.0"
    
    response = s.get(url)
    soup = BeautifulSoup(response.content, 'html5lib')

    # Keep only the report form's ctl00$... inputs that carry a value,
    # plus the ASP.NET state fields (__VIEWSTATE, __EVENTVALIDATION, ...);
    # this drops the site-wide quick-search fields
    data = {tag['name']: tag['value']
            for tag in soup.select('input[name^=ctl00]') if tag.get('value')}
    state = {tag['name']: tag['value']
             for tag in soup.select('input[name^=__]')}
    
    payload = data.copy()
    payload.update(state)
    
    payload["ctl00$ContentPlaceHolder1$ddReportType"]="DISTRICT",
    payload["ctl00$ContentPlaceHolder1$ddYear"]="2021",
    payload["ctl00$ContentPlaceHolder1$ddSubject"]="COMSCA",
    payload["ctl00$ContentPlaceHolder1$ddStudentGroup"]="F",
    
    p = s.post(url, data=payload)
    df = pd.read_html(p.text)[0]

    # Restore the leading zeros pandas drops when parsing the codes as numbers
    df["District Code"] = df["District Code"].astype(str).str.zfill(8)
    display(df)  # display() assumes a notebook; use print(df) in a plain script
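
If the goal is to download the results for those settings, the DataFrame can then be written straight to disk (the filename here is just an example):

    df.to_csv('ap_district_2021_comsca_female.csv', index=False)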