Scraping a webpage in Python with multiple inputs

I need to use Python to get the data from the table on this website: https://www.cashbackforex.com/en-US/tools/economic-impacts.aspx. The code I have written so far is:

from bs4 import BeautifulSoup
import requests

url = 'https://www.cashbackforex.com/en-US/tools/economic-impacts.aspx'

with requests.Session() as session:
    session.headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/40.0.2214.115 Safari/537.36'}

    # parsing parameters
    response = session.get(url)
    soup = BeautifulSoup(response.content, "lxml")
    print(soup.select('input[type="button"]'))
    data = {
        'dnn$ctr1601$Chart$ddlCurrencies': 'USD',
        'dnn$ctr1601$Chart$ddlReports': 'US Change in NonFarm Payrolls',
        'dnn$ctr1601$Chart$ddlTimeZone': '(UTC) Coordinated Universal Time',
        '__EVENTTARGET': soup.find('input', {'name': '__EVENTTARGET'}).get('value', ''),
        '__EVENTARGUMENT': soup.find('input', {'name': '__EVENTARGUMENT'}).get('value', ''),
        '__VIEWSTATE': soup.find('input', {'name': '__VIEWSTATE'}).get('value', ''),
        '__VIEWSTATEGENERATOR': soup.find('input', {'name': '__VIEWSTATEGENERATOR'}).get('value', ''),
        'btnApplyTools': soup.find('input', {'id': 'btnApplyTools'}).get('value', '')
    }

    # parsing data
    response = session.post(url, data=data)

    soup = BeautifulSoup(response.content, "lxml")
    print(soup)

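But every time I run the program, I can't find the table's values in the output. I think the program is not sending the input values to the server, but I'm not sure.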

The table is shown below:

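I inspected the page and found that you don't need Session(), and you don't need to send multiple parameters, to get the table you want. You only need to specify the inst parameter (an identifier that acts like a filter) and timezone. For example, for USD / US Change in NonFarm Payrolls the inst value is 10332295, and the timezone value for (UTC) Coordinated Universal Time is 3.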

So your request should look like this:

import requests

# inst identifies the report; timezone identifies the time zone
params = {'inst': '10332295', 'timezone': '3'}
response = requests.get('https://www.cashbackforex.com/DesktopModules/Chart/HistoricalEventFigures.ashx', params=params)
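
To see exactly which URL such a request resolves to, the query string can be reproduced with the standard library alone. This is only a sketch: it assumes nothing beyond the endpoint and the two values quoted above, and build_url is a helper name made up for illustration.

```python
from urllib.parse import urlencode

ENDPOINT = 'https://www.cashbackforex.com/DesktopModules/Chart/HistoricalEventFigures.ashx'


def build_url(inst, timezone):
    """Return the GET URL for one report (inst) in one time zone (hypothetical helper)."""
    # urlencode keeps the dict's insertion order: inst first, then timezone
    return ENDPOINT + '?' + urlencode({'inst': inst, 'timezone': timezone})


url = build_url('10332295', '3')
print(url)
```

Pasting the printed URL into a browser is a quick way to check what the endpoint returns before writing any parsing code.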

Then you can parse the response in whatever way is convenient, for example:

from xml.dom import minidom

xml = minidom.parseString(response.text)
print([i.childNodes[0].wholeText for i in xml.getElementsByTagName("Date")])
print([i.childNodes[0].wholeText for i in xml.getElementsByTagName("ReportName")])
...

Output:

['2 Dec 2016', '4 Nov 2016', '7 Oct 2016', '2 Sep 2016', '5 Aug 2016', '8 Jul 2016', '3 Jun 2016',...]
['US Change in NonFarm Payrolls', 'US Change in NonFarm Payrolls', 'US Change in NonFarm Payrolls', 'US Change in NonFarm Payrolls',...]
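
If you would rather have one row per event than two separate lists, the columns can be zipped together. The sketch below uses xml.etree.ElementTree in place of minidom and runs against a small hand-written sample. The Events/Event wrapper elements are assumptions, since only the Date and ReportName tags are known from the output above. It also assumes the tags carry no XML namespace and that each appears exactly once per event. extract_rows is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: the real response's nesting and extra fields may differ.
SAMPLE = """<Events>
  <Event><Date>2 Dec 2016</Date><ReportName>US Change in NonFarm Payrolls</ReportName></Event>
  <Event><Date>4 Nov 2016</Date><ReportName>US Change in NonFarm Payrolls</ReportName></Event>
</Events>"""


def extract_rows(xml_text, tags=("Date", "ReportName")):
    """Collect the text of each tag in document order and zip the columns into rows."""
    root = ET.fromstring(xml_text)
    # One list per tag; iter() walks the whole tree regardless of nesting depth
    columns = [[el.text for el in root.iter(tag)] for tag in tags]
    return list(zip(*columns))


rows = extract_rows(SAMPLE)
for row in rows:
    print(row)
```

With the live response you would call extract_rows(response.text) instead of passing SAMPLE, and add more tag names to tags once you have checked which ones the response actually contains.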