Scraping a webpage in Python with multiple inputs
I need to use Python to get the data from the table on this site: https://www.cashbackforex.com/en-US/tools/economic-impacts.aspx
The code I have written so far is:
from bs4 import BeautifulSoup
import requests

url = 'https://www.cashbackforex.com/en-US/tools/economic-impacts.aspx'

with requests.Session() as session:
    session.headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/40.0.2214.115 Safari/537.36'}

    # parsing parameters
    response = session.get(url)
    soup = BeautifulSoup(response.content, "lxml")
    print(soup.select('input[type="button"]'))

    data = {
        'dnn$ctr1601$Chart$ddlCurrencies': 'USD',
        'dnn$ctr1601$Chart$ddlReports': 'US Change in NonFarm Payrolls',
        'dnn$ctr1601$Chart$ddlTimeZone': '(UTC) Coordinated Universal Time',
        '__EVENTTARGET': soup.find('input', {'name': '__EVENTTARGET'}).get('value', ''),
        '__EVENTARGUMENT': soup.find('input', {'name': '__EVENTARGUMENT'}).get('value', ''),
        '__VIEWSTATE': soup.find('input', {'name': '__VIEWSTATE'}).get('value', ''),
        '__VIEWSTATEGENERATOR': soup.find('input', {'name': '__VIEWSTATEGENERATOR'}).get('value', ''),
        'btnApplyTools': soup.find('input', {'id': 'btnApplyTools'}).get('value', '')
    }

    # parsing data
    response = session.post(url, data=data)
    soup = BeautifulSoup(response.content, "lxml")
    print(soup)
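But every time I run the program, I can't find the table's values in the output. I think the program is not sending the input values to the server, but I'm not sure.
The table is the following: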
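I inspected the page and found that you don't need Session() and you don't need to send multiple parameters to get the table you want. You only need to specify the inst parameter (an identifier that acts like a filter) and timezone. For example, for USD / US Change in NonFarm Payrolls the inst value is 10332295, and timezone for (UTC) Coordinated Universal Time is 3.
So your request should look like this: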
import requests

params = {'inst': '10332295', 'timezone': '3'}
response = requests.get('https://www.cashbackforex.com/DesktopModules/Chart/HistoricalEventFigures.ashx', params=params)
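To make it clearer what that request sends, here is a minimal sketch that builds the same URL by hand with the standard library. The helper name build_url is hypothetical; the endpoint and the two parameter names come from the request above. As far as I know, passing params= to requests.get produces an equivalent query string.

```python
from urllib.parse import urlencode

# Endpoint taken from the request above.
BASE_URL = 'https://www.cashbackforex.com/DesktopModules/Chart/HistoricalEventFigures.ashx'


def build_url(inst, timezone):
    """Return the full GET URL for one report (hypothetical helper)."""
    # urlencode keeps the dict's insertion order and escapes the values.
    query = urlencode({'inst': inst, 'timezone': timezone})
    return f'{BASE_URL}?{query}'


print(build_url('10332295', '3'))
```

Opening the printed URL in a browser is a quick way to check what the endpoint returns for a given inst value before writing any parsing code.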
Then you can parse the response in whatever way is convenient, for example:
from xml.dom import minidom
xml = minidom.parseString(response.text)
print([i.childNodes[0].wholeText for i in xml.getElementsByTagName("Date")])
print([i.childNodes[0].wholeText for i in xml.getElementsByTagName("ReportName")])
...
Output:
['2 Dec 2016', '4 Nov 2016', '7 Oct 2016', '2 Sep 2016', '5 Aug 2016', '8 Jul 2016', '3 Jun 2016',...]
['US Change in NonFarm Payrolls', 'US Change in NonFarm Payrolls', 'US Change in NonFarm Payrolls', 'US Change in NonFarm Payrolls',...]
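If you would rather have one row per event than two parallel lists, the lists can be paired up. The sketch below is only an illustration: SAMPLE_XML is a made-up document, and its Events/Event wrapper elements are an assumption, since only the Date and ReportName tag names are known from the code above. With a live response you would pass response.text instead.

```python
from xml.dom import minidom

# Hypothetical sample; the real response layout is assumed, not confirmed.
SAMPLE_XML = """<Events>
  <Event><Date>2 Dec 2016</Date><ReportName>US Change in NonFarm Payrolls</ReportName></Event>
  <Event><Date>4 Nov 2016</Date><ReportName>US Change in NonFarm Payrolls</ReportName></Event>
</Events>"""


def tag_texts(doc, tag):
    """Collect the text of every <tag> element, using '' for empty ones."""
    texts = []
    for node in doc.getElementsByTagName(tag):
        child = node.firstChild
        is_text = child is not None and child.nodeType == child.TEXT_NODE
        texts.append(child.data if is_text else '')
    return texts


def parse_rows(xml_text):
    """Return a list of (date, report_name) tuples, paired by position."""
    doc = minidom.parseString(xml_text)
    dates = tag_texts(doc, 'Date')
    names = tag_texts(doc, 'ReportName')
    # zip stops at the shorter list, so a missing tag drops trailing rows.
    return list(zip(dates, names))


for date, name in parse_rows(SAMPLE_XML):
    print(date, '|', name)
```

Pairing by position only works if every event carries both tags; if the real XML can omit one, walking each parent element would be the safer approach.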