How to reach a forwarded webpage using Python Requests?

Question

How can I reach the following webpage using Python Requests?

https://www.fidelity.com.hk/investor/en/fund-prices-performance/fund-price-details/factsheet-historical-nav-dividends.page?FundId=10306

The page keeps being forwarded until I click two "Accept" buttons.

This is what I have tried:

import requests
s = requests.Session()
r = s.post("https://www.fidelity.com.hk/investor/en/important-notice.page?submit=true&componentID=1298599783876")
r = s.get("https://www.fidelity.com.hk/investor/en/fund-prices-performance/fund-price-details/factsheet-historical-nav-dividends.page?&FundId=10306")

How should I handle the first "Accept" button? I checked, and there is a cookie named "Accepted". Am I right? Here is the button's markup:

<a id="terms_use_accept" class="btn btn-default standard-btn smallBtn" title="Accept" href="javascript:void(0);">Accept</a>

Answer 1

You cannot handle JavaScript with the requests or urllib modules. But based on my (limited) knowledge, here is how I would approach the problem.

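This site uses a specific cookie to know whether you have already accepted their policy. If you have not, the server redirects you to the notice page. Find that cookie with a browser add-on and set it manually, so that the website shows you the content you are looking for.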
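Before hunting for the cookie, it can help to confirm that a request really ended up on the notice page. The sketch below assumes the forwarding is visible in the final URL and that the notice URL contains "important-notice", as in the question's POST URL; the helper name landed_on_notice is made up for illustration. It only relies on the .url attribute that a requests Response object exposes.

```python
from types import SimpleNamespace

# Assumption: the notice page URL contains this marker
# (taken from the important-notice.page URL in the question).
NOTICE_MARKER = "important-notice"


def landed_on_notice(response):
    """Return True if the final URL of the response is the notice page."""
    return NOTICE_MARKER in response.url


# Stand-in object with the same .url attribute as requests.Response,
# so the check can be tried without touching the network.
fake = SimpleNamespace(
    url="https://www.fidelity.com.hk/investor/en/important-notice.page"
)
print(landed_on_notice(fake))
```

With a real session, the same check would be landed_on_notice(session.get(url)); if it still returns True after setting the cookie, the cookie name or value is probably not the one the server expects.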

Another way is to use Qt's built-in web browser (which uses WebKit), since it lets you execute JavaScript code. Just call evaluateJavaScript("agree();") and you are done.

Hope this helps.

Answer 2

First of all, requests is not a browser and has no built-in JavaScript engine.

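However, you can mimic the underlying logic by inspecting what happens in the browser when you click "Accept". This is where the browser developer tools come in handy.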

If you click "Accept" in the first Accept/Decline "popup", an "accepted=true" cookie is set. As for the second "Accept", this is how the button link looks in the source code:

<a href="javascript:agree()">
    <img src="/static/images/investor/en/buttons/accept_Btn.jpg" alt="Accept" title="Accept">
</a>

When the button is clicked, the agree() function is called. Here is what it does:

function agree() {
    $("form[name='agreeFrom']").submit();
}

In other words, the agreeFrom form is submitted. The form is hidden, but you can find it in the source code:

<form name="agreeFrom" action="/investor/en/important-notice.page?submit=true&amp;componentID=1298599783876" method="post">
    <input value="Agree" name="iwPreActions" type="hidden">
    <input name="TargetPageName" type="hidden" value="en/fund-prices-performance/fund-price-details/factsheet-historical-nav-dividends">
    <input type="hidden" name="FundId" value="10306">
</form>

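We could submit this form with requests. But there is an easier option. If you click "Accept" and check which cookies are set, you will notice that besides "accepted" there are 4 new cookies:

"irdFundId" with the "FundId" value from the "FundId" form input, or from the requested URL (see "?FundId=10306")
"isAgreed=yes"
"isExpand=true"
"lastAgreedTime" with a timestamp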

Let's use this information to build a solution with requests+BeautifulSoup (for the HTML parsing part):

import time

from bs4 import BeautifulSoup
import requests
from requests.cookies import cookiejar_from_dict


fund_id = '10306'
last_agreed_time = str(int(time.time() * 1000))
url = 'https://www.fidelity.com.hk/investor/en/fund-prices-performance/fund-price-details/factsheet-historical-nav-dividends.page'

with requests.Session() as session:
    session.headers = {'User-Agent': 'Mozilla/5.0 (Linux; U; Android 4.0.3; ko-kr; LG-L160L Build/IML74K) AppleWebkit/534.30 (KHTML, like Gecko) Version/4.0 Mobile Safari/534.30'}
    session.cookies = cookiejar_from_dict({
        'accepted': 'true',
        'irdFundId': fund_id,
        'isAgreed': 'yes',
        'isExpand': 'true',
        'lastAgreedTime': last_agreed_time
    })

    response = session.get(url, params={'FundId': fund_id})

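    # name the parser explicitly so bs4 does not have to guess one
    soup = BeautifulSoup(response.content, "html.parser")
    print(soup.title.get_text())  # Python 3 print; shows only the title text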

It prints:

Fidelity Funds - America Fund A-USD| Fidelity

This means we have reached the desired page.
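For completeness, the form-submission route mentioned in this answer could look roughly like the sketch below. It is only an illustration under assumptions: the notice page is assumed to contain the hidden agreeFrom form exactly as quoted in this answer, and the names HiddenFormParser and submit_agree_form are invented here. The parsing uses only the standard library; the session argument is expected to be a requests.Session (or anything with a compatible post method).

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class HiddenFormParser(HTMLParser):
    """Collect the action URL and the input fields of one named form."""

    def __init__(self, form_name):
        super().__init__()
        self.form_name = form_name
        self.in_form = False
        self.action = None
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "form" and attrs.get("name") == self.form_name:
            self.in_form = True
            # HTMLParser already unescapes "&amp;" in attribute values
            self.action = attrs.get("action")
        elif tag == "input" and self.in_form and "name" in attrs:
            self.fields[attrs["name"]] = attrs.get("value", "")

    def handle_endtag(self, tag):
        if tag == "form":
            self.in_form = False


def submit_agree_form(session, page_url, html, form_name="agreeFrom"):
    """POST the hidden form found in `html`, resolving its action against page_url."""
    parser = HiddenFormParser(form_name)
    parser.feed(html)
    if parser.action is None:
        raise ValueError("form %r not found" % form_name)
    return session.post(urljoin(page_url, parser.action), data=parser.fields)
```

A possible use would be to fetch the notice page with session.get(...), pass response.url and response.text to submit_agree_form, and then request the fund page again with the same session so that the cookies set by the server are reused.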

Answer 3

You can also handle it with a browser automation tool called selenium:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC


driver = webdriver.Firefox()  # a headless browser works too (PhantomJS support was removed in Selenium 4)
driver.get('https://www.fidelity.com.hk/investor/en/fund-prices-performance/fund-price-details/factsheet-historical-nav-dividends.page?FundId=10306')

# switch to the popup
frame = WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.CSS_SELECTOR, "iframe.cboxIframe")))
driver.switch_to.frame(frame)

# click accept
accept = driver.find_element(By.LINK_TEXT, 'Accept')  # find_element_by_* is gone in newer Selenium 4 releases
accept.click()

# switch back to the main window
driver.switch_to.default_content()

# click accept
accept = driver.find_element(By.XPATH, '//a[img[@title="Accept"]]')  # find_element_by_* is gone in newer Selenium 4 releases
accept.click()

# wait for the page title to load
WebDriverWait(driver, 10).until(EC.title_is("Fidelity Funds - America Fund A-USD| Fidelity"))

# TODO: extract the data from the page
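One option for the extraction step, once both buttons have been clicked, is to hand the browser's cookies over to a requests session and continue without the browser. The sketch below is an assumption about how that could be done, not part of the original answer; copy_browser_cookies is a made-up helper name. It expects cookie dicts shaped like the ones returned by driver.get_cookies() and a jar with a set(name, value, **kwargs) method, such as session.cookies on a requests.Session.

```python
def copy_browser_cookies(browser_cookies, jar):
    """Copy Selenium-style cookie dicts into a requests-style cookie jar."""
    for cookie in browser_cookies:
        # Pass domain/path only when present, so the jar's defaults apply otherwise.
        extra = {key: cookie[key] for key in ("domain", "path") if cookie.get(key)}
        jar.set(cookie["name"], cookie["value"], **extra)
    return jar
```

It might be called as copy_browser_cookies(driver.get_cookies(), session.cookies), after which session.get(...) should send the same acceptance cookies the browser holds.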


python

web-scraping

python-requests