Python Selenium: data does not load (website security)

Question

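Please find below the code I am trying to use to download/scrape a CSV file. The code is the first stage of testing, and it fails even though no error is raised: the data does not load in the browser driven by geckodriver.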

from selenium import webdriver
from selenium.webdriver.support.ui import Select
import time

# Raw string so the backslashes in the Windows path are not treated as escape sequences
driver = webdriver.Firefox(executable_path=r"C:\Py378\prj14\geckodriver.exe")

driver.get("https://www.nseindia.com/market-data/live-equity-market")
time.sleep(5)

element_dropdown = Select(driver.find_element_by_id("equitieStockSelect"))
element_dropdown.select_by_index(44)   #Updated with help of @PDHide in the comments
time.sleep(5)

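The code runs without errors, but because of the site's security settings the data for the selected option does not load. Even when I select a different option manually in the automated browser, the table does not update, as if no selection had been made. (Perhaps the site detects the Selenium driver and expects certain headers, but I am not sure.) In addition, when I try to click "Download as CSV", it times out.

I need to download the F&O CSV once the option has been selected successfully (as shown above). Please help.

I can browse the site in my normally installed browsers, but when the same browsers are driven from Python (Selenium), it fails. How can I bypass this security?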

Answer 1

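I tried running your code (with Chrome, but that should not matter), or rather a slightly modified version so that I could better see what was happening. Note that I use implicitly_wait rather than sleep, which wastes time. Here I just try to select the second option: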

from selenium import webdriver
from selenium.webdriver.support.ui import Select

options = webdriver.ChromeOptions()
driver = webdriver.Chrome(options=options)

try:
    driver.implicitly_wait(3) # wait up to 3 seconds before calls to find elements time out
    driver.get("https://www.nseindia.com/market-data/live-equity-market")
    select = Select(driver.find_element_by_id("equitieStockSelect"))
    select.select_by_index(1)
finally:
    input('pausing...')
    driver.quit()

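As you can see, selecting the second option works without any problem. However, the new table fails to load:

At this point I manually reloaded the page and got the following result. My conclusion is that the site detects that the browser is being driven by automation and blocks access: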

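Update

So the data can be retrieved with requests. I used the Chrome inspector to watch the network XHR requests, then selected the second option (NIFTY NEXT 50) and observed the AJAX request being made:

In this case the URL is https://www.nseindia.com/api/equity-stockIndices?index=NIFTY%20NEXT%2050. However, you must first fetch the initial page with a requests Session instance, which keeps any cookies set by that first response and sends them with the later API request: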

import requests

try:
    s = requests.Session()
    headers = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.66 Safari/537.36'}
    s.headers.update(headers)
    # You have to first retrieve the initial page:
    resp = s.get('https://www.nseindia.com/market-data/live-equity-market')
    resp.raise_for_status()
    #print(resp.text)
    resp = s.get('https://www.nseindia.com/api/equity-stockIndices?index=NIFTY%20NEXT%2050')
    resp.raise_for_status()
    data = resp.json()
    print(data)
except Exception as e:
    print(e)

Prints:

{'name': 'NIFTY NEXT 50', 'advance': {'declines': '25', 'advances': '24', 'unchanged': '1'}, 'timestamp': '27-Nov-2020 16:00:00', 'data': [{'priority': 1, 'symbol': 'NIFTY NEXT 50', 'identifier': 'NIFTY NEXT 50', 'open': 30316.45,  etc. (data too long) }
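
Since the goal in the question is a CSV file, here is a minimal sketch of turning the list under the 'data' key into CSV text with the standard csv module. The helper name records_to_csv is my own, and the sample record only reuses the few fields visible in the truncated output above; I am assuming every entry in 'data' is a flat dict like that one. Nested values, if there are any, would simply be written with str().

```python
import csv
import io


def records_to_csv(records):
    """Serialize a list of dicts to CSV text (hypothetical helper, not part of any library)."""
    # Collect column names in first-seen order so records with extra keys still fit.
    fieldnames = []
    for record in records:
        for key in record:
            if key not in fieldnames:
                fieldnames.append(key)
    buffer = io.StringIO(newline='')
    writer = csv.DictWriter(buffer, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(records)  # keys missing from a record are written as empty cells
    return buffer.getvalue()


# Sample shaped like the first record of the printed output (only the fields shown there).
sample = {'data': [{'priority': 1, 'symbol': 'NIFTY NEXT 50',
                    'identifier': 'NIFTY NEXT 50', 'open': 30316.45}]}
csv_text = records_to_csv(sample['data'])
print(csv_text)
```

With the real response you would pass data['data'] instead of the sample and write the returned text to a file opened with newline='' so the csv line endings are preserved.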

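Update 2

In general, to work out the URL for any index, for example index 44, look at the corresponding option value for that index, in this case 'Securities in F&O', and substitute it for the variable option_value in the following program: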

from urllib.parse import quote_plus

option_value = 'SECURITIES IN F&O'

url = 'https://www.nseindia.com/api/equity-stockIndices?index=' + quote_plus(option_value)
print(url)

Prints:

https://www.nseindia.com/api/equity-stockIndices?index=SECURITIES+IN+F%26O

The URL above is the one to use.
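
Note that the URL seen in the inspector encodes spaces as %20, while quote_plus encodes them as +. Here is a small sketch showing that both forms decode to the same option value under standard query-string parsing. The helpers build_index_url and index_from_url are names I made up for illustration, and I am assuming the server decodes the query string in the standard way, which I have not verified for this site.

```python
from urllib.parse import parse_qs, quote, quote_plus, urlsplit

API_ENDPOINT = 'https://www.nseindia.com/api/equity-stockIndices'


def build_index_url(option_value, plus_for_space=True):
    """Hypothetical helper: build the API URL for one drop-down option value."""
    if plus_for_space:
        encoded = quote_plus(option_value)      # spaces become '+', '&' becomes %26
    else:
        encoded = quote(option_value, safe='')  # spaces become %20, '&' becomes %26
    return API_ENDPOINT + '?index=' + encoded


def index_from_url(url):
    """Decode the 'index' query parameter back to the original option value."""
    return parse_qs(urlsplit(url).query)['index'][0]


plus_url = build_index_url('SECURITIES IN F&O')
percent_url = build_index_url('SECURITIES IN F&O', plus_for_space=False)
print(plus_url)
print(percent_url)
print(index_from_url(plus_url) == index_from_url(percent_url))
```

Either URL can then be passed to s.get() on the same Session that fetched the initial page.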
