Unable to Identify Webpage in BeautifulSoup by URL

Question

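I am using Python and Selenium to try to scrape all of the links from the results page of a search page. No matter what I search for on the previous screen, the URL of the results page is always: "https://chem.nlm.nih.gov/chemidplus/ProxyServlet". If I automate the search with Selenium and then try to read that URL into BeautifulSoup, I get HTTPError: HTTP Error 404: Not Found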

Here is my code:

from selenium import webdriver
from selenium.webdriver.support.ui import Select
from selenium.webdriver.common.by import By
from urllib.request import urlopen
from bs4 import BeautifulSoup
import csv


# create a new Firefox session
driver = webdriver.Firefox()
# wait 3 seconds for the page to load
driver.implicitly_wait(3)

# navigate to ChemIDPlus Website
driver.get("https://chem.nlm.nih.gov/chemidplus/")
#implicit wait 10 seconds for drop-down menu to load
driver.implicitly_wait(10)

#open drop-down menu QV7 ("Route:")
select=Select(driver.find_element_by_name("QV7"))
#select "inhalation" in QV7
select.select_by_visible_text("inhalation")
#identify submit button

search="/html/body/div[2]/div/div[2]/div/div[2]/form/div[1]/div/span/button[1]"

#click submit button
driver.find_element_by_xpath(search).click()

#increase the number of results per page
select=Select(driver.find_element_by_id("selRowsPerPage"))
select.select_by_visible_text("25")
#wait 3 seconds
driver.implicitly_wait(3)

#identify current search page...HERE IS THE ERROR, I THINK
url1="https://chem.nlm.nih.gov/chemidplus/ProxyServlet"
page1=urlopen(url1)
#read the search page
soup=BeautifulSoup(page1.content, 'html.parser')

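I suspect this has something to do with a proxy server, and that Python isn't receiving the information it needs to identify the website, but I'm not sure how to work around it. Thanks in advance!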

Answer 1

As a workaround for identifying the correct search page, I used Selenium to get the new URL: url1=driver.current_url. Next, I used requests to fetch the content and fed it to BeautifulSoup. All in all, I added:

#Added to the top of the script
import requests
...
#identify the current search page with Selenium
url1=driver.current_url
#scrape the content of the results page
r=requests.get(url1)
soup=BeautifulSoup(r.content, 'html.parser')
...
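
Since the original goal was to collect every link on the results page, here is a minimal sketch of that last step. It is not part of the answer above. It assumes the HTML is already available as a string, for example from Selenium's driver.page_source, which reflects what the browser is currently showing and so avoids fetching the URL a second time. The helper name extract_links and the sample HTML (including its paths) are hypothetical and only there so the sketch runs without a browser.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup


def extract_links(html, base_url):
    """Return absolute URLs for every <a> tag that has an href attribute."""
    soup = BeautifulSoup(html, "html.parser")
    links = []
    for anchor in soup.find_all("a", href=True):
        # Resolve relative hrefs against the URL the HTML came from
        links.append(urljoin(base_url, anchor["href"]))
    return links


# Hypothetical stand-in for driver.page_source (made-up paths, not real results)
sample_html = (
    '<a href="/chemidplus/example">one</a>'
    '<a href="https://example.org/x">two</a>'
    '<a>no href</a>'
)
print(extract_links(sample_html, "https://chem.nlm.nih.gov/chemidplus/ProxyServlet"))
```

In the script above, the call would presumably look like extract_links(driver.page_source, driver.current_url), made after the results page has finished loading.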


python

selenium

proxy-server