Access denied while scraping a website with selenium in Python

Hello, I'm trying to extract information from Macy's website, specifically from this category: "https://www.macys.com/shop/featured/women-handbags". But when I navigate to any specific item page, I get a blank page containing the following message:

Access Denied. You don't have permission to access "any of the items links listed on the above category link" on this server. Reference #18.14d6f7bd.1526927300.12232a22

I also tried changing the user agent through Chrome options, but without success.

Here is my code:

import sys
reload(sys)
sys.setdefaultencoding('utf8')
from selenium import webdriver 
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
from selenium.common.exceptions import TimeoutException

url = 'https://www.macys.com/shop/featured/women-handbags'

def init_selenium():
    global driver
    driver = webdriver.Chrome("/Users/rodrigopeniche/Downloads/chromedriver")
    driver.get(url)

def find_page_items():
    items_elements = driver.find_elements_by_css_selector('li.productThumbnailItem')
    for index, element in enumerate(items_elements):
        # Re-fetch the list on each iteration to avoid stale element references
        items_elements = driver.find_elements_by_css_selector('li.productThumbnailItem')
        item_link = items_elements[index].find_element_by_tag_name('a').get_attribute('href')
        driver.get(item_link)
        driver.back()


init_selenium()
find_page_items()

Any idea what's going on, and how can I fix it?

This is not (entirely) a selenium-oriented solution, but it does work. You can give it a try.

from selenium import webdriver 
import requests
from bs4 import BeautifulSoup

url = 'https://www.macys.com/shop/featured/women-handbags'

def find_page_items(driver,link):
    driver.get(link)
    item_links = [item.find_element_by_tag_name('a').get_attribute('href') for item in driver.find_elements_by_css_selector('li.productThumbnailItem')]
    for newlink in item_links:
        res = requests.get(newlink,headers={"User-Agent":"Mozilla/5.0"})
        soup = BeautifulSoup(res.text,"lxml")
        name = soup.select_one("h1[itemprop='name']").text.strip()
        print(name)

if __name__ == '__main__':
    driver = webdriver.Chrome()
    try:
        find_page_items(driver,url)
    finally:
        driver.quit()

Output:

Mercer Medium Bonded-Leather Crossbody
Mercer Large Tote
Nolita Medium Satchel
Voyager Medium Multifunction Top-Zip Tote
Mercer Medium Crossbody
Kelsey Large Crossbody
Medium Mercer Gallery
Mercer Large Center Tote
Signature Raven Large Tote

However, if you insist on using selenium, then you will need to create a new instance of it every time you navigate to a new url, or, probably the better option, clear the cache.