Access denied while scraping a website with selenium in Python
Hi, I'm trying to scrape information from the Macy's website, specifically from this category = "https://www.macys.com/shop/featured/women-handbags". But when I navigate to a specific item's page, I get a blank page with the following message:
Access Denied
You don't have permission to access "any of the items links listed on the above category link" on this server.
Reference #18.14d6f7bd.1526927300.12232a22
I also tried changing the user agent using Chrome options, but without success.
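For reference, a user-agent override through ChromeOptions usually looks like the minimal sketch below (Selenium 3-era syntax to match the code that follows; the exact user-agent string is an example, not taken from the post):

from selenium import webdriver

options = webdriver.ChromeOptions()
# Example desktop user-agent string; this exact value is an assumption, not from the post
options.add_argument('user-agent=Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/66.0.3359.139 Safari/537.36')
driver = webdriver.Chrome("/Users/rodrigopeniche/Downloads/chromedriver", chrome_options=options)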
Here is my code:
import sys
reload(sys)
sys.setdefaultencoding('utf8')  # Python 2 idiom; not available in Python 3

from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
from selenium.common.exceptions import TimeoutException

url = 'https://www.macys.com/shop/featured/women-handbags'

def init_selenium():
    global driver
    driver = webdriver.Chrome("/Users/rodrigopeniche/Downloads/chromedriver")
    driver.get(url)

def find_page_items():
    items_elements = driver.find_elements_by_css_selector('li.productThumbnailItem')
    for index, element in enumerate(items_elements):
        # Re-locate the thumbnails after each navigation to avoid stale element references
        items_elements = driver.find_elements_by_css_selector('li.productThumbnailItem')
        item_link = items_elements[index].find_element_by_tag_name('a').get_attribute('href')
        driver.get(item_link)
        driver.back()

init_selenium()
find_page_items()
Any idea what's going on? How can I fix it?
This is not (entirely) a Selenium-oriented solution, but it does work. You can give it a try.
from selenium import webdriver
import requests
from bs4 import BeautifulSoup

url = 'https://www.macys.com/shop/featured/women-handbags'

def find_page_items(driver, link):
    driver.get(link)
    # Use Selenium only to collect the product links from the category page
    item_links = [item.find_element_by_tag_name('a').get_attribute('href')
                  for item in driver.find_elements_by_css_selector('li.productThumbnailItem')]
    for newlink in item_links:
        # Fetch each product page with requests, sending a browser-like User-Agent header
        res = requests.get(newlink, headers={"User-Agent": "Mozilla/5.0"})
        soup = BeautifulSoup(res.text, "lxml")
        name = soup.select_one("h1[itemprop='name']").text.strip()
        print(name)

if __name__ == '__main__':
    driver = webdriver.Chrome()
    try:
        find_page_items(driver, url)
    finally:
        driver.quit()
Output:
Mercer Medium Bonded-Leather Crossbody
Mercer Large Tote
Nolita Medium Satchel
Voyager Medium Multifunction Top-Zip Tote
Mercer Medium Crossbody
Kelsey Large Crossbody
Medium Mercer Gallery
Mercer Large Center Tote
Signature Raven Large Tote
However, if you insist on using Selenium, you will need to create a new instance of it each time you navigate to a new url, or, perhaps the better option, clear the cache between navigations; a sketch of that approach follows below.
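A minimal sketch of that per-url idea, assuming the same Selenium 3-era API as above (the helper name collect_item_links is illustrative, not from the answer):

from selenium import webdriver

url = 'https://www.macys.com/shop/featured/women-handbags'

def collect_item_links():
    # One short-lived driver just to harvest the product links from the category page
    driver = webdriver.Chrome()
    try:
        driver.get(url)
        return [item.find_element_by_tag_name('a').get_attribute('href')
                for item in driver.find_elements_by_css_selector('li.productThumbnailItem')]
    finally:
        driver.quit()

for link in collect_item_links():
    driver = webdriver.Chrome()  # brand-new browser instance per product page, so no state carries over
    try:
        driver.get(link)
        print(driver.title)
    finally:
        driver.quit()

The lighter-weight variant would reuse one driver and call driver.delete_all_cookies() between navigations, though that clears cookies rather than the full cache.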