Bypass Referral Denied error in selenium using python

Question

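I'm writing a script to download images from Naver Comic (comic.naver.com). The script itself is finished, but I can't save the images. I used to scrape the images successfully with urllib and BeautifulSoup, but it now looks like the site has introduced hotlink blocking, and I can't save the images to my system with either urllib or selenium.

Update: I tried changing the user agent to see whether that was causing the problem. The result is still the same.

Is there any fix or workaround?

My current code: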

import requests
from bs4 import BeautifulSoup
import re
import urllib
import urllib2
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import TimeoutException
from selenium.webdriver.common.desired_capabilities import DesiredCapabilities


dcap = dict(DesiredCapabilities.PHANTOMJS)
dcap["phantomjs.page.settings.userAgent"] = (
    "Chrome/15.0.87"
)

url = "http://comic.naver.com/webtoon/detail.nhn?titleId=654817&no=44&weekday=tue"
driver = webdriver.PhantomJS(desired_capabilities=dcap)

soup = BeautifulSoup(urllib.urlopen(url).read())
scripts = soup.findAll('img', alt='comic content')

for links in scripts:
    Imagelinks = links['src']
    filename = Imagelinks.split('_')[-1]
    print 'Downloading Image : '+filename
    driver.get(Imagelinks)
    driver.save_screenshot(filename)


driver.close()

After MAI's answer, I tried selenium and got what I wanted. That part is solved now. My code:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.desired_capabilities import DesiredCapabilities
from selenium.common.exceptions import TimeoutException
from bs4 import BeautifulSoup
from selenium.webdriver.common.action_chains import ActionChains



driver = webdriver.Chrome()
url = "http://comic.naver.com/webtoon/detail.nhn?titleId=654817&no=44&weekday=tue"
driver.get(url)

elem = driver.find_elements_by_xpath("//div[@class='wt_viewer']//img[@alt='comic content']")

for links in elem:
    print links.get_attribute('src')


driver.quit()

However, when I try to take a screenshot, it says "element is not attached to the page". How do I fix this now? :/

Answer 1

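I looked through the site with Chrome Developer Tools.

I suggest downloading the images directly instead of taking screenshots. Selenium WebDriver should run the page's JavaScript in the PhantomJS headless browser, so you should be able to find the images that JavaScript loads at the following path.

The path I got from inspecting the HTML is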

html body #wrap #container #content div #comic_view_area div img

The img tags at the last level have IDs of the form content_image_N, where N counts up from 0. So you can also use img#content_image_0 to get a specific image.
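
As a rough illustration of the ID pattern described above, here is a minimal sketch that collects the src of every img whose ID looks like content_image_N, in order of N. It uses only the standard library's html.parser rather than a CSS selector, and the names ComicImageCollector, extract_comic_images and SAMPLE_HTML, as well as the example.invalid URLs, are made up for this example. With selenium, the HTML string would presumably come from driver.page_source after the page has loaded.

```python
from html.parser import HTMLParser


class ComicImageCollector(HTMLParser):
    """Collect the src of every <img> whose id looks like content_image_N."""

    PREFIX = "content_image_"

    def __init__(self):
        super().__init__()
        self.sources = {}  # maps N (int) -> src URL

    def handle_starttag(self, tag, attrs):
        if tag != "img":
            return
        attributes = dict(attrs)
        image_id = attributes.get("id") or ""
        if not image_id.startswith(self.PREFIX):
            return
        suffix = image_id[len(self.PREFIX):]
        # Keep only IDs that end in a plain number and carry a src.
        if suffix.isdecimal() and attributes.get("src"):
            self.sources[int(suffix)] = attributes["src"]


def extract_comic_images(html):
    """Return the image URLs ordered by their content_image_N index."""
    collector = ComicImageCollector()
    collector.feed(html)
    return [collector.sources[n] for n in sorted(collector.sources)]


# Hypothetical markup that only mimics the structure described above.
SAMPLE_HTML = """
<div id="comic_view_area"><div class="wt_viewer">
  <img id="content_image_0" alt="comic content" src="http://example.invalid/a_IMAG01_1.jpg">
  <img id="content_image_1" alt="comic content" src="http://example.invalid/a_IMAG01_2.jpg">
  <img id="banner" src="http://example.invalid/banner.png">
</div></div>
"""

print(extract_comic_images(SAMPLE_HTML))
```

Reading the src strings out of the HTML up front leaves you with a plain list of URLs, which does not depend on any live page element.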

Answer 2

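(Note: Sorry, I can't comment yet, so I have to post this as an answer.)

To answer your original question: I was just able to download images from Naver Webtoons (the English site) with cURL by adding a Referer: http://www.webtoons.com header, like this: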

curl -H "Referer: http://www.webtoons.com" [link to image] > img.jpg

I haven't tried it, but you probably want to use http://comic.naver.com as the referer instead. To do this with urllib, create a Request object that carries the required header:

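# Python 3; needs "import shutil" and "import urllib.request"
req = urllib.request.Request(url, headers={"Referer": "http://comic.naver.com"})
with urllib.request.urlopen(req) as response, open("image.jpg", "wb") as outfile:
    shutil.copyfileobj(response, outfile)  # stream the response body to disk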

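Inside the with block, shutil.copyfileobj(src, dest) saves the file. So instead of taking screenshots, just collect the list of all the images you want to download and make a request for each one with the Referer header.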
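
The per-image loop described above could be sketched roughly as follows. This is only an illustration under assumptions: the helper names build_image_request, filename_for and download_images are invented here, the referer value is the untested guess from this answer, and the file name rule (the part after the last underscore) is borrowed from the question's script.

```python
import os
import posixpath
import shutil
import urllib.request
from urllib.parse import urlsplit

# Assumption: this referer is the untested suggestion from the answer.
REFERER = "http://comic.naver.com"


def build_image_request(image_url, referer=REFERER):
    """Create a request that carries the Referer header."""
    return urllib.request.Request(image_url, headers={"Referer": referer})


def filename_for(image_url):
    """Use the part after the last underscore of the URL path as the file name."""
    path = urlsplit(image_url).path
    return posixpath.basename(path).split("_")[-1]


def download_images(image_urls, dest_dir=".", referer=REFERER):
    """Download every URL with the Referer header and return the saved paths."""
    saved = []
    for image_url in image_urls:
        target = os.path.join(dest_dir, filename_for(image_url))
        request = build_image_request(image_url, referer)
        with urllib.request.urlopen(request) as response, open(target, "wb") as outfile:
            shutil.copyfileobj(response, outfile)  # stream the body to disk
        saved.append(target)
    return saved
```

A call such as download_images(urls, dest_dir="pages") would then fetch each scraped URL in turn, where urls is the list of image links and the pages directory already exists. Whether the server accepts the request still depends on the referer it expects.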

Edit: I have a working script on GitHub that only needs urllib and BeautifulSoup.


Tags: python, selenium, urllib, beautifulsoup, python-2.7