Image scraped as HTML page with urlretrieve

Question

I'm trying to scrape this image using urllib.urlretrieve.

>>> import urllib
>>> urllib.urlretrieve('http://i9.mangareader.net/one-piece/3/one-piece-1668214.jpg', 
        path) # path was previously defined

This code successfully saves the file at the given path. However, when I try to open the file, I get:

Could not load image 'imagename.jpg':
    Error interpreting JPEG image file (Not a JPEG file: starts with 0x3c 0x21)

When I run file imagename.jpg in my bash terminal, I get imagename.jpg: HTML document, ASCII text.

So how do I scrape this image as a JPEG file?

Answer 1

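This happens because the owner of the server hosting the image deliberately blocks access from Python's urllib, so what gets saved is an HTML page rather than the JPEG. That is why it works with requests. You can also do it in pure Python, but you have to send an HTTP User-Agent header that makes the request look like it comes from something other than urllib. For example: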

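# Python 2: urllib2 lets you attach custom headers to a request
import urllib2
req = urllib2.Request('http://i9.mangareader.net/one-piece/3/one-piece-1668214.jpg')
# Replace the default urllib User-Agent with a custom one
req.add_header('User-Agent', 'Feneric Was Here')
resp = urllib2.urlopen(req)
imgdata = resp.read()
# path is the destination file name defined earlier; write in binary mode
with open(path, 'wb') as outfile:
    outfile.write(imgdata)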

So it's a little more involved to get around, but still not too bad.
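For anyone on Python 3, where urllib2 was merged into urllib.request, the same idea might look roughly like the sketch below. The helper names build_request, looks_like_jpeg and fetch_jpeg are made up for illustration and are not part of any library. The sketch also checks the first two bytes of the response: a JPEG starts with 0xff 0xd8, whereas the 0x3c 0x21 in the error message is the ASCII for "<!", the start of an HTML page. Whether a given server accepts a particular User-Agent string is an assumption you would have to verify yourself.

```python
import urllib.request

# JPEG files begin with the start-of-image marker 0xff 0xd8.
JPEG_MAGIC = b"\xff\xd8"


def build_request(url, user_agent):
    """Build a request carrying a custom User-Agent header (hypothetical helper)."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


def looks_like_jpeg(data):
    """Return True if the bytes start with the JPEG magic number."""
    return data[:2] == JPEG_MAGIC


def fetch_jpeg(url, path, user_agent="Feneric Was Here"):
    """Download url to path, refusing to save anything that is not a JPEG."""
    req = build_request(url, user_agent)
    with urllib.request.urlopen(req) as resp:
        data = resp.read()
    if not looks_like_jpeg(data):
        # An HTML block page would start with b"<!" (0x3c 0x21) instead.
        raise ValueError("response does not start with JPEG magic bytes")
    with open(path, "wb") as outfile:
        outfile.write(data)
```

Calling fetch_jpeg(url, path) with the URL and path from the question would then either write the image or raise ValueError instead of silently saving an HTML page under a .jpg name.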

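Note that the site owner probably did this because some people were being abusive. Please don't be one of them! With great power comes great responsibility.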

python

web-scraping

urllib