Retrieving an image gives a 403 error while it works in the browser

Question

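Hello, I'm trying to build a manga downloader app, so I'm scraping several sites, but I run into a problem once I have the image URLs. I can view the image in my browser (Chrome) and download it from there, but I can't do the same with any of the popular scripting libraries.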

Here is what I tried:

String imgSrc = "https://cdn.mangaeden.com/mangasimg/aa/aa75d306397d1d11d07d66746dae78a36dc78672ae9e97a08cb7abb4.jpg";
Connection.Response resultImageResponse = Jsoup.connect(imgSrc)
                    .userAgent(
                            "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/535.21 (KHTML, like Gecko) Chrome/19.0.1042.0 Safari/535.21")
                    .referrer("none").execute();

// output here
OutputStreamWriter out = new OutputStreamWriter(new FileOutputStream(new java.io.File(String.valueOf(imgPath))));
out.write(resultImageResponse.body());          // resultImageResponse.body() is where the image's contents are.
out.close();

I also tried this:

URL imgUrl = new URL(imgSrc);
Files.copy(imgUrl.openStream(), imgPath);

Finally, since I was sure the link works, I tried downloading the image with Python, but I get a 403 error there as well:

import requests
url = "https://cdn.mangaeden.com/mangasimg/d0/d08f07d762acda8a1f004677ab2414b9766a616e20bd92de4e2e44f1.jpg"
res = requests.get(url)

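Googling, I found this, which seems very close to my problem, but I can't tell whether I set the referrer incorrectly or whether it simply doesn't work...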

Do you have any suggestions? Thanks!

Answer 1

How to fix?

Add some headers to your request so that it looks like it comes from a browser. You will then get a 200 response and can save the file.

Note: this also works in Postman. Just override the hidden User-Agent header and you will get the image as the response.

Example (Python)

import requests

# Present a browser-like User-Agent instead of the library default.
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.114 Safari/537.36'
}
url = "https://cdn.mangaeden.com/mangasimg/d0/d08f07d762acda8a1f004677ab2414b9766a616e20bd92de4e2e44f1.jpg"
res = requests.get(url, headers=headers)

# Write the raw bytes; an image must be saved in binary mode.
with open("image.jpg", 'wb') as f:
    f.write(res.content)
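
As a rough illustration of why the header matters: Python HTTP clients announce themselves by default (requests sends a User-Agent beginning with python-requests/, and the standard library's urllib sends one beginning with Python-urllib/), which a server can easily tell apart from a browser. The sketch below uses only the standard library so it runs without extra packages. The helper names default_user_agent, build_image_request and download_image are made up for this example, and whether a particular CDN filters on this header is an assumption based on the behaviour described above.

```python
import urllib.request

# Browser-like User-Agent copied from the example above; any current
# desktop browser string is assumed to work equally well.
BROWSER_UA = (
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/91.0.4472.114 Safari/537.36"
)


def default_user_agent() -> str:
    """Return the User-Agent urllib sends when the caller sets none."""
    opener = urllib.request.build_opener()
    return dict(opener.addheaders)["User-agent"]


def build_image_request(url: str, user_agent: str = BROWSER_UA) -> urllib.request.Request:
    """Build a GET request that carries a browser-like User-Agent."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


def download_image(url: str, path: str) -> None:
    """Fetch url with the browser-like header and write the raw bytes to path.

    urlopen raises urllib.error.HTTPError for 4xx/5xx responses, so a 403
    surfaces as an exception instead of silently saving an error page.
    """
    with urllib.request.urlopen(build_image_request(url), timeout=30) as resp:
        data = resp.read()
    with open(path, "wb") as f:
        f.write(data)
```

Calling download_image(url, "image.jpg") with the URL from the question would be the standard-library counterpart of the requests example, and default_user_agent() shows the value a script sends when no header is set.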

Answer 2

Someone wrote this answer and later deleted it, so I'm copying it here in case it is needed.

AFAIK, you can't download anything else apart from HTML Documents using jsoup.

If you open up Developer Tools on your browser, you can get the exact request the browser has made. With Chrome, it's something like this.

The minimal cURL request would in your case be:
curl 'https://cdn.mangaeden.com/mangasimg/aa/aa75d306397d1d11d07d66746dae78a36dc78672ae9e97a08cb7abb4.jpg' \
  -H 'user-agent: Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/535.21 (KHTML, like Gecko) Chrome/19.0.1042.0 Safari/535.21' \
  --output image.jpg

You can refer to HedgeHog's answer for a sample Python solution; here's how to achieve the same in Java using the new HTTP Client:

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse.BodyHandlers;
import java.nio.file.Paths;

public class ImageDownload {
    public static void main(String[] args) throws Exception {
        HttpClient client = HttpClient.newHttpClient();
        // Send a browser-like User-Agent with the request.
        HttpRequest request = HttpRequest.newBuilder()
            .uri(URI.create("https://cdn.mangaeden.com/mangasimg/aa/aa75d306397d1d11d07d66746dae78a36dc78672ae9e97a08cb7abb4.jpg"))
            .header("user-agent", "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/535.21 (KHTML, like Gecko) Chrome/19.0.1042.0 Safari/535.21")
            .build();
        // Stream the response body straight into image.jpg.
        client.send(request, BodyHandlers.ofFile(Paths.get("image.jpg")));
    }
}

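I adopted this solution in my Java code. One last note: if the image is downloaded but cannot be opened, the request probably returned a 503 status code; in that case you only need to run the request again. You can recognize a corrupted image because the image reader will say something like:

Not a JPEG file: starts with 0x3c 0x68

The bytes 0x3c 0x68 are the characters <h, the start of an HTML error page rather than an image.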
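
The check and the retry described here can be sketched as follows. This is only a minimal illustration under a few assumptions: the names looks_like_jpeg and fetch_with_retry are hypothetical, the fetch argument stands for any callable that performs one request and returns a (status code, body bytes) pair, and the number of attempts and the delay are arbitrary defaults. It relies on the fact that JPEG files begin with the bytes 0xFF 0xD8 0xFF, whereas an HTML error page begins with <.

```python
import time
from typing import Callable, Optional, Tuple

# Every JPEG file starts with the SOI marker 0xFF 0xD8 followed by 0xFF.
JPEG_MAGIC = b"\xff\xd8\xff"


def looks_like_jpeg(data: bytes) -> bool:
    """Return True if data begins with the JPEG signature."""
    return data.startswith(JPEG_MAGIC)


def fetch_with_retry(
    fetch: Callable[[], Tuple[int, bytes]],
    attempts: int = 3,
    delay: float = 1.0,
) -> bytes:
    """Call fetch() until it yields status 200 and JPEG bytes.

    fetch is a caller-supplied function (hypothetical here) that performs
    one HTTP request and returns (status_code, body). A 503 response or an
    HTML body triggers another attempt after `delay` seconds.
    """
    last_status: Optional[int] = None
    for attempt in range(attempts):
        status, body = fetch()
        last_status = status
        if status == 200 and looks_like_jpeg(body):
            return body
        if attempt < attempts - 1:
            time.sleep(delay)
    raise RuntimeError(
        f"no valid JPEG after {attempts} attempts (last status {last_status})"
    )
```

With requests, fetch could be a small function that calls requests.get(url, headers=headers) and returns res.status_code and res.content; the returned bytes are then written to disk only once they pass the signature check.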
