wget - How to skip not found file?

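I use wget to download files from the Internet, with the -O option to save images under a custom file name. Sometimes the file is not found and the server returns a 404 error. For example, I run this command: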

wget 'http://www.example.com/path/to/image/file01928.jpg' -O myimagefile.jpg

The result is:

root@localhost:~# wget 'http://www.example.com/path/to/image/file01928.jpg' -O myimagefile.jpg
--2015-09-13 23:11:07--  http://www.example.com/path/to/image/file01928.jpg
Resolving www.example.com (www.example.com)... 93.184.216.34, 2606:2800:220:1:248:1893:25c8:1946
Connecting to www.example.com (www.example.com)|93.184.216.34|:80... connected.
HTTP request sent, awaiting response... 404 Not Found
2015-09-13 23:11:07 ERROR 404: Not Found.

Even though the file was not found, a file is still saved to my disk:

root@localhost:~# ls
myimagefile.jpg

Is there a way to skip/cancel the download (not execute the command) when the file is not found? What option should I use?

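You can perform a HEAD request to check whether the resource (the image) exists, and download it only if it does. Run wget with -S to print the headers and with --spider to check the resource without downloading it.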

From man wget:

  -S
  --server-response
      Print the headers sent by HTTP servers and responses sent by FTP servers.

  --spider
      When invoked with this option, Wget will behave as a Web spider, which means that
      it will not download the pages, just check that they are there.  For example, you
      can use Wget to check your bookmarks:

              wget --spider --force-html -i bookmarks.html

      This feature needs much more work for Wget to get close to the functionality of
      real web spiders.

Here is an example:

#!/bin/bash

# Probe each URL with --spider first; download only if wget reports that it exists.
URL='http://www.google.com'
echo "Checking $URL"
if wget -S --spider "$URL" 2>&1 | grep -q 'Remote file exists'; then
    echo "Found $URL, going to fetch it"
    wget "$URL" -O google.html
else
    echo "Url $URL does not exist!"
fi

# Same check for the image URL from the question, which returns 404.
URL='http://www.example.com/path/to/image/file01928.jpg'
echo "Checking $URL"
if wget -S --spider "$URL" 2>&1 | grep -q 'Remote file exists'; then
    echo "Found $URL, going to fetch it"
    wget "$URL" -O myimagefile.jpg
else
    echo "Url $URL does not exist!"
fi

Output:

Checking http://www.google.com
Found http://www.google.com, going to fetch it
--2015-09-14 05:26:34--  http://www.google.com/
Resolving www.google.com (www.google.com)... 74.125.239.144, 74.125.239.145, 74.125.239.146, ...
Connecting to www.google.com (www.google.com)|74.125.239.144|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [text/html]
Saving to: ‘google.html’

    [ <=>                                                    ] 18,684      --.-K/s   in 0.001s

2015-09-14 05:26:34 (13.9 MB/s) - ‘google.html’ saved [18684]

Checking http://www.example.com/path/to/image/file01928.jpg
Url http://www.example.com/path/to/image/file01928.jpg does not exist!
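If you would rather not make two requests per file, one possible alternative is to let wget attempt the download and clean up afterwards when it fails. wget exits with a non-zero status on failure (the man page lists 8 for a server error response such as 404), so that status can be used to remove the file that -O left behind. The sketch below is only an illustration under that assumption; fetch_or_skip is a hypothetical helper name, not a wget feature.

```shell
#!/bin/bash

# Hypothetical helper: download URL to DEST and remove DEST if wget fails,
# so a 404 does not leave a stray file on disk.
fetch_or_skip() {
    local url="$1" dest="$2"

    wget -q "$url" -O "$dest"
    local status=$?    # wget's exit status; 8 means the server sent an error response

    if [ "$status" -ne 0 ]; then
        rm -f -- "$dest"
        echo "Skipping $url (wget exit status $status)" >&2
        return "$status"
    fi
}
```

It could be called as fetch_or_skip 'http://www.example.com/path/to/image/file01928.jpg' myimagefile.jpg. Note that -O truncates an existing myimagefile.jpg before the request is made, so this approach is only suitable when overwriting (or losing) a previous copy is acceptable.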