Download all webpages of a specific website folder with the id with wget using bash

Question

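I need to download all the web pages, including images, from a specific folder (/content/) of a website. Accessing the folder directly returns a 403 error, but links to all of the pages are in index.html. They all follow the same pattern, "content.php?id=xx", where 'xx' is any number with two to four digits.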

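One idea is to download the whole site and delete everything except the 'content' folder, but that would waste a lot of time and bandwidth, because this is a cron job that needs to run many times. The other approach is to write a bash script, something like: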

wget -k -p http://www.example.com/content/content.php?id{{x}}

Assuming it is a bash script, how can I put a variable into the wget call so that it downloads all of the "id" pages (maybe with a for loop)?

Answer 1

How about:

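# Two- to four-digit IDs run from 10 to 9999.
for id in $(seq 10 9999); do
    # Quote the URL so the shell does not treat '?' as a glob character.
    wget -k -p "http://www.example.com/content/content.php?id=$id"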
done

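This assumes that every two- to four-digit ID is in use; otherwise you will get a lot of errors.

With more information, there may be a better solution.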
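The loop above starts one wget process per ID. A possible variant, sketched here under assumptions, prints the URL list first and hands it to a single wget through `-i -` (read URLs from standard input). The function name `content_urls` is hypothetical, and the base URL is the example one from the question.

```shell
# Hypothetical helper: print one content.php URL per ID in the range [first, last].
# The base URL is the example from the question; adjust it for the real site.
content_urls() {
    local first=$1 last=$2 id
    for id in $(seq "$first" "$last"); do
        printf 'http://www.example.com/content/content.php?id=%s\n' "$id"
    done
}

# Usage (performs network requests, so it is left commented out):
# content_urls 10 9999 | wget -k -p -i -
```

Because all URLs go to one wget invocation, it has the chance to reuse connections between requests, which a separate process per ID cannot do.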

Answer 2

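Since there is an index, ideally you can get wget to follow the links in the index but filter for only the URLs you want, instead of spidering the whole site. curl cannot parse HTML and follow the links in it, but wget can.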

wget has -A / -R accept/reject glob expressions, as well as --accept-regex / --reject-regex.

wget -p -k --recursive --level=1 -A '*/content.php?id=*'  http://www.example.com/content/index.php

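Adjust the accept pattern as needed to avoid fetching the whole site while still including what you want. The way wget applies accept/reject rules to HTML versus other file types is somewhat complicated; see the wget documentation (scroll down to the bottom of the section on accept/reject patterns). In particular, the manual notes that query strings are not treated as part of the file name for -A / -R rules, so if the glob above does not match, --accept-regex, which is tested against the complete URL, may be the better fit.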
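If wget's accept rules turn out to be awkward to tune, a different technique is to pull the links out of the index yourself with grep and fetch only those. This is a rough sketch, not what the answer above does: the function name `extract_content_urls` is hypothetical, the default base URL is the example one, and it assumes the index contains links of the form `content.php?id=N`.

```shell
# Hypothetical helper: read HTML on stdin and print one absolute URL
# for each unique content.php?id=N link found in it.
# $1 (optional) is the base URL to prepend; the default is an assumption.
extract_content_urls() {
    local base=${1:-http://www.example.com/content/} link
    grep -oE 'content\.php\?id=[0-9]+' | sort -u |
    while IFS= read -r link; do
        printf '%s%s\n' "$base" "$link"
    done
}

# Usage (performs network requests, so it is left commented out;
# the location of the index page is an assumption):
# wget -qO- http://www.example.com/index.html | extract_content_urls | wget -k -p -i -
```

This only requests IDs that the index actually links to, so it avoids the 404s of a brute-force range, at the cost of depending on the exact link format in the index.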

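The easiest way to brute-force the fetch is to use curl instead of wget, because it has range expressions. It will also reuse the same HTTP connection for multiple requests, instead of hammering the server with a new TCP connection for every request. (wget also uses HTTP keep-alive by default, but apparently that only helps when you put multiple URLs on its command line, not when you run it separately for each URL.)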

curl -L --remote-name-all --compressed --remote-time --fail 'http://www.example.com/content/content.php?id=[00-9999]'

Note the single quotes around the URL and its range expression: curl needs to see it as written, rather than bash treating it as a glob or brace expression.

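--remote-name-all: save each file under a name based on the remote name instead of writing to standard output. Older versions of curl needed -O for each URL pattern given on the command line.
-L: follow redirects (--location).
--fail: fail silently on HTTP errors (such as 404) instead of saving the ErrorDocument.
--compressed: request a compressed response (gzip content encoding) and decompress it automatically.
--remote-time: set the local file's timestamp from the remote modification time.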

I tested it to check that it works, and it looks right:

$ curl -L --remote-name-all --compressed --remote-time --fail 'http://www.example.com/content/content.php?id=[00-9999]'

[1/10000]: http://www.example.com/content/content.php?id=00 --> content.php?id=00
--_curl_--http://www.example.com/content/content.php?id=00
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0curl: (22) The requested URL returned error: 404 Not Found

[2/10000]: http://www.example.com/content/content.php?id=01 --> content.php?id=01
--_curl_--http://www.example.com/content/content.php?id=01
curl: (22) The requested URL returned error: 404 Not Found

[3/10000]: http://www.example.com/content/content.php?id=02 --> content.php?id=02

...

[100/10000]: http://www.example.com/content/content.php?id=99 --> content.php?id=99
--_curl_--http://www.example.com/content/content.php?id=99
curl: (22) The requested URL returned error: 404 Not Found

[101/10000]: http://www.example.com/content/content.php?id=100 --> content.php?id=100
--_curl_--http://www.example.com/content/content.php?id=100
curl: (22) The requested URL returned error: 404 Not Found

...
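The numbering in the output above (request 1 is id=00, request 100 is id=99, request 101 is id=100) suggests that curl pads `[00-9999]` to the two-digit width of the start value and then lets longer numbers grow naturally. The sketch below reproduces that sequence locally with seq, which can help when checking what a range will request before pointing it at a server. The function name `padded_ids` is hypothetical, and the padding rule is inferred from the output shown, not from curl's source.

```shell
# Hypothetical helper: print the integers first..last, zero-padded to at
# least `width` digits, mimicking how the curl output above numbers its IDs.
padded_ids() {
    local first=$1 last=$2 width=$3
    seq -f "%0${width}g" "$first" "$last"
}

# padded_ids 0 9999 2 corresponds to the [00-9999] range used above.
```

Since the question says the IDs have two to four digits, a range of `[10-9999]` would presumably skip the ten requests for 00 through 09.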
