How to use wget to download all URLs matching a pattern

Question

Suppose I have a website like this:

https://mywebsite.com/dir1/id-1
https://mywebsite.com/dir1/id-2
https://mywebsite.com/dir1/id-3
https://mywebsite.com/dir2/foo-id-1
https://mywebsite.com/dir2/foo-id-2
https://mywebsite.com/dir2/foo-id-3
https://mywebsite.com/dir3/list-1
https://mywebsite.com/dir3/list-2
https://mywebsite.com/dir3/list-...
https://mywebsite.com/dir3/list-n
https://mywebsite.com/dir4/another-list-type-1
https://mywebsite.com/dir4/another-list-type-2
https://mywebsite.com/dir4/another-list-type-...
https://mywebsite.com/dir4/another-list-type-n
https://mywebsite.com/random-other-directories-i-dont-care-about...

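I want to download all of the /dir1/:id and /dir2/foo-:id pages, but I also want wget to follow the links on every page in /dir1 through /dir4. Some of those directories are, for example, just lists of links to /dir/:id pages.

How can I do this? Ideally it would download as many of the :id links as possible first, rather than starting with thousands or millions of list pages.

This is not simply "mirror the site". Often when I have tried that, wget becomes too focused on links I don't care about. I would like it to prioritize downloading /dir1/:id and /dir2/foo-:id, while still collecting any links it finds on the other pages it encounters. Basically, I need some way to prioritize.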

Answer 1

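So you want neither a breadth-first nor a depth-first approach, but one that uses some notion of priority.

Unfortunately, this is not possible with Wget alone. However, with a little bash scripting you can come pretty close. I can think of two simple approaches: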

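1. First give wget the links to /dir1/ and /dir2/ and let it download recursively. Once that is done, invoke wget on mywebsite.com/ to download the remaining files. It will waste a few seconds sending HEAD requests for all the files you have already downloaded, but that's it.
2. This is similar to (1) above, except that you invoke wget once per directory with `--accept-regex`, so the directories are downloaded one after another.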
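
The two approaches above could be sketched roughly as follows. This is only an illustration under several assumptions that are not in the answer itself: the helper names (`run_pass`, `approach_one`, `approach_two`), the `DRY_RUN` and `APPROACH` switches, the regexes, the choice of `--timestamping` and `--no-parent`, and the idea that each /dirN/ has an index page to start from are all hypothetical. By default the script only prints the wget commands it would run, so nothing is fetched until you opt in.

```shell
#!/bin/sh
# Sketch only: approximates prioritized crawling with ordered wget passes.
# Assumed for illustration (not from the answer): DRY_RUN, APPROACH, the
# function names, the regexes, and that /dirN/ index pages exist.
set -u

BASE="https://mywebsite.com"
DRY_RUN="${DRY_RUN:-1}"     # 1 = only print each command; 0 = really run wget
APPROACH="${APPROACH:-1}"   # 1 = priority dirs then whole site; 2 = per-dir regex

# Run (or print) one recursive wget pass.
#   $1      = start URL
#   $2...   = extra wget options for this pass
run_pass() {
  start="$1"; shift
  # --timestamping lets later passes skip files that are already up to date.
  set -- wget --recursive --level=inf --timestamping "$@" "$start"
  if [ "$DRY_RUN" = "1" ]; then
    printf '%s\n' "$*"      # for inspection only; arguments are not shell-quoted
  else
    "$@"
  fi
}

# Approach 1: crawl the high-priority directories first, then the whole site.
approach_one() {
  run_pass "$BASE/dir1/" --no-parent   # --no-parent keeps the pass inside /dir1/
  run_pass "$BASE/dir2/" --no-parent
  run_pass "$BASE/"                    # everything that is still missing
}

# Approach 2: one pass per directory, each restricted with --accept-regex.
approach_two() {
  for dir in dir1 dir2 dir3 dir4; do
    run_pass "$BASE/$dir/" --accept-regex "mywebsite\.com/$dir/"
  done
}

case "$APPROACH" in
  2) approach_two ;;
  *) approach_one ;;
esac
```

Setting `DRY_RUN=0` makes the script execute the commands instead of printing them, and `APPROACH=2` selects the per-directory variant. The order of the `run_pass` calls is what stands in for priority here, so the regexes and start URLs would need adjusting to the real site layout.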
