WGET

Question

Happy New Year!

Has anyone successfully downloaded the embedded PDF files from a site, starting from a list of URLs stored in a .txt file?

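For example:

I have tried several variations of wget -i urlist.txt, which downloads all the HTML files perfectly, but it does not fetch the PDF embedded in each HTML page. The PDF links end in .pdf?xxxxx, that is, a query string follows the .pdf extension.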

Here is a concrete example of the problem.

For this dataset I have put all the links from both pages into url.txt:

https://law.justia.com/cases/washington/court-of-appeals-division-i/2014/

One example URL from this dataset:

https://law.justia.com/cases/washington/court-of-appeals-division-i/2014/70147-9.html

The embedded PDF link is:

https://cases.justia.com/washington/court-of-appeals-division-i/2014-70147-9.pdf?ts=1419887549

The PDF file name is actually "2014-70147-9.pdf?ts=1419887549", so it ends in .pdf?ts=xxxxxxxxxx.

The ts value is different for every file.

The URL list contains 795 links. Does anyone have a working method for downloading every .html in my urls.txt and, alongside each .html, the matching .pdf?xxxxxxxxxx file?

Thanks!

~Brandon

Answer 1

You are looking for a web scraper. If you use one, take care not to break any rules.

You can also use some string manipulation in a bash script to process the content that wget retrieves.
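
A minimal sketch of that string-manipulation idea, not a tested scraper for this site. It assumes each case page contains the PDF link as a plain double-quoted URL on cases.justia.com, as in the one example from the question. The function names extract_pdf_links and pdf_filename, and the file pdf_urls.txt, are hypothetical.

```shell
#!/usr/bin/env bash

# Read HTML on stdin and print each distinct cases.justia.com PDF URL,
# including any ?ts=... query string.
# Assumption: the URL appears literally in the page, delimited by a
# double quote, angle bracket, or space.
extract_pdf_links() {
    grep -oE 'https://cases\.justia\.com/[^"<> ]+\.pdf(\?[^"<> ]*)?' | sort -u
}

# Turn a PDF URL into a local file name without the query string,
# e.g. ".../2014-70147-9.pdf?ts=1419887549" becomes "2014-70147-9.pdf".
pdf_filename() {
    printf '%s\n' "$1" | sed -e 's|.*/||' -e 's|?.*||'
}

# Hypothetical usage, assuming urls.txt holds one case page URL per line:
#   while read -r page; do
#       wget -qO- "$page" | extract_pdf_links
#   done < urls.txt > pdf_urls.txt
#   while read -r pdf; do
#       wget -O "$(pdf_filename "$pdf")" "$pdf"
#   done < pdf_urls.txt
```

Saving with -O under a cleaned name avoids ending up with files literally called 2014-70147-9.pdf?ts=1419887549.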

Answer 2

Try the following:

wget --level 1 --recursive --span-hosts --accept-regex 'https://law\.justia\.com/cases/washington/court-of-appeals-division-i/2014/.*html|https://cases\.justia\.com/washington/court-of-appeals-division-i/.*\.pdf.*' --input-file=urllist.txt

For details on the --level, --recursive, --span-hosts, --accept-regex and --input-file options, see the wget documentation at https://www.gnu.org/software/wget/manual/html_node/index.html.

You will also need to understand how regular expressions work. A good place to start is https://www.grymoire.com/Unix/Regular.html
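
Before starting a 795-page crawl, it may help to check offline which URLs the pattern lets through. The sketch below reuses the pattern from the wget command above, with the literal dots escaped, and filters sample URLs with grep -E. This only approximates wget's own matching, and the names accept_regex and would_accept are hypothetical.

```shell
#!/usr/bin/env bash

# Pattern taken from the --accept-regex option above; "\." matches a
# literal dot, while an unescaped "." would match any character.
accept_regex='https://law\.justia\.com/cases/washington/court-of-appeals-division-i/2014/.*html|https://cases\.justia\.com/washington/court-of-appeals-division-i/.*\.pdf.*'

# Read URLs on stdin and print only those the pattern matches.
would_accept() {
    grep -E -- "$accept_regex"
}

# Hypothetical usage:
#   printf '%s\n' \
#     'https://law.justia.com/cases/washington/court-of-appeals-division-i/2014/70147-9.html' \
#     'https://cases.justia.com/washington/court-of-appeals-division-i/2014-70147-9.pdf?ts=1419887549' \
#     'https://law.justia.com/about/' | would_accept
```

The trailing .* after \.pdf is what allows the ?ts=... query string to follow the extension.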

WGET - how to download embedded pdf's that have a download button from a text file URL list? Is it possible?

linux

pdf

web-scraping