How to check if webpages contain X and then get their URL using wget

Question

I want to crawl a website and, if some text or a matching pattern is found in a page's HTML, get the URL(s) of that page.

I wrote this command:

wget --recursive --spider site.com 2>&1 | sort | uniq | grep -oe 'www[^ ]*'

This gets me all the URLs so far, but I still can't figure out how to output only the URLs of pages that contain the specified text. Any clues?

Answer 1

"crawl a website and, if some text or a matching pattern is found in the HTML"

This is not possible with wget --spider. The wget manual says the following about --spider:

When invoked with this option, Wget will behave as a Web spider, which means that it will not download the pages, just check that they are there. For example, you can use Wget to check your bookmarks:

wget --spider --force-html -i bookmarks.html

This feature needs much more work for Wget to get close to the functionality of real web spiders.

wget with the --spider option does fetch the response headers, which you can print as follows:

wget --spider --server-response http://www.example.com

The output includes information about the file, such as Content-Length, which reports the file size, but it does not include the file's contents.
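Because --spider does not keep page bodies, one possible workaround is to let wget actually download the pages into a scratch directory and then search the saved files with grep. The sketch below is only an illustration under stated assumptions: the function names list_matching_urls and crawl_and_grep are made up here, the crawl depth of 2 is arbitrary, and the URL is rebuilt from wget's default host/path directory layout with an assumed scheme (http unless you pass another one).

```shell
#!/usr/bin/env bash

# list_matching_urls DIR PATTERN [SCHEME]
# Print one URL per saved file under DIR whose contents match PATTERN
# (an extended regex). Assumes DIR was filled by "wget --recursive",
# which stores pages as DIR/<host>/<path>. SCHEME defaults to http;
# that default is an assumption, not something wget records.
list_matching_urls() {
    local dir=$1 pattern=$2 scheme=${3:-http}
    grep -rlE -- "$pattern" "$dir" |
        while IFS= read -r file; do
            # Drop the scratch-directory prefix, leaving host/path.
            printf '%s://%s\n' "$scheme" "${file#"$dir"/}"
        done | sort -u
}

# crawl_and_grep URL PATTERN [SCHEME]
# Download URL recursively into a temporary directory, list the pages
# that contain PATTERN, then delete the downloaded copies.
crawl_and_grep() {
    local url=$1 pattern=$2 scheme=${3:-http} dir
    dir=$(mktemp -d) || return 1
    # --level=2 is an arbitrary depth limit chosen for this sketch.
    wget --recursive --level=2 --no-parent --quiet \
         --directory-prefix="$dir" "$url"
    list_matching_urls "$dir" "$pattern" "$scheme"
    rm -rf -- "$dir"
}
```

After sourcing the file, a call such as crawl_and_grep http://www.example.com 'some text' would print the matching pages. Note that wget saves directory URLs as index.html, so a match in a site's front page shows up with index.html appended, and the printed URLs are reconstructed from file names rather than taken from wget's log.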

Tags: bash, awk, grep, wget