Extract URLs from HTML and JS source, with multiple on a line

Question

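I want to list all the domains referenced by our source code. It is fine to find only those that are referenced statically and begin with https?://. For example, I tried the following:

find -s [^.]* -print0 | xargs -0 sed -En 's/.*https?:\/\/([a-z0-9\-\.\_]+).*/\1/p' | sort | uniq

The bug is that when a line contains multiple domains, only one is returned. Can this be solved with simple shell tools, i.e. without fully parsing the HTML?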

Answer 1

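The regex .* is greedy, so putting it at both the start and the end of the pattern makes the match span the whole line and discards any other URLs on it: the leading .* runs up to the last https?:// it can find, and only that host ends up in the capture group.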
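The effect is easy to reproduce on a single line. The snippet below is a minimal sketch: the sample line and the variable names are made up for illustration, and the character class is written as [a-z0-9._-] (hyphen last) to sidestep differences in how sed implementations treat backslashes inside brackets.

```shell
# Hypothetical sample line with two URLs (not from the original post).
line='<a href="http://a.example/x">A</a> <img src="https://b.example/y.png">'

# The leading .* backtracks only as far as the last "http(s)://" on the line,
# and the trailing .* swallows everything after the captured host,
# so a single substitution can report at most one host per line.
first_attempt=$(printf '%s\n' "$line" | sed -En 's/.*https?:\/\/([a-z0-9._-]+).*/\1/p')

printf '%s\n' "$first_attempt"
```

Only the host of the last URL on the line is printed; a.example is silently dropped.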

Standard grep cannot print a capture group like ([a-z0-9-._]+), but if you have perl, replace this:

sed -En 's/.*https?:\/\/([a-z0-9\-\.\_]+).*/\1/p'

with this:

perl -nle 'print $1 while m{https?://([a-z0-9-._]+)}g'

Your final command would be:

find -s [^.]* -print0 | xargs -0 perl -nle 'print $1 while m{https?://([a-z0-9-._]+)}g' | sort | uniq
