grep all Gentoo Stage3 links to the terminal

Question

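I want to print all the links from https://www.gentoo.org/downloads/mirrors/ to the terminal.

First, the script would use wget to save the page to a file named index.html; then a grep or sed command would simply print every https://, http:// and ftp:// link to the terminal.

Can someone help me with this command? I know it is simple, but I am not familiar with either of these commands.

What I tried: grep "<code>" index.html

Output: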

              <a href="ftp://mirrors.tera-byte.com/pub/gentoo"><code>ftp://mirrors.tera-byte.com/pub/gentoo</code></a>
              <a href="http://gentoo.mirrors.tera-byte.com/"><code>http://gentoo.mirrors.tera-byte.com/</code></a>
              <a href="rsync://mirrors.tera-byte.com/gentoo"><code>rsync://mirrors.tera-byte.com/gentoo</code></a>

How can I remove the whitespace, the tags, and all the unnecessary text around each link?

Answer 1

You can use grep with this pattern (the -P option needs a grep built with PCRE support, such as GNU grep):

grep -Po "(?<=<code>)(https?|ftp)(.*)(?=<\/code>)" index.html

First 3 lines of output:

ftp://mirrors.tera-byte.com/pub/gentoo
http://gentoo.mirrors.tera-byte.com/
ftp://mirror.csclub.uwaterloo.ca/gentoo-distfiles/
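
Not every grep is built with -P. As a rough sketch of the same idea using only extended regular expressions, the helper below lets grep match the opening <code> tag together with the link, and then strips the tag with sed. The function name links_from_stdin is made up for this example, and the input is the three sample lines from the question rather than the live page.

```shell
#!/bin/sh
# links_from_stdin: hypothetical helper; reads HTML on standard input and
# prints the http, https and ftp links found right after a <code> tag.
links_from_stdin() {
    # -E: extended regex, -o: print only the matched part of each line.
    grep -Eo '<code>(https?|ftp)://[^<]+' |
        sed 's/^<code>//'   # drop the leading tag that grep had to match
}

# Demo on the three lines shown in the question.
links_from_stdin <<'EOF'
              <a href="ftp://mirrors.tera-byte.com/pub/gentoo"><code>ftp://mirrors.tera-byte.com/pub/gentoo</code></a>
              <a href="http://gentoo.mirrors.tera-byte.com/"><code>http://gentoo.mirrors.tera-byte.com/</code></a>
              <a href="rsync://mirrors.tera-byte.com/gentoo"><code>rsync://mirrors.tera-byte.com/gentoo</code></a>
EOF
```

On this sample it prints the ftp and http links and skips the rsync one, because rsync:// does not match the (https?|ftp) alternation.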

Answer 2

If you only want the links themselves, you can try this grep:

grep -Eo '(https?|ftp)://[^<>"]+' index.html

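This prints only the http, https and ftp matches. In the sample above each link appears twice per line (once in the href attribute and once inside <code>), so expect duplicates unless you pipe the output through sort -u.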

If you need to match only inside the <code> blocks, this sed will do:

sed -En '/<code>/ s#.*<code>((https?|ftp)://[^<]*)</code>.*#\1#p' index.html
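
The two steps from the question (download, then filter) could be tied together roughly as follows. This is only a sketch: print_links is a made-up function name, and a temporary file with the sample lines from the question stands in for the downloaded index.html so that the example runs offline. The wget call for the real page is left as a comment.

```shell
#!/bin/sh
# print_links FILE: hypothetical helper that prints each http, https or ftp
# link found inside <code>...</code> in FILE, one per line, without repeats.
print_links() {
    # -n suppresses default output; the p flag prints only substituted lines.
    sed -En 's#.*<code>((https?|ftp)://[^<]*)</code>.*#\1#p' "$1" | sort -u
}

# Offline demo: a temporary file holding the three lines from the question.
tmp=$(mktemp)
cat > "$tmp" <<'EOF'
              <a href="ftp://mirrors.tera-byte.com/pub/gentoo"><code>ftp://mirrors.tera-byte.com/pub/gentoo</code></a>
              <a href="http://gentoo.mirrors.tera-byte.com/"><code>http://gentoo.mirrors.tera-byte.com/</code></a>
              <a href="rsync://mirrors.tera-byte.com/gentoo"><code>rsync://mirrors.tera-byte.com/gentoo</code></a>
EOF
print_links "$tmp"
rm -f "$tmp"

# With the real page (needs network access):
# wget -q -O index.html https://www.gentoo.org/downloads/mirrors/
# print_links index.html
```

Using # as the sed delimiter keeps | free for the (https?|ftp) alternation, and sort -u removes any link that is listed more than once.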
