Get URLs from a remote page and then download to txt file

Question

I have tried many suggestions but I can't find a solution (I don't know whether it is even possible). I am using the terminal on Ubuntu 15.04.

I need to save to a text file all internal and external links on my website that start with mywebsite.com/links_ (all of the links I want begin with links_), for example http://www.mywebsite.com/links_sony.aspx. I don't need any of the other links, e.g. mywebsite.com/index.aspx or conditions.asp, etc. So far I use:

wget --spider --recursive --no-verbose --output-file="links.csv" http://www.mywebsite.com

Can you help me? Thanks in advance.

Answer 1

If you don't mind using a few other tools to coax wget, you can try this bash script, which uses awk, grep, wget and lynx:

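#!/bin/bash
# Usage: ./getlinks URL PATTERN
# Dump the page with lynx, keep the URL field of each line containing "http",
# keep only URLs matching PATTERN, then fetch each one with wget.
lynx --dump "$1" | awk '/http/{print $2}' | grep "$2" > /tmp/urls.txt
for i in $( cat /tmp/urls.txt ); do wget "$i"; done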

Save the script above as getlinks, make it executable (chmod +x getlinks), and then run it as

./getlinks 'http://www.mywebsite.com' 'links_' > mycollection.txt

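This approach doesn't load or need many extra tools; instead it reuses commonly available ones.

Depending on the shell you use, you may need to adjust the quoting. The above works in standard bash and does not depend on specific versions of these tools.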
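As a rough illustration of the filtering step on its own, here is a minimal sketch that can be tried without touching the network. The function name filter_links and the canned input are assumptions made for this example, not part of the script above; the sketch also adds sort -u to drop duplicates and grep -F so the pattern is matched literally.

```shell
#!/bin/bash
# Hypothetical helper (not part of the original script): read text in the
# style of `lynx --dump` on stdin and print only the URLs that contain the
# given pattern, one per line, with duplicates removed.
filter_links() {
    local pattern="$1"
    # Reference lines look like "   1. http://host/page"; field 2 is the URL.
    awk '/http/ {print $2}' | grep -F -- "$pattern" | sort -u
}

# Example with canned input instead of a live page:
printf '%s\n' \
    '   1. http://www.mywebsite.com/index.aspx' \
    '   2. http://www.mywebsite.com/links_sony.aspx' \
    '   3. http://www.mywebsite.com/links_sony.aspx' |
    filter_links 'links_'
```

If only the list of links is wanted in a text file, something like lynx --dump 'http://www.mywebsite.com' | filter_links 'links_' > links.txt should be enough, with no wget step at all.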

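In the script, you can customize the part

do wget

with switches appropriate to your specific needs, such as recursive, spider, verbose, etc. Insert these switches between wget and $i.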
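One possible way to make those switches easy to change is to pass them as arguments to a small wrapper. This is only a sketch: the names fetch_all and WGET are made up for illustration, and the loop reads the URL file line by line instead of using $( cat ... ).

```shell
#!/bin/bash
# Assumed names: WGET (the command to run) and fetch_all are illustrative only.
WGET="${WGET:-wget}"

# Run the downloader once per URL listed in a file, placing any extra
# switches between the command name and the URL.
fetch_all() {
    local url_file="$1"
    shift
    local url
    while IFS= read -r url; do
        [ -n "$url" ] || continue    # skip blank lines
        "$WGET" "$@" "$url"
    done < "$url_file"
}
```

For example, fetch_all /tmp/urls.txt --spider --no-verbose would check each link without downloading it, using the same switches as in the question.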
