Find the web trace to a web list in Heritrix

I have recently been using the web crawler Heritrix at the company where I work, and after some time searching and testing I have not found a way to solve our requirement.

We want to run Heritrix automatically from cron every day to crawl a list of web pages, and what we want to do is check whether any of those sites link to the sites in our list of domains. The hard part is recording the whole trace of links that leads to one of our domains.

This is because the job's log file stores some information about every link, but not the trace. One example would be to run a script when the job finishes that greps for brazzers, one of the domains in the list, so that if it finds "brazzers" in the crawl log it would write the result to another log, traced from beginning to end:

2015-10-25T20:18:58.369Z 200 91 http://cdn1.ads.brazzers.com/robots.txt XLEP http://cdn1.ads.brazzers.com/ text/plain #021 20151025201857643+726 sha1:CPA63O5POU3CVLCH3VDDIMBJCCWRVLPC - -

Is it possible to do this, or is there some other way? I feel like this is a silly question, and I'm not very good at programming either.

Thank you very much.

Enrique.

There is actually a way to analyze the final log of a crawl job once it has finished. Thanks to a reply from a Heritrix developer (https://groups.yahoo.com/neo), I now have the rule for obtaining the web link trace:

The fourth field of a line in the crawl.log is the URI that was downloaded. The sixth field of the line tells you the URI that referred (directly preceded) the downloaded URI given in the fourth field. So generally, if you find "ourdomain" in the fourth field of a line, then you take the URI in the sixth field of that line and look for that as a fourth field in the crawl.log, you can find its referrer and follow back in this pattern until you hit a seed URI. You should know when you get to a seed URI because the sixth field will have a "-" instead of a URI (the discovery path given in the fifth field will also be a "-").

In this way you can get the particular path that this crawl instance took from the seed to "ourdomain", though there may be multiple other paths existing that the crawler did not take in this instance.

With this, one way to assemble the lines of the log file into the web link trace is to write a small snippet, in PHP for example, that follows the rules given above:
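Below is a minimal sketch of such a snippet, assuming the job's crawl.log sits next to the script and that a simple substring test is good enough to match the domain; the file name, the $target value and the output format are my own choices, not anything Heritrix prescribes.

```php
<?php
// Minimal sketch: for every crawl.log line whose downloaded URI (4th field)
// mentions the target domain, walk the referrer chain (6th field) back to
// the seed and print the whole path. "crawl.log" and "$target" are
// placeholders to adapt to your own job directory and domain list.

$logPath = 'crawl.log';
$target  = 'brazzers';

// Index every downloaded URI to its parsed log line so referrer
// lookups are cheap, and remember which URIs match the target.
$byUri = [];
$hits  = [];

foreach (file($logPath, FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES) as $line) {
    $fields = preg_split('/\s+/', trim($line));
    if (count($fields) < 6) {
        continue;                    // skip malformed lines
    }
    $uri = $fields[3];               // 4th field: the URI that was downloaded
    $byUri[$uri] = $fields;
    if (strpos($uri, $target) !== false) {
        $hits[] = $uri;
    }
}

foreach ($hits as $uri) {
    $trace = [$uri];
    $seen  = [$uri => true];         // guard against referrer loops
    while (isset($byUri[$uri])) {
        $referrer = $byUri[$uri][5]; // 6th field: the referring URI
        if ($referrer === '-' || isset($seen[$referrer])) {
            break;                   // "-" means we reached a seed URI
        }
        $trace[]         = $referrer;
        $seen[$referrer] = true;
        $uri             = $referrer;
    }
    // Print the path from the seed down to the matching URI.
    echo implode(' -> ', array_reverse($trace)), PHP_EOL;
}
```

Each output line is one path from a seed URI down to the URI that matched the domain; as noted in the quoted answer, it is only the path this particular crawl happened to take, so other paths may exist that the crawler did not follow.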