Retrieving a complete webpage, including dynamically loaded links/images

Question

I am trying to download a complete, offline-working copy of a website that dynamically loads links/images.

Research

There are questions (e.g. [1], [3]) on Stack Overflow addressing this issue, and most of the top answers use wget or httrack. Both fail miserably (please do correct me if I am wrong) on pages that dynamically load links, use srcset instead of src on the img tag, or load anything via JS. The obvious alternative is Selenium; however, if you have ever used Selenium in production, you quickly see the issues that arise from such a decision: it is resource-heavy, quite complex to use with a head-full driver, and simply not built for this purpose. That said, some people claim to have used it in production without trouble for years.
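To make the srcset complaint concrete: tools that only look at the src attribute miss every candidate URL listed in srcset, which has to be split apart by hand. A minimal sketch (the sample attribute value below is hypothetical):

```python
# Parse an <img srcset="..."> attribute value into its candidate URLs.
# Each comma-separated candidate is "URL [width/density descriptor]";
# a mirroring tool that only reads "src" never sees these URLs.

def parse_srcset(srcset: str) -> list[str]:
    """Return the candidate URLs from a srcset attribute value."""
    urls = []
    for candidate in srcset.split(","):
        candidate = candidate.strip()
        if candidate:
            # Keep only the URL, dropping the "480w" / "2x" descriptor.
            urls.append(candidate.split()[0])
    return urls

sample = "img/small.jpg 480w, img/medium.jpg 800w, img/large.jpg 1200w"
print(parse_srcset(sample))
# → ['img/small.jpg', 'img/medium.jpg', 'img/large.jpg']
```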

Expected solution

A script (preferably in Python) that parses the page's links and loads them separately. I cannot seem to find any existing script that does this. If your solution is "so implement your own", then there was no point in asking the question in the first place; I am looking for an existing implementation.

Examples

  1. Shopify.com
  2. A website built with Wix

There are now headless versions of Selenium, and alternatives such as PhantomJS, either of which can be paired with a small script to scrape any dynamically loaded website.

I implemented a generic scraper here, and explained more about the topic here.