Load a SPA webpage via AJAX

Question

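I'm trying to fetch an entire webpage with JavaScript by plugging in its URL. However, the website is built as a single-page application (SPA) that uses JavaScript / backbone.js to load most of its content dynamically after the initial response has been rendered.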

For example, when I navigate to the following address:

https://connect.garmin.com/modern/activity/1915361012

and then enter this in the console (after the page has loaded):

var $page = $("html")
console.log("%c✔: ", "color:green;", $page.find(".inline-edit-target.page-title-overflow").text().trim());
console.log("%c✔: ", "color:green;", $page.find("footer .details").text().trim());

then I get both the dynamically loaded activity title and the statically loaded footer.

However, when I try to load the webpage through an AJAX call with $.get() or .load(), I only receive the initial response (the same content you see with view-source):

view-source:https://connect.garmin.com/modern/activity/1915361012

So if I use either of the following AJAX calls:

// jQuery.get()
var url = "https://connect.garmin.com/modern/activity/1915361012";
jQuery.get(url,function(data) {
    var $page = $("<div>").html(data)
    console.log("%c✖: ", "color:red;",   $page.find(".page-title").text().trim());
    console.log("%c✔: ", "color:green;", $page.find("footer .details").text().trim());
});

// jQuery.load()
var url = "https://connect.garmin.com/modern/activity/1915361012";
var $page = $("<div>")
$page.load(url, function(data) {
    console.log("%c✖: ", "color:red;",   $page.find(".page-title").text().trim());
    console.log("%c✔: ", "color:green;", $page.find("footer .details").text().trim());
});

I still get the initial footer, but none of the other page content.

I've tried the solution here of calling eval() on the contents of every script tag, but that doesn't appear to be robust enough to actually load the page:

jQuery.get(url,function(data) {
    var $page = $("<div>").html(data)
    $page.find("script").each(function() {
        var scriptContent = $(this).html(); //Grab the content of this tag
        eval(scriptContent); //Execute the content
    });
    console.log("%c✖: ", "color:red;",   $page.find(".page-title").text().trim());
    console.log("%c✔: ", "color:green;", $page.find("footer .details").text().trim());
});

Q: Is there any way to fully load a webpage so that it can be scraped with JavaScript?

Answer 1

First off: avoid eval. Your content security policy should block it, and it leaves you open to XSS attacks. Scraping bots definitely won't run it.

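The problem you're describing is common to all SPAs. When a person visits, they get your app shell script, which then loads the rest of the content, and all is well. When a bot visits, it ignores the scripts and returns the empty shell.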

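The solution is server-side rendering. One way to do this, if you're using a JS renderer (say React) and Node.js on the server, is to build the JS on the server and serve the result statically, which is fairly easy.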

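However, if you're not, then you'll need to run a headless browser on the server that executes all the JS a user would and then serves the result to the bot.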
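As a rough illustration of the "serve the rendered result to bots" idea, here is a minimal sketch of the decision step only. The crawler list in `BOT_PATTERN` and the names `isBot` and `chooseResponse` are assumptions made up for this example, not part of any particular prerendering service; a real setup would plug the result into whatever actually produces the prerendered HTML.

```javascript
// Hypothetical list of crawler substrings; a real deployment would maintain its own.
const BOT_PATTERN = /googlebot|bingbot|baiduspider|yandex|duckduckbot|slurp/i;

// Returns true when the User-Agent header looks like a known crawler.
function isBot(userAgent) {
  return typeof userAgent === 'string' && BOT_PATTERN.test(userAgent);
}

// Decide which variant of the page to serve for a given User-Agent.
// 'prerendered' stands for HTML produced by a headless browser or SSR,
// 'app-shell' for the normal SPA shell that loads content client-side.
function chooseResponse(userAgent) {
  return isBot(userAgent) ? 'prerendered' : 'app-shell';
}

console.log(chooseResponse('Mozilla/5.0 (compatible; Googlebot/2.1)'));
console.log(chooseResponse('Mozilla/5.0 (Macintosh) Chrome/60.0'));
```

User-Agent sniffing is only a heuristic; it is used here because it keeps the sketch small.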

Fortunately, someone else has already done all the work here. They've put a demo online that you can try out with your site.

Answer 2

You will never be able to fully replicate what an arbitrary (SPA) page does.

The only way I see is to use a headless browser such as PhantomJS, Headless Chrome, or Headless Firefox.

I wanted to try Headless Chrome, so let's see what it can do with your page:

Quick check using the internal REPL

Load the page with Chrome Headless (you need Chrome 59 on Mac/Linux, Chrome 60 on Windows), then find the page title with JavaScript from the REPL:

% chrome --headless --disable-gpu --repl https://connect.garmin.com/modern/activity/1915361012
[0830/171405.025582:INFO:headless_shell.cc(303)] Type a Javascript expression to evaluate or "quit" to exit.
>>> $('body').find('.page-title').text().trim() 
{"result":{"type":"string","value":"Daily Mile - Round 2 - Day 27"}}

Note: to get the chrome command line working on a Mac, I did this beforehand:

alias chrome="'/Applications/Google Chrome.app/Contents/MacOS/Google Chrome'"

Programmatically using Node and Puppeteer

Puppeteer is a Node library (by Google Chrome developers) which provides a high-level API to control headless Chrome over the DevTools Protocol. It can also be configured to use full (non-headless) Chrome.

(Step 0: install Node & Yarn if you don't have them)

In a new directory:

yarn init
yarn add puppeteer

Create index.js with this:

const puppeteer = require('puppeteer');
(async() => {
    const url = 'https://connect.garmin.com/modern/activity/1915361012';
    const browser = await puppeteer.launch();
    const page = await browser.newPage();
    // Go to URL and wait for the network to go quiet.
    // Note: Puppeteer 1.0+ replaced 'networkidle' with 'networkidle0' / 'networkidle2'.
    await page.goto(url, {waitUntil: 'networkidle2'});
    // Wait for the results to show up
    await page.waitForSelector('.page-title');
    // Extract the results from the page
    const text = await page.evaluate(() => {
        const title = document.querySelector('.page-title');
        return title.innerText.trim();
    });
    console.log(`Found: ${text}`);
    // close() returns a promise, so wait for the browser to shut down
    await browser.close();
})();

Result:

$ node index.js 
Found: Daily Mile - Round 2 - Day 27

Answer 3

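I think you should understand the concept of an SPA first. An SPA is a single-page application: the server delivers just one static HTML file. When the route changes, the page creates or modifies DOM nodes dynamically with JavaScript to achieve the effect of switching pages.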

So if you use $.get(), the server responds with that static HTML file, whose page content never changes, and you won't get the content you want.
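To make this concrete, here is a minimal sketch that runs in plain Node without a browser. The `shellHtml` string and the `textOfClass` helper are made up for illustration (they are not the real site's markup or any library's API). The point is that the response text already contains the static footer, while the title element stays empty because nothing ever executes the script that would fill it.

```javascript
// Hypothetical initial response of an SPA: an empty shell plus a script
// that would fill in the title once a browser executes it.
const shellHtml = `
<html>
  <body>
    <h1 class="page-title"></h1>
    <footer><div class="details">Static footer</div></footer>
    <script>
      document.querySelector('.page-title').textContent = 'Loaded by JS';
    </script>
  </body>
</html>`;

// Naive text lookup for the first element carrying a given class.
// A regex is enough for this fixed sample; it is not a general HTML parser,
// and className is assumed to contain no regex metacharacters.
function textOfClass(html, className) {
  const pattern = new RegExp(
    `<[a-z0-9]+[^>]*class="[^"]*\\b${className}\\b[^"]*"[^>]*>([^<]*)<`, 'i');
  const match = html.match(pattern);
  return match ? match[1].trim() : null;
}

// The script in the shell is only text here, so the title stays empty.
console.log(JSON.stringify(textOfClass(shellHtml, 'page-title'))); // ""
console.log(JSON.stringify(textOfClass(shellHtml, 'details')));    // "Static footer"
```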

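If you want to use $.get(), there are two ways. The first is to use a headless browser, such as Headless Chrome or PhantomJS, to load the page for you, so that you can get the DOM nodes of the loaded page. The second is SSR (Server-Side Rendering): with SSR, $.get() returns the HTML page data directly, because the server responds with the HTML of the corresponding page for each route requested.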
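A minimal sketch of the SSR idea, assuming a hypothetical in-memory `activities` store and a made-up `renderActivityPage` function (a real application would use its framework's renderer, for example Nuxt.js for Vue). The server builds the complete HTML for the requested route, so a plain HTTP request already sees the title:

```javascript
// Hypothetical in-memory data standing in for a database or API.
const activities = {
  '1915361012': { title: 'Daily Mile - Round 2 - Day 27' },
};

// Escape text before placing it into HTML.
function escapeHtml(text) {
  return String(text)
    .replace(/&/g, '&amp;')
    .replace(/</g, '&lt;')
    .replace(/>/g, '&gt;')
    .replace(/"/g, '&quot;');
}

// Render the full page for a route on the server.
// Returns null when the route or the activity is unknown.
function renderActivityPage(path) {
  const match = /^\/modern\/activity\/(\d+)$/.exec(path);
  const activity = match && activities[match[1]];
  if (!activity) return null;
  return '<html><body>' +
    `<h1 class="page-title">${escapeHtml(activity.title)}</h1>` +
    '</body></html>';
}

console.log(renderActivityPage('/modern/activity/1915361012'));
```

Because the title is already in the markup, the `.page-title` lookup from the question would succeed on the response of $.get() without executing any scripts.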

References:

SSR

Vue's SSR framework: Nuxt.js

PhantomJS

Node API of Headless Chrome
