Scraping a JavaScript-rendered webpage that references external JavaScript scripts in R
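I am trying to scrape this webpage: https://www.mustardbet.com/sports/events/302698
Since the page appears to be rendered dynamically, I followed this tutorial:
https://www.datacamp.com/community/tutorials/scraping-javascript-generated-data-with-r#gs.dZEqev8
As the tutorial suggests, I saved a file named "scrape_mustard.js" with the following code: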
// scrape_mustard.js
var webPage = require('webpage');
var page = webPage.create();
var fs = require('fs');
var path = 'mustard.html'
page.open('https://www.mustardbet.com/sports/events/302698', function (status) {
var content = page.content;
fs.write(path,content,'w')
phantom.exit();
});
Then I run:
system("./phantomjs scrape_mustard.js")
But I get this error message:
ReferenceError: Can't find variable: Set
https://www.mustardbet.com/assets/js/index.dfd873fb.js:1
https://www.mustardbet.com/assets/js/index.dfd873fb.js:1 in t
https://www.mustardbet.com/assets/js/index.dfd873fb.js:1
https://www.mustardbet.com/assets/js/index.dfd873fb.js:1 in t
https://www.mustardbet.com/assets/js/index.dfd873fb.js:1
https://www.mustardbet.com/assets/js/index.dfd873fb.js:1 in t
https://www.mustardbet.com/assets/js/index.dfd873fb.js:1
https://www.mustardbet.com/assets/js/index.dfd873fb.js:1 in t
https://www.mustardbet.com/assets/js/index.dfd873fb.js:1
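Now, when I paste "https://www.mustardbet.com/assets/js/index.dfd873fb.js" into my browser, I can see that it is JavaScript, and I probably need to either
(1) save it as a file, or
(2) include it in scrape_mustard.js.
But for (1), I don't know how to reference the new file, and for (2), I don't know how to define all of that JavaScript correctly so that it can be used.
I am new to JavaScript, but maybe this problem is not too hard?
Thanks for your help!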
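I was able to scrape the page using the JS module puppeteer.js.
Download and install node.js. node.js comes with npm, which makes installing modules easier. You need to use npm to install puppeteer.
In RStudio, make sure you are in your working directory when you install puppeteer.js. Once node.js is installed, run: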
system("npm i puppeteer")
scrape_mustard.js:
// load modules
const fs = require("fs");
const puppeteer = require("puppeteer");
// page url
const url = "https://www.mustardbet.com/sports/events/302698";
const scrape = async () => {
  const browser = await puppeteer.launch({headless: false}); // open a visible browser window
  const page = await browser.newPage(); // open new page
  await page.goto(url, {waitUntil: "networkidle2", timeout: 0}); // go to page; timeout: 0 disables the navigation timeout
  // give the page 5 seconds to load all the javascript rendered content
  // (a plain timer is used because page.waitFor is deprecated in later puppeteer releases)
  await new Promise((resolve) => setTimeout(resolve, 5000));
  const html = await page.content(); // copy page contents
  await browser.close(); // close chromium
  return html; // return the HTML as a string
};
scrape().then((value) => {
  fs.writeFileSync("./Whosebug/page.html", value); // write the string returned by scrape()
});
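The fixed five-second pause in the script is a guess at how long rendering takes. One possible alternative, sketched here, waits until the element of interest exists and then reads the text inside the page itself. It assumes the `.odds-major` and `h3` selectors from the R code in this answer and assumes both match the same number of elements in the same order. `scrapeOdds`, its `browser` parameter, and the `oddsMajor` field are hypothetical names; the browser object is passed in so the function can be tried without launching Chromium.

```javascript
// Hypothetical helper: collect bet names and odds from an already-launched browser.
// `browser` is expected to behave like the object returned by puppeteer.launch().
async function scrapeOdds(browser, url) {
  const page = await browser.newPage();
  try {
    await page.goto(url, { waitUntil: "networkidle2" });
    // Wait for the rendered content instead of sleeping for a fixed time.
    await page.waitForSelector(".odds-major");
    // $$eval runs the callback inside the page on every matching element.
    const odds = await page.$$eval(".odds-major", (els) =>
      els.map((el) => el.textContent.trim())
    );
    const names = await page.$$eval("h3", (els) =>
      els.map((el) => el.textContent.trim())
    );
    // Assumption: the i-th name belongs to the i-th odds value.
    return names.map((name, i) => ({ name: name, oddsMajor: odds[i] }));
  } finally {
    await page.close(); // always release the tab, even if a step above throws
  }
}
```

With puppeteer installed, this could be called from inside an async function as `const browser = await puppeteer.launch(); const rows = await scrapeOdds(browser, url); await browser.close();`, and the rows written out with `JSON.stringify` for R to read.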
To run scrape_mustard.js in R:
library(magrittr)
system("node ./Whosebug/scrape_mustard.js")
html <- xml2::read_html("./Whosebug/page.html")
oddsMajor <- html %>%
rvest::html_nodes(".odds-major")
betNames <- html %>%
rvest::html_nodes("h3")
Console output:
{xml_nodeset (60)}
[1] <span class="odds-major">2</span>
[2] <span class="odds-major">14</span>
[3] <span class="odds-major">15</span>
[4] <span class="odds-major">16</span>
[5] <span class="odds-major">17</span>
[6] <span class="odds-major">23</span>
[7] <span class="odds-major">25</span>
[8] <span class="odds-major">32</span>
[9] <span class="odds-major">33</span>
[10] <span class="odds-major">39</span>
[11] <span class="odds-major">47</span>
[12] <span class="odds-major">54</span>
[13] <span class="odds-major">55</span>
[14] <span class="odds-major">58</span>
[15] <span class="odds-major">58</span>
[16] <span class="odds-major">64</span>
[17] <span class="odds-major">73</span>
[18] <span class="odds-major">73</span>
[19] <span class="odds-major">92</span>
[20] <span class="odds-major">98</span>
...
> betNames
{xml_nodeset (60)}
[1] <h3>Charles Howell III</h3>\n
[2] <h3>Brian Harman</h3>\n
[3] <h3>Austin Cook</h3>\n
[4] <h3>J.J. Spaun</h3>\n
[5] <h3>Webb Simpson</h3>\n
[6] <h3>Cameron Champ</h3>\n
[7] <h3>Peter Uihlein</h3>\n
[8] <h3>Seung-Jae Im</h3>\n
[9] <h3>Nick Watney</h3>\n
[10] <h3>Graeme McDowell</h3>\n
[11] <h3>Zach Johnson</h3>\n
[12] <h3>Lucas Glover</h3>\n
[13] <h3>Corey Conners</h3>\n
[14] <h3>Luke List</h3>\n
[15] <h3>David Hearn</h3>\n
[16] <h3>Adam Schenk</h3>\n
[17] <h3>Kevin Kisner</h3>\n
[18] <h3>Brian Gay</h3>\n
[19] <h3>Patton Kizzire</h3>\n
[20] <h3>Brice Garnett</h3>\n
...
I believe phantomjs can do this as well, but I find that puppeteer makes it easier to scrape JavaScript-rendered webpages. Also keep in mind that phantomjs is no longer being developed.
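As for the error in the question: `ReferenceError: Can't find variable: Set` means the JavaScript engine running the page has no `Set` global, which the site's bundle relies on. Saving or inlining the external script would therefore probably not help. A minimal sketch of a check for such globals follows. `missingGlobals` is a hypothetical helper name and the list of names is only an example. It is written in ES5 syntax so that it could also be pasted into an older engine.

```javascript
// Hypothetical helper: report which of the given global names are undefined in `scope`.
function missingGlobals(scope, names) {
  return names.filter(function (name) {
    return typeof scope[name] === "undefined";
  });
}

// In a current Node.js all three globals exist, so this prints [].
console.log(missingGlobals(globalThis, ["Set", "Map", "Promise"]));
```

In an engine without `Set`, the same call (passing that engine's global object, for example `window`) would list `"Set"` in its result.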