Using Artoo.js with Google Puppeteer for Web Scraping
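I can't seem to get Artoo.js working with Puppeteer.
I tried installing it with npm install artoo-js, but that didn't work.
I also tried injecting the built distribution file with the Puppeteer command page.injectFile(filePath), with no luck.
Has anyone managed to get these two libraries working together?
If so, I'd love a code snippet showing how Artoo.js was injected.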
I was just trying out Puppeteer and figured I might as well try Artoo too, so here you go :)
(Step 0: install Yarn if you don't have it)
yarn init
yarn add puppeteer
# Download the latest artoo script; it is not added as a yarn dependency because it runs in the page, not in the Node.js runtime
wget https://medialab.github.io/artoo/public/dist/artoo-latest.min.js
Save the following as index.js:
const puppeteer = require('puppeteer');

(async() => {
  const url = 'https://news.ycombinator.com/';
  const browser = await puppeteer.launch();
  const page = await browser.newPage();

  // Go to URL and wait for page to load
  await page.goto(url, {waitUntil: 'networkidle'});

  // Inject Artoo into page's JS context
  await page.injectFile('artoo-latest.min.js');

  // Sleeping 2s to let Artoo initialize (I don't have a more elegant solution right now)
  await new Promise(res => setTimeout(res, 2000));

  // Use Artoo from page's JS context
  const result = await page.evaluate(() => {
    return artoo.scrape('td.title:nth-child(3)', {
      title: {sel: 'a'},
      url: {sel: 'a', attr: 'href'}
    });
  });

  console.log(`Result has ${result.length} items, first one is:`, result[0]);

  // Wait for the browser to shut down before the script exits
  await browser.close();
})();
Result:
$ node index.js
Result has 30 items, first one is: { title: 'Headless mode in Firefoxdeveloper.mozilla.org',
url: 'https://developer.mozilla.org/en-US/Firefox/Headless_mode' }
This was too funny to pass up: the top article on HackerNews right now is about Firefox Headless...
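A note for anyone on a newer Puppeteer: the script above uses page.injectFile and waitUntil: 'networkidle', which later releases replaced with page.addScriptTag and 'networkidle0' / 'networkidle2'. Below is a rough sketch of how the same idea might look there, not a tested drop-in. The helper name scrapeWithArtoo, the 10 s timeout, and the readiness check are my own assumptions: I am guessing Artoo is usable once artoo.$ holds its jQuery function, so adjust that condition if your build behaves differently.

```javascript
// Sketch for newer Puppeteer releases; artoo-latest.min.js is assumed to be in the working directory.

// Hypothetical helper: loads `url`, injects Artoo from `artooPath`, and scrapes the titles.
async function scrapeWithArtoo(page, url, artooPath) {
  // 'networkidle2' resolves once at most two network connections remain open
  await page.goto(url, {waitUntil: 'networkidle2'});

  // addScriptTag takes over the role of injectFile
  await page.addScriptTag({path: artooPath});

  // Assumption: Artoo is usable once artoo.$ holds its jQuery function.
  // Polling for that replaces the fixed 2 s sleep.
  await page.waitForFunction(
    () => typeof window.artoo !== 'undefined' && typeof window.artoo.$ === 'function',
    {timeout: 10000}
  );

  // Same scraper as in the original script, run inside the page's JS context
  return page.evaluate(() => {
    return artoo.scrape('td.title:nth-child(3)', {
      title: {sel: 'a'},
      url: {sel: 'a', attr: 'href'}
    });
  });
}

async function main() {
  // Required lazily so the helper above can be reused without launching a browser
  const puppeteer = require('puppeteer');
  const browser = await puppeteer.launch();
  try {
    const page = await browser.newPage();
    const result = await scrapeWithArtoo(
      page, 'https://news.ycombinator.com/', 'artoo-latest.min.js');
    console.log(`Result has ${result.length} items, first one is:`, result[0]);
  } finally {
    // Close the browser even if scraping throws
    await browser.close();
  }
}
```

To run it, add a call such as main().catch(console.error) at the bottom of index.js. Keeping the page logic in its own function means it only depends on the page object it is handed, so it can be exercised with a stub page as well as a real one.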