How to parse a CSV correctly so that Puppeteer can type strings from CSV rows into a text input on a website?
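I am trying to learn JS/Puppeteer by building a simple web scraper that scrapes book information, for educational purposes. I want the scraper to type UPC numbers from a CSV file into the search bar of a book website. I have managed to get the scraper working when I use a single UPC number.
But I have a CSV with a list of UPCs, and I would like the scraper to:
- read the CSV file,
- take the UPC from the first row,
- search for the UPC on the website,
- scrape the information,
- take the UPC from row 2,
- repeat steps 3 and 4.
Sample CSV: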
DATE,QUANTITY,NAME,CODECONTENT,CODETYPE
2021-10-13 20:16:44 +1100,1,"Book 1","9781250035288",9
2021-10-13 20:16:40 +1100,1,"Book 2","9781847245601",9
2021-10-13 20:16:35 +1100,1,"Book 3","9780007149247",9
2021-10-13 20:16:30 +1100,1,"Book 4","9780749958084",9
2021-10-13 20:16:26 +1100,1,"Book 5","9781405920384",9
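This is my code so far. I am stuck on the async function for the CSV parser: it gives me an undefined result when I run
console.log(allupcs);
Also, I am not sure how to get
await page.type('#book-search-form > div.el-wrap.header-search-el-wrap > input.text-input','9781509847556');
to accept the UPCs from the CSV.
See the code below: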
const puppeteer = require('puppeteer');
const parse = require('csv-parser');
const fs = require('fs');
async function getupcs(){
var upcData=[];
fs.createReadStream('Book_Bulk.csv')
.pipe(parse({delimiter: ':'}))
.on('data', function(csvrow) {
// console.log(+csvrow.CODECONTENT);
//do something with csvrow
upcData.push(+csvrow.CODECONTENT);
})
.on('end',function() {
//do something with csvData
// return upcData;
console.log(upcData);
});
}
async function main(){
// const allupcs = await upcData();
// console.log(allupcs);
const browser = await puppeteer.launch({ headless: false, defaultViewport: null, args: ['--start-maximized']});
const page = await browser.newPage();
await page.goto('https://www.bookdepository.com/');
await page.type('#book-search-form > div.el-wrap.header-search-el-wrap > input.text-input','9781509847556');
await page.click('#book-search-form > div.el-wrap.header-search-el-wrap > button');
//Title
await page.waitForSelector('.item-info h1');
const title = await page.$eval('.item-info h1', h1 => h1.textContent);
//Author
await page.waitForSelector('div.author-info.hidden-md > span > a > span');
const author = await page.$eval('div.author-info.hidden-md > span > a > span', span => span.innerText);
//Genre
await page.waitForSelector('.active a');
const genre = await page.$eval('.active a', a => a.innerText);
//Format
await page.waitForSelector('.item-info li');
const format = await page.$eval('.item-info li', li => li.innerText);
//Publisher
await page.waitForSelector('div.biblio-wrap > div > ul > li:nth-child(4) > span > a > span');
const publisher = await page.$eval('div.biblio-wrap > div > ul > li:nth-child(4) > span > a > span', span => span.innerText);
//Year
await page.waitForSelector('div.biblio-wrap > div > ul > li:nth-child(3) > span');
const year = await page.$eval('div.biblio-wrap > div > ul > li:nth-child(3) > span', span => span.innerText);
const newyear = year.slice(-4)
// Price
try {
await page.waitForSelector('div.price.item-price-wrap.hidden-xs.hidden-sm > span', { timeout: 1000 });
const price = await page.$eval('div.price.item-price-wrap.hidden-xs.hidden-sm > span', span => span.innerText);
var newprice = price.slice(-6);
} catch {
await page.waitForSelector('p.list-price');
const price = await page.$eval('p.list-price', p => p.innerText);
var newprice = price.slice(-6);
} finally {
await page.waitForSelector('div.price.item-price-wrap.hidden-xs.hidden-sm > span.sale-price');
const price = await page.$eval('div.price.item-price-wrap.hidden-xs.hidden-sm > span.sale-price', span => span.innerText);
var newprice = price.slice(-6);
}
console.log(title);
console.log(author);
console.log(genre);
console.log(format);
console.log(publisher);
console.log(newyear);
console.log(newprice);
// return {
// title: title,
// author: author,
// genre: genre,
// format: format,
// publisher: publisher,
// year: newyear,
// price: newprice
// }
}
main();
Update: using the code from the answer:
const puppeteer = require('puppeteer');
const parse = require('csv-parser');
const fs = require('fs');
async function getpageData(page,upc){
await page.goto('https://www.bookdepository.com/');
await page.type('#book-search-form > div.el-wrap.header-search-el-wrap > input.text-input',upc);
await page.click('#book-search-form > div.el-wrap.header-search-el-wrap > button');
//Title
await page.waitForSelector('.item-info h1');
const title = await page.$eval('.item-info h1', h1 => h1.textContent);
//Author
await page.waitForSelector('div.author-info.hidden-md > span > a > span');
const author = await page.$eval('div.author-info.hidden-md > span > a > span', span => span.innerText);
//Genre
await page.waitForSelector('.active a');
const genre = await page.$eval('.active a', a => a.innerText);
//Format
await page.waitForSelector('.item-info li');
const format = await page.$eval('.item-info li', li => li.innerText);
//Publisher
await page.waitForSelector('div.biblio-wrap > div > ul > li:nth-child(4) > span > a > span');
const publisher = await page.$eval('div.biblio-wrap > div > ul > li:nth-child(4) > span > a > span', span => span.innerText);
//Year
await page.waitForSelector('div.biblio-wrap > div > ul > li:nth-child(3) > span');
const year = await page.$eval('div.biblio-wrap > div > ul > li:nth-child(3) > span', span => span.innerText);
const newyear = year.slice(-4)
// Price
try {
await page.waitForSelector('div.price.item-price-wrap.hidden-xs.hidden-sm > span', { timeout: 1000 });
const price = await page.$eval('div.price.item-price-wrap.hidden-xs.hidden-sm > span', span => span.innerText);
var newprice = price.slice(-6);
} catch {
await page.waitForSelector('p.list-price');
const price = await page.$eval('p.list-price', p => p.innerText);
var newprice = price.slice(-6);
} finally {
await page.waitForSelector('div.price.item-price-wrap.hidden-xs.hidden-sm > span.sale-price');
const price = await page.$eval('div.price.item-price-wrap.hidden-xs.hidden-sm > span.sale-price', span => span.innerText);
var newprice = price.slice(-6);
}
// console.log(title);
// console.log(author);
// console.log(genre);
// console.log(format);
// console.log(publisher);
// console.log(newyear);
// console.log(newprice);
return {
title: title,
author: author,
genre: genre,
format: format,
publisher: publisher,
year: newyear,
price: newprice
}
};
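function readCsvAsync(filename, separator=',', encoding='utf-8') {
  return new Promise((resolve, reject) => {
    const rows = [];
    try {
      fs.createReadStream(filename, {encoding: encoding})
        .on('error', reject) // read errors (e.g. a missing file) are not forwarded by pipe()
        // csv-parser's option is named "separator", not "delimiter"
        .pipe(parse({separator: separator}))
        // keep the UPC as a string, because page.type() expects a string
        .on('data', (row) => rows.push(row.CODECONTENT))
        .on('end', () => resolve(rows))
        .on('error', reject);
    } catch (err) {
      reject(err);
    }
  });
}
async function upcData() {
  try {
    // the sample file is comma-separated, so the default separator applies
    const rows = await readCsvAsync('Book_Bulk.csv');
    return rows;
  } catch (err) {
    console.log(err);
    return []; // lets the caller loop over an empty list instead of undefined
  }
}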
async function main(){
  const allupcs = await upcData();
  // console.log(allupcs);
  const browser = await puppeteer.launch({ headless: false, defaultViewport: null, args: ['--start-maximized']});
  const page = await browser.newPage();
  const scrapedData = [];
  // look the UPCs up one at a time, reusing the same page
  for (const upc of allupcs) {
    const data = await getpageData(page, upc);
    scrapedData.push(data);
  }
  console.log(scrapedData);
  await browser.close(); // otherwise the Node process keeps running
}
main();
As you can see, the CSV parser is asynchronous. "Asynchronous" means you cannot do this:
var upcData = [];                       // 1
fs.createReadStream('Book_Bulk.csv')    // 2
  .pipe(parse({delimiter: ':'}))
  .on('data', (csvrow) => {             // 5 6 7 8 9
    upcData.push(+csvrow.CODECONTENT);
  })
  .on('end', function() {               // 10
    console.log(upcData);
  });
console.log(upcData);                   // 3
// call puppeteer or whatever           // 4
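The numbers in the comments show the order of execution. The last console.log() runs immediately after you have set up the read stream, so upcData contains nothing at that point.
It will contain the data at point #10, once #5 and the following steps have filled it.
That means: whatever you want to do with upcData, do it in the 'end' event handler.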
  .on('end', function() {               // 10
    console.log(upcData);
    for (let upc of upcData) {
      // call puppeteer or whatever
    }
  });
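To see this ordering without a CSV file or a browser, here is a minimal sketch that swaps the file stream for an in-memory one. Readable.from is part of Node's built-in stream module; sampleRows and demoOrder are names made up for this illustration, and the rows only mimic what a CSV parser would emit.

```javascript
const { Readable } = require('stream');

// Hypothetical stand-in for parsed CSV rows; nothing is read from disk here.
const sampleRows = [
  { CODECONTENT: '9781250035288' },
  { CODECONTENT: '9781847245601' },
];

// Records how many rows are available right after setup and at 'end'.
function demoOrder() {
  return new Promise((resolve, reject) => {
    const log = [];
    const upcData = [];
    Readable.from(sampleRows)
      .on('data', (row) => upcData.push(row.CODECONTENT))
      .on('end', () => {
        log.push(`end: ${upcData.length}`);
        resolve(log);
      })
      .on('error', reject);
    // This line runs before any 'data' event has fired.
    log.push(`after setup: ${upcData.length}`);
  });
}

demoOrder().then((log) => console.log(log));
// prints [ 'after setup: 0', 'end: 2' ]
```

The array is still empty right after the stream is set up and only holds both rows once 'end' fires, which is why the work belongs in (or after) that handler.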
Since the CSV reader gives you one row per data event, you could also do the work directly in the data event handler instead of building the upcData array at all.
  .on('data', (csvrow) => {             // 5 6 7 8 9
    const upc = +csvrow.CODECONTENT;
    // call puppeteer or whatever
  })
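One caveat with working inside the data handler: if the per-row work is itself asynchronous (as Puppeteer calls are), the stream does not wait for it, so several rows can end up being processed at the same time. A possible way around that, sketched here under the assumption that the parser's output is an ordinary Node Readable stream (those are async-iterable), is a for await loop. fakeLookup and processRowsInOrder are hypothetical names, and Readable.from again stands in for the real CSV stream.

```javascript
const { Readable } = require('stream');

// Hypothetical stand-in for an awaited Puppeteer lookup.
function fakeLookup(upc) {
  return new Promise((resolve) => {
    setTimeout(() => resolve(`looked up ${upc}`), 10);
  });
}

// Consumes the stream one row at a time; the next row is not pulled
// until the awaited work for the current row has finished.
async function processRowsInOrder(rowStream) {
  const results = [];
  for await (const row of rowStream) {
    results.push(await fakeLookup(row.CODECONTENT));
  }
  return results;
}

const rowStream = Readable.from([
  { CODECONTENT: '9781250035288' },
  { CODECONTENT: '9781847245601' },
]);

processRowsInOrder(rowStream).then((results) => console.log(results));
```

With a real file, the stream passed in would be the result of the pipe() call instead of Readable.from.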
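If you want to be able to await the whole thing, you have to turn it into a promise first. In this case, the relevant step (resolving the promise) again happens in the end callback: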
function readCsvAsync(filename, separator=',', encoding='utf-8') {
  return new Promise((resolve, reject) => {
    const rows = [];
    try {
      fs.createReadStream(filename, {encoding: encoding})
        .on('error', reject) // read errors (e.g. a missing file) are not forwarded by pipe()
        // csv-parser's option is named "separator", not "delimiter"
        .pipe(parse({separator: separator}))
        .on('data', (row) => rows.push(row))
        .on('end', () => resolve(rows))
        .on('error', reject);
    } catch (err) {
      reject(err);
    }
  });
}
async function main() {
  try {
    // the sample file is comma-separated, so the default separator applies
    const rows = await readCsvAsync('Book_Bulk.csv');
    // call puppeteer or whatever
  } catch (err) {
    console.log(err);
  }
}
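On the second part of the question (getting page.type() to accept the UPCs): page.type() takes the text to type as a string, while the unary + used in the snippets above turns each code into a number. A small sketch of keeping the codes as strings follows; toUpcStrings is a hypothetical helper, not part of csv-parser or Puppeteer, and the second code is made up to show what happens to leading zeros.

```javascript
// Hypothetical helper: pull the CODECONTENT column out of parsed rows,
// keeping every code as a trimmed string and skipping empty cells.
function toUpcStrings(rows) {
  return rows
    .map((row) => String(row.CODECONTENT ?? '').trim())
    .filter((upc) => upc.length > 0);
}

const parsedRows = [
  { CODECONTENT: '9781250035288' },
  { CODECONTENT: '0012345678905' }, // made-up code with leading zeros
  { CODECONTENT: '' },
];

console.log(toUpcStrings(parsedRows));
// prints [ '9781250035288', '0012345678905' ]

console.log(String(+parsedRows[1].CODECONTENT));
// prints 12345678905 (the leading zeros are lost after numeric conversion)
```

Each string from the resulting array can then be passed as the second argument to page.type() inside the loop.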