Puppeteer: Need help extracting the text from h2 and span
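Absolute JS beginner here. I need help extracting text from a DOM that looks like the one below.
The extraction itself can be done with querySelectorAll() or getElementsByTagName(). What I am looking for, though, is to create an object with each h2 element as a key and the spans that follow it as its value. I don't know how to achieve this. Any suggestions would be very helpful.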
<div class ="product-list">
<div class="row column">
<div class="column medium-9 large-10">
<h2 class="product-name">Products List 1</h2>
</div>
</div>
<div class="row">
<span>First Product</span>
</div>
<div class="row">
<span> Second Product</span>
</div>
.
.
.
<div class="row">
<span>
Nth Product
</span>
</div>
<div class="row column">
<div class="column medium-9 large-10">
<h2 class="product-name">Products List 2</h2>
</div>
</div>
<div class="row">
<span>Third Product</span>
</div>
<div class="row">
<span> Fourth Product</span>
</div>
.
.
.
<div class="row">
<span>
Nth Product
</span>
</div>
</div>
From this DOM I need to store the data as:
[
Products List 1 :[First Product,Second Product...Nth Product],
Products List 2 :[Third Product,Fourth Product...Nth Product]
]
JS:
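const products = await page.evaluate(() => {
  const productsArray = [];
  // Headings are selected here but not yet used for grouping.
  const pdName1 = document.querySelectorAll('div.column > h2.product-name');
  const pdName2 = document.querySelectorAll('div.row > span');
  pdName2.forEach(query => {
    productsArray.push(query.innerText);
  });
  // Returns one flat array of every span's text, with no link to the headings.
  return productsArray;
});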
You can try something like this:
import puppeteer from 'puppeteer';
const browser = await puppeteer.launch();
const html = `
<!doctype html>
<html>
<head><meta charset='UTF-8'><title>Test</title></head>
<body>
<div class ="product-list">
<div class="row column">
<div class="column medium-9 large-10">
<h2 class="product-name">Products List 1</h2>
</div>
</div>
<div class="row"><span>First Product</span></div>
<div class="row"><span> Second Product</span></div>
<div class="row"><span>Nth Product</span></div>
<div class="row column">
<div class="column medium-9 large-10">
<h2 class="product-name">Products List 2</h2>
</div>
</div>
<div class="row"><span>Third Product</span></div>
<div class="row"><span> Fourth Product</span></div>
<div class="row"><span>Nth Product</span></div>
</div>
</body>
</html>`;
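try {
  const [page] = await browser.pages();
  await page.goto(`data:text/html,${html}`);
  const data = await page.evaluate(() => {
    // One query for headings and spans; matches come back in document order.
    const elements = document.querySelectorAll('h2, div.row span');
    const list = {};
    let currentKey = null;
    for (const element of elements) {
      if (element.tagName === 'H2') {
        // Each heading starts a new group.
        currentKey = element.innerText.trim();
        list[currentKey] = [];
      } else if (currentKey !== null) {
        // Each span goes into the most recent heading's group;
        // spans that appear before the first heading are skipped.
        list[currentKey].push(element.innerText.trim());
      }
    }
    return list;
  });
  console.log(data);
} catch (err) {
  console.error(err);
} finally {
  await browser.close();
}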
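If it helps to see the grouping step on its own, here is a minimal sketch of the same idea that runs in plain Node without a browser. The helper names `normalizeText` and `groupByHeading` and the `{ tag, text }` record shape are assumptions made for illustration; they are not part of Puppeteer. The sample records only mirror the markup in the question.

```javascript
// Hypothetical helper: collapse runs of whitespace and trim both ends,
// so ' Second Product' and '\n  Nth Product\n' come out clean.
function normalizeText(text) {
  return text.replace(/\s+/g, ' ').trim();
}

// Hypothetical helper: walk a flat, document-ordered list of
// { tag, text } records and file each non-heading record under
// the most recent 'H2' record.
function groupByHeading(records) {
  const grouped = {};
  let currentKey = null;
  for (const { tag, text } of records) {
    const value = normalizeText(text);
    if (tag === 'H2') {
      currentKey = value;
      grouped[currentKey] = [];
    } else if (currentKey !== null) {
      grouped[currentKey].push(value);
    }
    // Records that appear before the first heading are skipped.
  }
  return grouped;
}

// Sample records mirroring the markup in the question.
const records = [
  { tag: 'H2', text: 'Products List 1' },
  { tag: 'SPAN', text: 'First Product' },
  { tag: 'SPAN', text: ' Second Product' },
  { tag: 'H2', text: 'Products List 2' },
  { tag: 'SPAN', text: 'Third Product' },
  { tag: 'SPAN', text: '\n  Nth Product\n' },
];

const result = groupByHeading(records);
console.log(result);
```

Inside `page.evaluate` the records could be built by mapping each matched element to `{ tag: el.tagName, text: el.textContent }`. Note that two headings with identical text would share one key, and the later heading would replace the earlier group.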