Pulling a specific column's content from a website table

Question

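I'm trying to pull all the passwords from the table on https://www.passwordrandom.com/most-popular-passwords. I only want the second td of each row, skipping the first tr. When I run the code, everything in the array comes out empty.

I've tried fiddling with the selector, but I'm not sure what to do. I suspect the arguments may be wrong, but I don't know what they should look like.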

const puppeteer = require('puppeteer')
const fs = require('fs')

const baseURL = 'https://www.passwordrandom.com/most-popular-passwords'

async function scrape() {
    const browser = await puppeteer.launch()

    const page = await browser.newPage()
    console.log('Puppeteer Initialized')

    await page.goto(baseURL)

    const allNodes = await page.evaluate(() => {
        return document.querySelectorAll("#cntContent_lstMain tr:not(:first-child) td:nth-child(2)")
    })

    const allWords = []

    for (let row in allNodes)
        allWords.push(allNodes[row].textContent)

    console.log(allWords)

    await browser.close();
}

scrape()

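Essentially, the result should be an array containing every password in the table. The passwords are held in the second td of each row, except for the first tr (as I said above).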

Answer 1

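The code inside page.evaluate runs in the browser, while the code outside it runs in Node.

When you return the result of document.querySelectorAll, you are returning a NodeList of DOM nodes. The return value has to be serialized to get from the browser to Node, and DOM nodes do not survive that serialization (the data is lost, or the references no longer point at the real elements). As a result, allNodes[row].textContent does not work.

The simplest fix is to extract the data inside page.evaluate and return that instead.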
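To make the serialization point concrete, here is a minimal sketch that runs in plain Node without Puppeteer. It is a simplified model, not Puppeteer's actual transport: `makeFakeNode` is a hypothetical stand-in for a DOM element, the strings are placeholder values, and a JSON round trip stands in for the browser-to-Node boundary.

```javascript
// Hypothetical stand-in for a DOM node. Like a real element, it exposes
// textContent through a getter on its prototype, not as an own property.
function makeFakeNode(text) {
  return Object.create({
    get textContent() { return text }
  })
}

// Placeholder values standing in for the cells the selector would match.
const inBrowser = [makeFakeNode('123456'), makeFakeNode('password')]

// Assumption: a JSON round trip models a value crossing a process boundary.
// Only own enumerable properties are kept, so each node becomes {}.
const nodesInNode = JSON.parse(JSON.stringify(inBrowser))

// Mapping to plain strings BEFORE the boundary keeps the data intact.
const textsInNode = JSON.parse(JSON.stringify(inBrowser.map(n => n.textContent)))

console.log(nodesInNode[0].textContent) // the text did not make it across
console.log(textsInNode)                // the strings did
```

The same reasoning applies to the question's code: what arrives in Node is no longer a list of live elements, so reading textContent from it yields nothing useful.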

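const allNodes = await page.evaluate(() => {
  // Runs in the browser: spread the NodeList into a real array so it can be mapped
  const elements = [...document.querySelectorAll("#cntContent_lstMain tr:not(:first-child) td:nth-child(2)")]
  // Return plain strings, which survive serialization back to Node
  return elements.map(element => element.textContent)
})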

This gives you the textContent of every element that matches the selector.
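As a possible alternative to the page.evaluate snippet, Puppeteer's page.$$eval runs document.querySelectorAll for you and hands the matches to a callback as an array. The sketch below wraps that in a helper. The names `extractColumnText` and `PASSWORD_CELLS` are hypothetical, and the `.trim()` call is an addition of mine that assumes surrounding whitespace in the cells is not wanted.

```javascript
// Selector taken from the question: second cell of every row except the first.
const PASSWORD_CELLS = '#cntContent_lstMain tr:not(:first-child) td:nth-child(2)'

// Hypothetical helper: returns the text of every element matching `selector`.
// `page` is expected to be a Puppeteer Page (anything with a $$eval method).
async function extractColumnText(page, selector) {
  // The callback runs in the browser and receives the matches as an array.
  // It returns plain strings, which can be serialized back to Node.
  return page.$$eval(selector, cells =>
    cells.map(cell => cell.textContent.trim()) // trim() is an assumption, see above
  )
}
```

Inside scrape(), after `await page.goto(baseURL)`, this would be called as `const allWords = await extractColumnText(page, PASSWORD_CELLS)`, which replaces both the page.evaluate call and the loop that builds allWords.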


javascript

selectors-api

puppeteer