Scrolling to the bottom of a div in puppeteer not working

Question

I am trying to scrape all the concerts in the boxed area of the picture below:

https://i.stack.imgur.com/7QIMM.jpg

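The problem is that the list only shows the first 10 options until you scroll to the bottom of that specific div. It then loads more options dynamically, and this repeats until there are no more results. I tried following the linked answer, but I could not get the div to scroll down far enough to show all the 'concerts'.

Here is my basic code: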

const browser = await puppeteerExtra.launch({ args: [                
    '--no-sandbox'                                                  
    ]});

async function functionName() {
    const page = await browser.newPage();
    await preparePageForTests(page);
    page.once('load', () => console.log('Page loaded!'));
    await page.goto(`https://www.google.com/search?q=concerts+near+poughkeepsie&client=safari&rls=en&uact=5&ibp=htl;events&rciv=evn&sa=X&fpstate=tldetail`);   

    const resultList = await page.waitForSelector(".odIJnf"); 
    const scrollableSection = await page.waitForSelector("#Q5Vznb");    //I think this is the div that contains all the concert items.
    const results = await page.$$(".odIJnf");  //this needs to be iterable to be used in the for loop

//this is where I'd like to scroll down the div all the way to the bottom

    for (let i = 0; i < results.length; i++) {
      const result = await (await results[i].getProperty('innerText')).jsonValue();
      console.log(result)
    }
}

Answer 1

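Try this to scroll down the concert list. You can keep looping until the number of results stops increasing, or until you find the concert you are looking for: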

await page.evaluate(()=>{
  document.querySelector("#Q5Vznb").scrollIntoView(false);
});
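
The loop mentioned above could look roughly like the sketch below. It is not tested against the live page: `scrollUntilStable`, `maxRounds` and `pauseMs` are hypothetical names introduced here for illustration, and the fixed pause is only an assumption about how long new results take to load.

```javascript
// Hypothetical helper (not part of Puppeteer): scrolls the container
// repeatedly and stops once the number of matching items no longer grows.
async function scrollUntilStable(page, containerSelector, itemSelector, options = {}) {
  const { maxRounds = 30, pauseMs = 1000 } = options;
  let count = (await page.$$(itemSelector)).length;

  for (let round = 0; round < maxRounds; round++) {
    // Same scroll call as in the snippet above, with the selector passed in.
    await page.evaluate((selector) => {
      const container = document.querySelector(selector);
      if (container) container.scrollIntoView(false);
    }, containerSelector);

    // Assumed delay so the page has time to append new results.
    await new Promise((resolve) => setTimeout(resolve, pauseMs));

    const newCount = (await page.$$(itemSelector)).length;
    if (newCount === count) break; // nothing new appeared, stop scrolling
    count = newCount;
  }
  return count;
}
```

With the selectors from the question it would be called as `await scrollUntilStable(page, "#Q5Vznb", ".odIJnf")` before reading the results with `page.$$`.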

Answer 2

As you mentioned in your question, when you run page.$$ you get back an array of ElementHandle. From Puppeteer's documentation:

ElementHandle represents an in-page DOM element. ElementHandles can be created with the page.$ method.

This means you can iterate over them, but you still have to run evaluate() or $eval() on each element to access the underlying DOM element.
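
As a minimal sketch of that iteration, the helper below reads `innerText` from each handle through `evaluate()`. The name `collectInnerText` is hypothetical and is introduced here only for illustration.

```javascript
// Hypothetical helper (not part of Puppeteer): reads innerText from
// every ElementHandle in the array.
async function collectInnerText(handles) {
  const texts = [];
  for (const handle of handles) {
    // evaluate() runs the callback in the page, with the DOM node
    // behind the handle passed as its first argument.
    texts.push(await handle.evaluate((node) => node.innerText));
  }
  return texts;
}
```

A call such as `const texts = await collectInnerText(await page.$$("li"))` returns plain strings. When only the text is needed, `page.$$eval("li", (nodes) => nodes.map((n) => n.innerText))` should give the same result in a single round trip.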

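I see from your snippet that you are trying to access the parent div that handles the list's scroll event. The problem is that this page appears to use auto-generated classes and ids, which can make your code brittle or stop it from working altogether. It is better to try to access the ul, li and div elements directly.

I created this snippet, which fetches ITEMS concerts from the site: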

const puppeteer = require('puppeteer')

/**
 * Constants
 */
const ITEMS = Number(process.env.ITEMS) || 50
const URL   = process.env.URL     || "https://www.google.com/search?q=concerts+near+poughkeepsie&client=safari&rls=en&uact=5&ibp=htl;events&rciv=evn&sa=X&fpstate=tldetail"

/**
 * Main
 */
main()
  .then( ()    => console.log("Done"))
  .catch((err) => console.error(err))

/**
 * Functions
 */
async function main() {
  const browser = await puppeteer.launch({ args: ["--no-sandbox"] })
  const page = await browser.newPage()
  
  await page.setUserAgent('Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36')
  await page.goto(URL)
 
  const results = await getResults(page)
  console.log(results)
  
  await browser.close()
}

async function getResults(page) {
  await page.waitForSelector("ul")
  const ul  = (await page.$$("ul"))[0]
  const div = (await ul.$x("../../.."))[0]
  const results = []
  
  const recurse = async () => {
    // Recurse exit clause
    if (ITEMS <= results.length) {
      return
    }

    const $lis = await page.$$("li")
    // Slicing this way will avoid duplicating the result. It also has
    // the benefit of not having to handle the refresh interval until
    // new concerts are available.
    const lis = $lis.slice(results.length)
    for (let li of lis) {
      const result = await li.evaluate(node => node.innerText)
      results.push(result)
    }
    // Move the scroll of the parent-parent-parent div to the bottom
    await div.evaluate(node => node.scrollTo(0, node.scrollHeight))
    await recurse()
  }
  // Start the recursive function
  await recurse()
 
  return results
}

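Studying the page structure, we see that the list's ul is nested three divs deep inside the div that handles the scroll. We also know that there are only two uls on the page, and the first one is the one we want. That is what these lines do: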

  const ul  = (await page.$$("ul"))[0]
  const div = (await ul.$x("../../.."))[0]

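The $x function evaluates the XPath expression relative to the element handle as its context node*. It lets us walk up the DOM tree until we reach the div we need. We then run a recursive function until we have collected the number of items we want.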

* Taken from the docs.
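
The same upward walk can also be written without XPath, which may help if `$x` is not available in the Puppeteer version in use. This is only a sketch: `getAncestor` is a hypothetical helper name, and it assumes the scrollable div is exactly three parents above the ul, as described above.

```javascript
// Hypothetical helper (not part of Puppeteer): returns a handle to the
// ancestor `levels` steps above the given element, or null if the walk
// runs past the top of the tree.
async function getAncestor(handle, levels) {
  const ancestor = await handle.evaluateHandle((node, n) => {
    let current = node;
    for (let i = 0; i < n && current; i++) {
      current = current.parentElement;
    }
    return current;
  }, levels);
  // asElement() yields an ElementHandle, or null when the value is not an element.
  return ancestor.asElement();
}
```

In getResults this would replace the $x line with `const div = await getAncestor(ul, 3)`.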


javascript

node.js

web-scraping

infinite-scroll

puppeteer