Scraping a dictionary website with puppeteer

Question

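I'm trying to scrape a dictionary website (this one: "http://rjecnik.hr/") that lists all the words under every letter. I managed to get part of it working: I can iterate through the pages, but I can't work out how to iterate through each letter and then save that information to a file. I've searched all over the internet and can't find a solution to my problem. I should add that I'm a beginner at programming and still learning, so there may be a simple solution that I'm not seeing. Here is the code. I didn't write it myself, but I understand what each part does.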

// Browser
const puppeteer = require('puppeteer');

// This function relates to puppeteer
(async () => {
    // Extract the words from the page; use recursion to check the following pages.
    const izvuciRijeci = async (url) => 
    {
        // Izvlačenje (Scraping) podataka koje želimo. // Scraping data we want.
        const page = await browser.newPage()
        await page.goto(url)
        //console.log(`Scraping: ${url}`); // Debugging
        const rijeciNaStranici = await page.evaluate(() => Array.from(document.querySelectorAll('.word')).map((rijeci) => rijeci.innerText.trim())); // Getting the words from a page.
        await page.close();

        // Provjera iduće stranice pomoću rekurzije. // Checking the next page using recursion.
        if (rijeciNaStranici.length < 1) 
        {
            // Prekidanje ako nema riječi. // Stop if no more words.
            //console.log (`Terminate recursion on: ${url}`) // Debugging
            return rijeciNaStranici
        }
        else 
        {
        // Dohvati iduću stranicu načinom "?page=X+1". // Get next page using "?page=X+1".
        const  nextPageNumber = parseInt(url.match(/page=(\d+)$/)[1], 10) + 1;
        const nextUrl = `http://rjecnik.hr/?letter=a&page=${nextPageNumber}`;
        
        return rijeciNaStranici.concat(await izvuciRijeci(nextUrl))
        }
    }

    const browser = await puppeteer.launch();
    const url = "http://rjecnik.hr/?letter=a&page=1";
    const rijec = await izvuciRijeci(url);

    // Todo: Update the database with the words
    console.log(rijec);

// Spremanje u datoteku. // Save to file.
const content = rijec.toString();

var fs = require('fs');

fs.writeFile("rijeci.txt", content, function (err){
    if (err) {
        console.log(err);
    } else {
        console.log("File saved");
    }
});

    await browser.close();
})();

Answer 1

If you find this solution useful and helpful, please select it as the correct answer.

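First, you don't need to open and close a page every time you load a new URL. When the browser launches it already has a page open, and you can simply reuse that one.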

// const page = await browser.newPage() // <= opening a new page for every URL is inefficient
// await page.close()                   // <= closing it every time is unnecessary and heavy
                                        // == You can use this instead:
const page = (await browser.pages())[0] // <= reuses the page opened at launch; much lighter
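
If you want to guard against the case where no page is open, a small helper along these lines might do. It is only a sketch: `getReusablePage` is a name I made up, while `browser.pages()` and `browser.newPage()` are the regular Puppeteer calls.

```javascript
// Return the page the browser opened at launch, or open one if none exists.
// `getReusablePage` is a hypothetical helper name, not part of Puppeteer.
async function getReusablePage(browser) {
    const pages = await browser.pages() // Resolves to an array of all open pages
    return pages.length > 0 ? pages[0] : browser.newPage()
}
```

You would call it as `const page = await getReusablePage(browser)` wherever the script currently takes `(await browser.pages())[0]`.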

Then you need to list all the available letters in an array:

const getLettersArray = async (url) => {
    const page = (await browser.pages())[0] // Use the first page already opened, to keep it light
    await page.goto(url)
    return await page.evaluate(() => Array.from(document.querySelectorAll('.alphabet ul > li')).map(element => element.innerText))
}

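Then, to determine the selected (active) letter, you can check the URL with a regular expression like the one below. (Note: because the dictionary uses some letters that are not on an English QWERTY keyboard, I used the quantifier {1,6} in the pattern.)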

const letterInUse = url.match(/letter=(.{1,6})&page=(\d+)$/)[1] // Get the letter used in the page
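
To see what that regular expression captures, here is a small standalone sketch. `parseLetterAndPage` is a hypothetical helper name, and the percent-encoded URL is only my guess at how a letter such as "č" might appear in the address (six characters, which would fit the {1,6} range).

```javascript
// Extract the active letter and the page number from a dictionary URL.
// Returns null when the URL does not end in "letter=...&page=N".
function parseLetterAndPage(url) {
    const match = url.match(/letter=(.{1,6})&page=(\d+)$/)
    if (!match) return null
    return { letter: match[1], page: parseInt(match[2], 10) }
}

console.log(parseLetterAndPage('http://rjecnik.hr/?letter=a&page=1'))      // { letter: 'a', page: 1 }
console.log(parseLetterAndPage('http://rjecnik.hr/?letter=%C4%8D&page=3')) // { letter: '%C4%8D', page: 3 }
```

Checking for `null` before reading the groups avoids a TypeError if the script is ever started from a URL without both parameters.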

I added a few more methods, so you can run the complete working script below:

// Browser
const puppeteer = require('puppeteer')
const fs = require('fs')

// This function relates to puppeteer
;(async () => {
    const getLettersArray = async (url) => {
        const page = (await browser.pages())[0] // Use the first page already opened, to keep it light
        await page.goto(url)
        return await page.evaluate(() => Array.from(document.querySelectorAll('.alphabet ul > li')).map(element => element.innerText))
    }
    // Extract the words from the page; use recursion to check the following pages.
    const izvuciRijeci = async (url, allLetters) => {
        // Izvlačenje (Scraping) podataka koje želimo. // Scraping data we want.
        const page = (await browser.pages())[0] // Use the first page already opened, to keep it light
        await page.goto(url)
        //console.log(`Scraping: ${url}`); // Debugging
        const rijeciNaStranici = await page.evaluate(() => Array.from(document.querySelectorAll('.word')).map((rijeci) => rijeci.innerText.trim())) // Getting the words from a page.
        // await page.close() // Don't close page when it can be reused for efficiency and effectivity

        // Provjera iduće stranice pomoću rekurzije. // Checking the next page using recursion.
        if (rijeciNaStranici.length < 1) {
            // Prekidanje ako nema riječi. // Stop if no more words.
            // console.log (`Terminate recursion on: ${url}`) // Debugging
            return rijeciNaStranici
        } else {
            // Dohvati iduću stranicu načinom "?page=X+1". // Get next page using "?page=X+1".
            const nextPageNumber = parseInt(url.match(/page=(\d+)$/)[1], 10) + 1
            const letterInUse = url.match(/letter=(.{1,6})&page=(\d+)$/)[1] // Get the letter used in the page
            const nextUrl = `http://rjecnik.hr/?letter=${letterInUse}&page=${nextPageNumber}`
            const nextPageArray = await izvuciRijeci(nextUrl, allLetters)
            if (nextPageArray.length) {
                return rijeciNaStranici.concat(nextPageArray)
            }
            // No more pages for this letter: move on to the next letter, if there is one.
            const letterIndex = allLetters.findIndex(value => value.toUpperCase() === letterInUse.toUpperCase())
            const nextLetter = letterIndex === -1 ? undefined : allLetters[letterIndex + 1]
            if (nextLetter === undefined) {
                // Last (or unrecognized) letter: keep this page's words and stop.
                return rijeciNaStranici
            }
            const nextLetterUrl = `http://rjecnik.hr/?letter=${nextLetter}&page=1`
            return rijeciNaStranici.concat(await izvuciRijeci(nextLetterUrl, allLetters))
        }
    }

    const browser = await puppeteer.launch({headless: true})
    const url = "http://rjecnik.hr/?letter=a&page=1"
    const allLetters = await getLettersArray(url)
    const rijec = await izvuciRijeci(url, allLetters)

    // Todo: Update the database with the words
    console.log(rijec)

    // Spremanje u datoteku. // Save to file.
    const content = rijec.toString()


    fs.writeFile('rijeci.txt', content, function (error) {
        if (error) {
            console.log(error)
        } else {
            console.log('File saved')
        }
    });

    await browser.close()
})()
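
If the recursion is hard to follow, the same traversal can be written as two nested loops: one over the letters and one over the pages. This is only a sketch under my own assumptions: `scrapeAllLetters` and the `fetchWords(letter, page)` callback are hypothetical names, and the callback is expected to resolve to the array of words on that page (an empty array meaning the letter has no more pages).

```javascript
// Collect the words for every letter, page by page, without recursion.
// `fetchWords(letter, page)` is a hypothetical callback that resolves to the
// words found on that page; an empty array ends the current letter.
async function scrapeAllLetters(letters, fetchWords) {
    const allWords = []
    for (const letter of letters) {
        for (let page = 1; ; page++) {
            const words = await fetchWords(letter, page)
            if (words.length === 0) break // No more pages for this letter
            allWords.push(...words)
        }
    }
    return allWords
}

// Possible use with Puppeteer (assumes `page` is an open Puppeteer page
// and `allLetters` comes from getLettersArray):
// const fetchWords = async (letter, pageNumber) => {
//     await page.goto(`http://rjecnik.hr/?letter=${encodeURIComponent(letter)}&page=${pageNumber}`)
//     return page.evaluate(() => Array.from(document.querySelectorAll('.word')).map(el => el.innerText.trim()))
// }
// const rijec = await scrapeAllLetters(allLetters, fetchWords)
```

Keeping the page fetching in a callback means the loop itself can be checked with fake data before pointing it at the real site.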


javascript

dom

web-scraping

puppeteer