Unable to grab content traversing multiple pages

Question

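I've written a script in PHP to scrape titles and their links from a webpage. The webpage spreads its content across multiple pages. My script below can parse the titles and links from its landing page.

How can I modify my existing script to get data from multiple pages (up to 10)?

This is my attempt so far: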

<?php
include "simple_html_dom.php";
$link = "https://whosebug.com/questions/tagged/web-scraping?page=2";
function get_content($url)
{
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_BINARYTRANSFER, 1);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
    $htmlContent = curl_exec($ch);
    curl_close($ch);
    $dom = new simple_html_dom();
    $dom->load($htmlContent);
    foreach($dom->find('.question-summary') as $file){
        $itemTitle = $file->find('.question-hyperlink', 0)->innertext;
        $itemLink = $file->find('.question-hyperlink', 0)->href;
        echo "{$itemTitle},{$itemLink}<br>";
    }
}
get_content($link);
?>

The site increments its pages like ?page=2, ?page=3, etc.

Answer 1

Here is how I would do it with XPath:

$url = 'https://whosebug.com/questions/tagged/web-scraping';

// loadUrlSource() is not defined in this answer; it is assumed to return the
// page HTML as a string (for example via the cURL calls from the question).
$source = loadUrlSource($url);

// Keep libxml from printing warnings on real-world HTML.
libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML($source);
$xpath = new DOMXPath($dom);

// Find page links whose title contains "go to page" inside the div with the "pager" class.
$pageItems = $xpath->query("//div[contains(@class, 'pager')]//a[contains(@title, 'go to page')]");

// Get the last page number.
// The final matching link is "next", which also has a "go to page" title,
// so the last numbered page is the second-to-last item (length - 2).
$pageCount = (int)$pageItems->item($pageItems->length - 2)->textContent;

// Loop over every page, including the last one.
for ($page = 1; $page <= $pageCount; $page++) {

    $source = loadUrlSource($url . "?page={$page}");

    // Do whatever you need with the source. You can also load it into simple_html_dom.
    // $dom = new simple_html_dom();
    // $dom->load($source);

}
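
The page-count step above can be pulled into a function that takes HTML and returns a number, which makes it easy to check without hitting the site. This is only a sketch: the name lastPageNumber and the sample pager markup are made up for illustration, and instead of relying on the "next" link being the last match it takes the largest numeric link text. It assumes the pager markup still matches the XPath query used in this answer.

```php
<?php
// Hypothetical helper: returns the highest page number found in the pager,
// or 1 when no numbered pager links are present.
function lastPageNumber(string $html): int
{
    $dom = new DOMDocument();
    // Collect parser warnings internally instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($dom);
    // Same query as in the answer above.
    $links = $xpath->query("//div[contains(@class, 'pager')]//a[contains(@title, 'go to page')]");

    $last = 1;
    foreach ($links as $link) {
        $text = trim($link->textContent);
        // Skip non-numeric links such as "next".
        if (ctype_digit($text)) {
            $last = max($last, (int) $text);
        }
    }
    return $last;
}

// Made-up pager markup, for illustration only.
$sample = '<div class="pager"><a title="go to page 1">1</a>'
        . '<a title="go to page 2">2</a><a title="go to page 57">57</a>'
        . '<a title="go to page 2"> next</a></div>';
echo lastPageNumber($sample), PHP_EOL; // prints 57
```

If you only want the first 10 pages, as in the question, min(lastPageNumber($source), 10) would give the loop bound.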

Answer 2

This is how I got it working (following Nima's suggestion):

<?php
include "simple_html_dom.php";
$link = "https://whosebug.com/questions/tagged/web-scraping?page=";

// Fetch one listing page and print each question's title and link.
function get_content($url)
{
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
    $htmlContent = curl_exec($ch);
    curl_close($ch);
    $dom = new simple_html_dom();
    $dom->load($htmlContent);
    foreach ($dom->find('.question-summary') as $file) {
        $itemTitle = $file->find('.question-hyperlink', 0)->innertext;
        $itemLink = $file->find('.question-hyperlink', 0)->href;
        echo "{$itemTitle},{$itemLink}<br>";
    }
}

// Append the page number to the base URL for pages 1 through 10.
for ($i = 1; $i <= 10; $i++) {
    get_content($link . $i);
}
?>
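
One possible refinement is to separate fetching from parsing and to stop as soon as a page comes back empty, so the loop does not keep requesting pages that do not exist. The sketch below is an illustration, not the code I actually ran: parseQuestions, scrapePages, the $fetch callable and the example.test URLs are all hypothetical, and it uses PHP's built-in DOMXPath in place of simple_html_dom so it runs without the include. The class names are the ones from the script above.

```php
<?php
// Hypothetical helper: extracts [title, link] pairs from one listing page.
function parseQuestions(string $html): array
{
    if (trim($html) === '') {
        return [];
    }
    $dom = new DOMDocument();
    $previous = libxml_use_internal_errors(true); // silence HTML parser warnings
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($dom);
    // XPath equivalents of the CSS classes .question-summary and .question-hyperlink.
    $summaryQuery = "//*[contains(concat(' ', normalize-space(@class), ' '), ' question-summary ')]";
    $linkQuery = ".//a[contains(concat(' ', normalize-space(@class), ' '), ' question-hyperlink ')]";

    $rows = [];
    foreach ($xpath->query($summaryQuery) as $summary) {
        // First matching link inside this summary, like find(..., 0).
        $a = $xpath->query($linkQuery, $summary)->item(0);
        if ($a !== null) {
            $rows[] = [trim($a->textContent), $a->getAttribute('href')];
        }
    }
    return $rows;
}

// Hypothetical helper: walks pages 1..$maxPages and stops at the first empty page.
// $fetch is any callable that takes a URL and returns HTML (or false on failure).
function scrapePages(string $baseUrl, int $maxPages, callable $fetch): array
{
    $all = [];
    for ($page = 1; $page <= $maxPages; $page++) {
        $html = $fetch($baseUrl . $page);
        $rows = is_string($html) ? parseQuestions($html) : [];
        if ($rows === []) {
            break; // fetch failed or no results on this page
        }
        $all = array_merge($all, $rows);
    }
    return $all;
}

// Stub fetcher with made-up pages so the sketch runs without network access.
$fakePages = [
    'https://example.test/list?page=1' =>
        '<div class="question-summary"><a class="question-hyperlink" href="/q/1">First</a></div>',
    'https://example.test/list?page=2' =>
        '<div class="question-summary"><a class="question-hyperlink" href="/q/2">Second</a></div>',
];
$fetch = function (string $url) use ($fakePages) {
    return $fakePages[$url] ?? '';
};

$rows = scrapePages('https://example.test/list?page=', 10, $fetch);
foreach ($rows as [$title, $link]) {
    echo "{$title},{$link}", PHP_EOL;
}
```

In real use, $fetch would wrap the cURL calls from get_content() and return the result of curl_exec().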

php

curl

simple-html-dom

web-scraping