Simple HTML DOM always loads the default first page instead of the specified URL

Question

I want to scrape a few web pages. I am using PHP and the Simple HTML DOM parser. For example, I am trying to scrape this site: https://www.autotrader.co.uk/motorhomes/motorhome-dealers/bc-motorhomes-ayr-dpp-10004733?channel=motorhomes&page=5

I load the URL like this:

$html = new simple_html_dom();
$html->load_file($url);

This loads the correct page. I then find the next-page link, which in this case is: https://www.autotrader.co.uk/motorhomes/motorhome-dealers/bc-motorhomes-ayr-dpp-10004733?channel=motorhomes&page=6

Only the page value changes, from 5 to 6. The snippet that gets the next link is:

function getNextLink($_htmlTemp)
{
    //Getting the next page links
    $aNext = $_htmlTemp->find('a.next', 0);
    $nextLink = $aNext->href;    
    return $nextLink;
}

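The method above returns the correct link, with a page value of 6. But when I try to load that next link, it fetches the default first page, as if the page query parameter were missing from the URL.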

//After loop we will have details of all the listing in this page -- so get next page link
    $nxtLink = getNextLink($originalHtml);  //Returns string url
    if(!empty($nxtLink))
    {
        //Yay, we have the next link -- load the next link        
        print 'Next Url: '.$nxtLink.'<br>'; //$nxtLink has correct value
        $originalHtml->load_file($nxtLink); //This line fetches default page
    }

The whole flow looks like this:

 $html->load_file($url);


//Whole thing in a do-while loop
$originalHtml = $html;
$shouldLoop = true;
//Main Array
$value = array();
do{
    $listings = $originalHtml->find('div.searchResult');    
    foreach($listings as $item)
    {
        //Some logic here
    }


    //After loop we will have details of all the listing in this page -- so get next page link
    $nxtLink = getNextLink($originalHtml);  //Returns string url
    if(!empty($nxtLink))
    {
        //Yay, we have the next link -- load the next link        
        print 'Next Url: '.$nxtLink.'<br>';
        $originalHtml->load_file($nxtLink);
    }
    else
    {
        //No next link -- stop the loop as we have covered all the pages
        $shouldLoop = false;
    }

} while($shouldLoop);

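I have tried encoding the whole URL and encoding only the query parameters, but the result is the same. I also tried creating a new instance of simple_html_dom and then loading the file, with no luck. Please help.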

Answer 1

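You need to html_entity_decode those links. I can see that they are being mangled by simple-html-dom: the href comes back with its HTML entities still encoded (&amp; instead of &), which looks correct when printed in a browser but is not the URL you want to request.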

$url = 'https://www.autotrader.co.uk/motorhomes/motorhome-dealers/bc-motorhomes-ayr-dpp-10004733?channel=motorhomes';
$html = str_get_html(file_get_contents($url));

while($a = $html->find('a.next', 0)){
  $url = html_entity_decode($a->href);
  echo $url . "\n";
  $html = str_get_html(file_get_contents($url));
}
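To see why the decode step matters, here is a minimal sketch that needs no parser at all. It assumes the raw href in the page source separates its parameters with &amp; (the $rawHref value below is a hypothetical illustration of that, not output captured from the site). Parsing the query string before and after decoding shows that the page parameter only exists once the entities are decoded:

```php
<?php
// Hypothetical raw href, assuming the parser returns the attribute value
// with its HTML entities still encoded.
$rawHref = 'https://www.autotrader.co.uk/motorhomes/motorhome-dealers/bc-motorhomes-ayr-dpp-10004733?channel=motorhomes&amp;page=6';

// Parse the query string of the raw, still-encoded link.
parse_str(parse_url($rawHref, PHP_URL_QUERY), $rawParams);

// Decode the entities first, then parse the query string again.
$decodedHref = html_entity_decode($rawHref);
parse_str(parse_url($decodedHref, PHP_URL_QUERY), $decodedParams);

// The raw link has no "page" key, so a server would fall back to its default page.
var_dump(isset($rawParams['page']));     // bool(false)

// The decoded link carries the page number as expected.
var_dump($decodedParams['page']);        // string(1) "6"
```

This would also explain why the printed "Next Url" looked right in the question: a browser renders &amp; as &, so the encoded link is indistinguishable on screen from the decoded one.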
