Scraping more than one page

Question

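I'm trying to extract data (name, varietal, format, and price) from this site: https://aabalat.com/wine/country/france. I created an array called $urls and pushed each link into it. Each new curl session returns 20 new wine entries. I first need to capture the format and push it into an array, as you can see in the code below. When I print $french_wines_formats_matches, it works. But when I print $french_wines_format_array, the output is not what I expect.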

I'm new to data scraping and don't have much experience with it.

    // Array contains 197 links
    $urls = array();
    array_push($urls, "https://aabalat.com/wine/country/france");

    // This for loop builds the other links
    for ($i = 1; $i < 5; $i++)
    {
        $urls[] = "https://aabalat.com/wine/country/france?page=" . $i;
    }

    // echo "<pre>";
    // print_r($urls);
    // echo "</pre>";

    $french_wines_array = array();
    $french_wines_title_array = array();
    $french_wines_varietal_array = array();
    $french_wines_format_array = array();
    $french_wines_price_array = array();

    // Run one curl session for each url in the list.
    foreach ($urls as $url)
    {
        $curl = curl_init();
        curl_setopt($curl, CURLOPT_URL, $url);

        curl_setopt($curl, CURLOPT_RETURNTRANSFER, true);
        curl_setopt($curl, CURLOPT_FOLLOWLOCATION, true);
        curl_setopt($curl, CURLOPT_VERBOSE, true);

        $output = curl_exec($curl);
        $info = curl_getinfo($curl);
        $err = curl_error($curl);
        $ern = curl_errno($curl);

        $french_wine_formats_pattern = '!<span class="wine-list-item-format">(.*)</span>!mi';
        preg_match_all($french_wine_formats_pattern, $output, $french_wines_formats_matches);

        foreach ($french_wines_formats_matches[0] as $french_wines_formats_match)
        {
            $french_wines_format_array[] = $french_wines_formats_match;
        }

        echo "<pre>";
        print_r($french_wines_format_array);
        echo "</pre>";

        curl_close($curl);
        sleep(rand(2, 5));
    }

Answer 1

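Your code and your regular expression seem to work (I tried them). I couldn't reproduce your cURL call, though. Try the following instead of just $output = curl_exec($curl), and see whether it reveals any cURL errors: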

    // Use the same handle ($curl) that was passed to curl_exec()
    if (!$output = curl_exec($curl)) {
        if (curl_error($curl)) {
            die(curl_error($curl));
        }
    }

Finally, I tried a simple file_get_contents() and it seems to work:

    $url = "https://aabalat.com/wine/country/france";
    $output = file_get_contents($url);
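
If the fetch itself succeeds, the unexpected output from printing $french_wines_format_array may come from two details of the loop. This is an assumption, since the question does not show the actual output. First, index 0 of the preg_match_all() result holds the full matches including the <span> tags, while index 1 holds only the captured text. Second, print_r() is called inside the loop, so the growing array is printed again on every page. The sketch below illustrates both points on hypothetical sample markup (only the class name is taken from the question; the real site's HTML may differ), and it swaps in a non-greedy (.*?) so that several spans on one line are not merged into a single match.

```php
<?php
// Hypothetical sample markup standing in for two fetched pages.
// Only the class name comes from the question; the values are made up.
$pages = [
    '<span class="wine-list-item-format">750 ml</span>' . "\n" .
    '<span class="wine-list-item-format">1.5 L</span>',
    '<span class="wine-list-item-format">375 ml</span>',
];

// Non-greedy (.*?) stops at the first closing </span> after each opening tag.
$pattern = '!<span class="wine-list-item-format">(.*?)</span>!i';

$formats = [];
foreach ($pages as $html) {
    preg_match_all($pattern, $html, $matches);

    // $matches[0] holds the full matches, tags included;
    // $matches[1] holds only the text captured by the parentheses.
    foreach ($matches[1] as $format) {
        $formats[] = trim($format);
    }
}

// Print once, after the loop, so the accumulated array appears a single time.
print_r($formats);
```

In the question's code, $pages would be replaced by the $output of each request, and the print_r() call would move below the closing brace of the foreach over $urls.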
