Scraping more than one page
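I am trying to extract data (name, varietal, format, and price) from this site: https://aabalat.com/wine/country/france. I created an array named $urls and pushed each link into it. Each new cURL session gives me data for 20 more wines. I need to capture the formats first and push them into an array, as you can see in the code below. When I print $french_wines_formats_matches, it works. But when I print $french_wines_format_array, it does not work well.
I'm new to data scraping and don't have much experience with it.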
// Array contains 197 links
$urls = array();
array_push($urls, "https://aabalat.com/wine/country/france");
// This for loop builds the other links
for ($i = 1; $i < 5; $i++)
{
    $urls[] = "https://aabalat.com/wine/country/france?page=" . $i;
}
// echo "<pre>";
// print_r($urls);
// echo "</pre>";
$french_wines_array = array();
$french_wines_title_array = array();
$french_wines_varietal_array = array();
$french_wines_format_array = array();
$french_wines_price_array = array();
// Run one cURL session per URL.
foreach ($urls as $url)
{
    $curl = curl_init();
    curl_setopt($curl, CURLOPT_URL, $url);
    curl_setopt($curl, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($curl, CURLOPT_FOLLOWLOCATION, true);
    curl_setopt($curl, CURLOPT_VERBOSE, true);
    $output = curl_exec($curl);
    $info = curl_getinfo($curl);
    $err = curl_error($curl);
    $ern = curl_errno($curl);
    $french_wine_formats_pattern = '!<span class="wine-list-item-format">(.*)</span>!mi';
    preg_match_all($french_wine_formats_pattern, $output, $french_wines_formats_matches);
    foreach ($french_wines_formats_matches[0] as $french_wines_formats_match)
    {
        $french_wines_format_array[] = $french_wines_formats_match;
    }
    echo "<pre>";
    print_r($french_wines_format_array);
    echo "</pre>";
    curl_close($curl);
    sleep(rand(2, 5));
}
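Your code and regular expression seem to work (I tried them). I could not reproduce your cURL call, though. Instead of just $output = curl_exec($curl), try the following and see whether it reports any cURL errors: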
if (!$output = curl_exec($curl)) {
    // Use the same handle name as the rest of the script ($curl, not $ch).
    if (curl_error($curl)) {
        die(curl_error($curl));
    }
}
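The same check can be wrapped in a small function so that every page request reports its own failure. This is only a sketch: the name fetch_page and the 10-second connect timeout are assumptions for illustration, not part of the original script.

```php
<?php
// Hypothetical helper: fetch one URL with cURL and surface any cURL error
// as an exception instead of silently returning false.
function fetch_page(string $url): string
{
    $curl = curl_init();
    curl_setopt($curl, CURLOPT_URL, $url);
    curl_setopt($curl, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($curl, CURLOPT_FOLLOWLOCATION, true);
    curl_setopt($curl, CURLOPT_CONNECTTIMEOUT, 10); // assumed value

    $output = curl_exec($curl);
    if ($output === false) {
        // curl_errno() and curl_error() describe why the transfer failed.
        throw new RuntimeException(
            sprintf('cURL error %d: %s', curl_errno($curl), curl_error($curl))
        );
    }

    // The handle is released when it goes out of scope
    // (curl_close() has no effect since PHP 8.0).
    return $output;
}
```

Inside the foreach loop, $output = fetch_page($url); wrapped in a try/catch for RuntimeException would then show which URL failed and why.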
Finally, I tried a simple file_get_contents(), and that seems to work:
$url = "https://aabalat.com/wine/country/france";
$output = file_get_contents($url);
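Once $output holds the page HTML, the extraction step can be tried on its own. The sketch below is a guess at what "does not work well" might mean, not something confirmed in the thread: index 0 of the preg_match_all() result holds the full match including the span tags, while index 1 holds only the captured text, and printing inside the loop repeats the growing array on every page. The helper name extract_formats and the sample HTML strings are made up for illustration.

```php
<?php
// Hypothetical helper: return the text inside each
// <span class="wine-list-item-format"> element of one page.
function extract_formats(string $html): array
{
    // Non-greedy (.*?) stops at the first closing </span>.
    $pattern = '!<span class="wine-list-item-format">(.*?)</span>!i';
    preg_match_all($pattern, $html, $matches);
    // $matches[1] is the captured group; $matches[0] would include the tags.
    return array_map('trim', $matches[1]);
}

// Stand-in pages; in the real script each string would be one fetched $output.
$pages = [
    '<span class="wine-list-item-format">750 ml</span>'
        . '<span class="wine-list-item-format">1.5 L</span>',
    '<span class="wine-list-item-format"> 375 ml </span>',
];

$french_wines_format_array = [];
foreach ($pages as $html) {
    foreach (extract_formats($html) as $format) {
        $french_wines_format_array[] = $format;
    }
}

// Print once, after the loop, so the accumulated array appears a single time.
print_r($french_wines_format_array);
```

If the site's markup differs from the class name used in the question, the pattern would need adjusting; an HTML parser such as DOMDocument is generally more robust than a regular expression for this kind of job.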