Scraping website with Goutte hangs until timeout on specific site
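I'm experimenting with Goutte and can't connect to one particular website. Every other URL seems to work fine, and I'm struggling to understand what is preventing the connection. The request just hangs until it times out after 30 seconds. If I remove the timeout, the same thing happens after 150 seconds.
Points to note:
- So far this timeout/hang only happens on tesco.com. asda.com, google.com, etc. work fine and return a result.
- The website loads instantly in a web browser (Chrome), so it is not IP or ISP related.
- If I make a GET request to the same URL in Postman, I get a result back fine.
- It doesn't seem to be related to the user agent.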
<?php
namespace App\Http\Controllers;

use Goutte\Client;
use GuzzleHttp\Client as GuzzleClient;

class ScraperController extends Controller
{
    public function scrape()
    {
        $goutteClient = new Client();
        $goutteClient->setHeader('user-agent', 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.96 Safari/537.36');
        $guzzleClient = new GuzzleClient(array(
            'timeout' => 30,
            'verify' => true,
            'debug' => true,
        ));
        $goutteClient->setClient($guzzleClient);
        $crawler = $goutteClient->request('GET', 'https://www.tesco.com/');
        dump($crawler);
        /*$crawler->filter('.result__title .result__a')->each(function ($node) {
            dump($node->text());
        });*/
    }
}
Here is the debug output, including the error:
* Trying 104.123.91.150:443...
* TCP_NODELAY set
* Connected to www.tesco.com (104.123.91.150) port 443 (#0)
* ALPN, offering http/1.1
* successfully set certificate verify locations:
* CAfile: /etc/ssl/certs/ca-certificates.crt
  CApath: /etc/ssl/certs
* SSL connection using TLSv1.3 / TLS_AES_256_GCM_SHA384
* ALPN, server accepted to use http/1.1
* Server certificate:
* subject: C=GB; L=Welwyn Garden City; jurisdictionC=GB; O=Tesco PLC; businessCategory=Private Organization; serialNumber=00445790; CN=www.tesco.com
* start date: Feb 4 11:09:23 2020 GMT
* expire date: Feb 3 11:39:21 2022 GMT
* subjectAltName: host "www.tesco.com" matched cert's "www.tesco.com"
* issuer: C=US; O=Entrust, Inc.; OU=See www.entrust.net/legal-terms; OU=(c) 2014 Entrust, Inc. - for authorized use only; CN=Entrust Certification Authority - L1M
* SSL certificate verify ok.
> GET / HTTP/1.1
Host: www.tesco.com
user-agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.96 Safari/537.36
* old SSL session ID is stale, removing
* Operation timed out after 30001 milliseconds with 0 bytes received
* Closing connection 0
GuzzleHttp\Exception\ConnectException
cURL error 28: Operation timed out after 30001 milliseconds with 0 bytes received (see https://curl.haxx.se/libcurl/c/libcurl-errors.html)
http://localhost/scrape
Can anyone see why I'm not getting any response?
I managed to solve this by adding more headers:
<?php
namespace App\Http\Controllers;

use Goutte\Client;
use GuzzleHttp\Client as GuzzleClient;

class ScraperController extends Controller
{
    public function scrape()
    {
        $goutteClient = new Client();
        $goutteClient->setHeader('accept', 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9');
        $goutteClient->setHeader('accept-encoding', 'gzip, deflate, br');
        $goutteClient->setHeader('accept-language', 'en-GB,en-US;q=0.9,en;q=0.8');
        $goutteClient->setHeader('upgrade-insecure-requests', '1');
        $goutteClient->setHeader('user-agent', 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.96 Safari/537.36');
        $goutteClient->setHeader('connection', 'keep-alive');
        $guzzleClient = new GuzzleClient(array(
            'timeout' => 5,
            'verify' => true,
            'debug' => true,
            'cookies' => true,
        ));
        $goutteClient->setClient($guzzleClient);
        $crawler = $goutteClient->request('GET', 'https://www.tesco.com/');
        dump($crawler);
        /*$crawler->filter('.result__title .result__a')->each(function ($node) {
            dump($node->text());
        });*/
    }
}
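For anyone who wants to narrow down which header matters, here is a minimal sketch of the same idea using plain Guzzle, without Goutte. The helper names `browserLikeHeaders()` and `fetchStatus()` are hypothetical, not part of either library, and the sketch assumes the Composer autoloader is already loaded (as it is inside a Laravel controller). It also drops `br` from `accept-encoding`, on the assumption that your libcurl build may not be able to decode Brotli responses; add it back if yours can.

```php
<?php

use GuzzleHttp\Client as GuzzleClient;
use GuzzleHttp\Exception\ConnectException;

/**
 * Hypothetical helper: the browser-like headers from the answer above,
 * minus "br" (assumption: libcurl may lack Brotli support).
 */
function browserLikeHeaders(): array
{
    return [
        'accept' => 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
        'accept-encoding' => 'gzip, deflate',
        'accept-language' => 'en-GB,en-US;q=0.9,en;q=0.8',
        'upgrade-insecure-requests' => '1',
        'user-agent' => 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.96 Safari/537.36',
        'connection' => 'keep-alive',
    ];
}

/**
 * Hypothetical helper: send a GET request with the given headers and
 * return the HTTP status code, or null if the connection fails or times out.
 */
function fetchStatus(string $url, array $headers, float $timeout = 5.0): ?int
{
    $client = new GuzzleClient([
        'headers' => $headers,     // default headers for every request
        'timeout' => $timeout,     // seconds before Guzzle gives up
        'cookies' => true,         // keep a shared cookie jar, as in the answer
        'http_errors' => false,    // return 4xx/5xx responses instead of throwing
    ]);

    try {
        return $client->request('GET', $url)->getStatusCode();
    } catch (ConnectException $e) {
        // The cURL error 28 timeout from the question surfaces here.
        return null;
    }
}
```

Calling `fetchStatus('https://www.tesco.com/', browserLikeHeaders())` and then removing one header at a time from the array should show which of them the site actually requires. I have not confirmed which one it is.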