File get content or cURL getting 404 page instead of main string

I'm trying to fetch a string from a website, but I'm getting the external site's 404 page instead of the index page string.

I've tried both cURL and file_get_contents; both return the external site's 404 page instead of returning the index page as a string.

$homepage = file_get_contents("https://www.creditkarma.ca");
echo $homepage;
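
For what it's worth, a bare file_get_contents call sends no User-Agent header at all; if that missing header is what triggers the 404, a stream context can supply one (a sketch, assuming the missing header is the cause; not verified against this site):

$context = stream_context_create([
    'http' => [
        // browser-like User-Agent; many sites reject requests without one
        // (the 'http' options also apply to https:// URLs)
        'header' => "User-Agent: Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)\r\n",
    ],
]);
$homepage = file_get_contents("https://www.creditkarma.ca", false, $context);
echo $homepage;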

cURL:

$agent = 'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.0.3705; .NET CLR 1.1.4322)';

function file_get_contents_curl($url, $agent) {
    $ch = curl_init();

    curl_setopt($ch, CURLOPT_AUTOREFERER, TRUE);
    curl_setopt($ch, CURLOPT_HEADER, 0);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, TRUE);
    curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, FALSE);
    // $agent must be passed in as a parameter: PHP functions do not see
    // variables from the enclosing scope, so referencing the outer $agent
    // here would send an empty User-Agent header.
    curl_setopt($ch, CURLOPT_USERAGENT, $agent);
    curl_setopt($ch, CURLOPT_VERBOSE, true);

    $data = curl_exec($ch);
    curl_close($ch);

    return $data;
}
$homepage = file_get_contents_curl("https://www.creditkarma.ca", $agent);
echo $homepage;

The code should return the index page as a string, but it returns the external site's 404 page instead. How can I fix this? I need the index page as a string.

Note: the 404 it returns comes from the external site, not from my .htaccess.

With cURL, if you want to retrieve a page's HTML you should send headers. As a security precaution, many sites will reject traffic (or serve a 404) if the browser information is not apparent. So when I do this, I try to "mimic" the request as if it were coming from a browser. Something like this should fit the bill; as the code above shows, the "agent" value was never actually being applied:

$url="https://www.creditkarma.ca";
$agent= 'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.0.3705; .NET CLR 1.1.4322)';

$ch = curl_init();
curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, false);
curl_setopt($ch, CURLOPT_VERBOSE, true);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_USERAGENT, $agent);
curl_setopt($ch, CURLOPT_URL,$url);
$result=curl_exec($ch);
var_dump($result);
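
If you want to confirm where the 404 actually comes from, curl_getinfo can report the HTTP status of the last transfer before the handle is closed (a small sketch reusing the same $url and $agent as above; the check itself is an addition for illustration):

$ch = curl_init($url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_USERAGENT, $agent);
$result = curl_exec($ch);
// CURLINFO_HTTP_CODE holds the status code of the last response (200, 404, ...)
$status = curl_getinfo($ch, CURLINFO_HTTP_CODE);
curl_close($ch);
echo $status === 200 ? "index page retrieved\n" : "server answered HTTP $status\n";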

Update

I've tested this as a "standalone" PHP script and got the following results:

*   Trying 104.100.143.79:443...
* TCP_NODELAY set
* Connected to www.creditkarma.ca (104.100.143.79) port 443 (#0)
* ALPN, offering h2
* ALPN, offering http/1.1
* successfully set certificate verify locations:
*   CAfile: /etc/ssl/certs/ca-certificates.crt
  CApath: /etc/ssl/certs
* SSL connection using TLSv1.3 / TLS_AES_256_GCM_SHA384
* ALPN, server accepted to use http/1.1
* Server certificate:
*  subject: businessCategory=Private Organization; jurisdictionC=US; jurisdictionST=Delaware; serialNumber=4313894; C=US; ST=California; L=San Francisco; O=Credit Karma Inc.; CN=www.creditkarma.ca
*  start date: Mar 16 00:00:00 2020 GMT
*  expire date: Mar 21 12:00:00 2022 GMT
*  subjectAltName: host "www.creditkarma.ca" matched cert's "www.creditkarma.ca"
*  issuer: C=US; O=DigiCert Inc; OU=www.digicert.com; CN=DigiCert SHA2 Extended Validation Server CA
*  SSL certificate verify ok.
> GET / HTTP/1.1
Host: www.creditkarma.ca
User-Agent: Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.0.3705; .NET CLR 1.1.4322)
Accept: */*

* old SSL session ID is stale, removing
* Mark bundle as not supporting multiuse
< HTTP/1.1 200 OK
< Content-Type: text/html; charset=utf-8
< x-content-security-policy:
< Server: CK-FG-server
< Strict-Transport-Security: max-age=31536000; includeSubdomains; preload
< X-Frame-Options: SAMEORIGIN
< X-XSS-Protection: 1; mode=block
< ORIGIN-ENV: production
< ORIGIN-DC: us-east4
< Expires: Wed, 12 Jan 2022 18:20:46 GMT
< Cache-Control: max-age=0, no-cache, no-store
< Pragma: no-cache
< Date: Wed, 12 Jan 2022 18:20:46 GMT
< Transfer-Encoding:  chunked
< Connection: keep-alive
< Connection: Transfer-Encoding
< Set-Cookie: ck_cabf=IjA5MTRmMDQ2LTE3OTAtNDQ5MC1hODA3LWUzZTRlZDcwYTdlYSI=; Max-Age=31536000; Expires=Thu, 12 Jan 2023 18:20:46 GMT; Secure; SameSite=Strict; Path=/
< Set-Cookie: ck_crumb=6da1442eb87cee1a6c0c08c56a9b07826949e3dc130925b0fcb774a83d566b71f5a9b634c4e4f198ae8dc4a6722abf41; Secure; HttpOnly; SameSite=Strict; Path=/
< Set-Cookie: ck_trace_id=5544f4ea-9d03-462b-ab5f-8a81c70c6c81; HttpOnly; SameSite=Strict; Path=/
< Set-Cookie: ck_lang=en; SameSite=Strict; Path=/
<
* Connection #0 to host www.creditkarma.ca left intact
string(63139) "<!DOCTYPE html>
<html>
    <head>
 ..... Rest of page here
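
One side note: the verbose dialogue above is written to stderr, which is easy to read from the CLI but invisible under a web server; CURLOPT_STDERR can redirect it into a stream you can inspect afterwards (a sketch, assuming the same target URL):

$verbose = fopen('php://temp', 'w+');
$ch = curl_init('https://www.creditkarma.ca');
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_VERBOSE, true);
curl_setopt($ch, CURLOPT_STDERR, $verbose); // the "* ... / > ... / < ..." lines go here
curl_exec($ch);
curl_close($ch);
rewind($verbose);
echo stream_get_contents($verbose);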