Why does file_get_contents return garbled data?

Question

I'm trying to fetch the HTML of the page below using some simple PHP.

URL: https://kat.cr/usearch/architecture%20category%3Abooks/

My code is:

$html = file_get_contents('https://kat.cr/usearch/architecture%20category%3Abooks/');
echo $html;

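The file_get_contents call works, but it returns scrambled data.

I have tried cURL as well as various functions such as htmlentities(), mb_convert_encoding(), utf8_encode() and so on, but I only get different variations of the scrambled text.

The page source says charset=utf-8, so I'm not sure what the problem is.

Calling file_get_contents() on the base URL, kat.cr, returns the same mess.

What am I missing here?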

Answer 1

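The response is gzip-compressed. When a browser fetches it, the browser decompresses it for you, so in PHP you need to decompress it yourself. To output it directly, you can use readgzfile():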

readgzfile('https://kat.cr/usearch/architecture%20category%3Abooks/');
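If you want the HTML in a variable instead of printing it straight away, one option is to check for the gzip magic bytes and decode only when they are present. This is a minimal sketch, assuming PHP 7+ with the zlib extension; the helper name maybe_gunzip() is made up for illustration, and the sample string stands in for a real response body.

```php
<?php
// Hypothetical helper: maybe_gunzip() is an illustrative name, not from the answers.
function maybe_gunzip(string $body): string
{
    // Every gzip stream starts with the two magic bytes 0x1f 0x8b.
    if (strncmp($body, "\x1f\x8b", 2) !== 0) {
        return $body; // Not gzip: hand the data back unchanged.
    }

    // gzdecode() returns false (and warns) on corrupt input; @ silences the warning.
    $decoded = @gzdecode($body);

    // Fall back to the raw bytes if decoding failed.
    return $decoded === false ? $body : $decoded;
}

// Stand-in for a compressed response body, so the sketch runs without network access.
$compressed = gzencode('<html>hello</html>');
echo maybe_gunzip($compressed), "\n";
```

With a real request you would pass the result of file_get_contents($url) to the helper instead of the sample string.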

Answer 2

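The site's response is compressed, so you have to decompress it to get it back into its original form.

The quickest way is to use gzinflate() after stripping the 10-byte gzip header and the 8-byte trailer, like this: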

$html = gzinflate(substr(file_get_contents("https://kat.cr/usearch/architecture%20category%3Abooks/"), 10, -8));
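To see why the offsets 10 and -8 work, here is a small local round trip. It assumes the gzip header carries no optional fields (file name, comment, extra data), which holds for gzencode() output but is not guaranteed for every server; gzdecode() parses the header itself and does not depend on that assumption. The sample string is arbitrary.

```php
<?php
// Arbitrary sample payload, compressed locally so no network access is needed.
$original = str_repeat('<p>architecture books</p>', 20);
$gz = gzencode($original);

// The first two bytes of a gzip stream are the magic number 0x1f 0x8b.
$magic = bin2hex(substr($gz, 0, 2)); // "1f8b"

// Manual route: drop the 10-byte header and the 8-byte trailer
// (CRC32 plus uncompressed size), then inflate the raw deflate data.
$manual = gzinflate(substr($gz, 10, -8));

// Built-in route: gzdecode() reads the header and trailer itself.
$builtin = gzdecode($gz);

var_dump($manual === $original, $builtin === $original);
```

If a server does include optional header fields, the fixed offset of 10 would cut into them, so gzdecode() is probably the safer default.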

Or, for a more advanced solution, consider the following function (found on this blog):

function get_url($url)
{
    // A user agent is needed; otherwise some sites such as google.com won't send compressed content.
    // Header lines must end with "\r\n". Only gzip is advertised, because that is
    // the only encoding this function knows how to undo.
    $opts = array(
        'http' => array(
            'method' => "GET",
            'header' => "Accept-Language: en-US,en;q=0.8\r\n" .
                        "Accept-Encoding: gzip\r\n" .
                        "Accept-Charset: UTF-8,*;q=0.5\r\n" .
                        "User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:19.0) Gecko/20100101 Firefox/19.0 FirePHP/0.4\r\n"
        )
    );

    $context = stream_context_create($opts);
    $content = file_get_contents($url, false, $context);

    // If the HTTP response headers say the content is gzipped, uncompress it.
    foreach ($http_response_header as $h)
    {
        if (stristr($h, 'content-encoding') && stristr($h, 'gzip'))
        {
            // Strip the gzip header and trailer, then inflate the raw deflate data.
            $content = gzinflate(substr($content, 10, -8));
            break; // Decompress only once.
        }
    }

    return $content;
}

echo get_url('http://www.google.com/');
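One caveat with scanning $http_response_header line by line: as far as I know, when redirects are followed the array holds the headers of every response in the chain, so a gzip header from an earlier redirect could be mistaken for the final one. A hedged sketch of a stricter check follows; the function name response_is_gzipped() is hypothetical and the header arrays are made-up samples.

```php
<?php
// Hypothetical helper: reports whether the LAST response in the list is gzip-encoded.
function response_is_gzipped(array $headers): bool
{
    $gzipped = false;

    foreach ($headers as $line) {
        // A status line marks the start of a new response (e.g. after a redirect),
        // so forget whatever the previous response declared.
        if (strncasecmp($line, 'HTTP/', 5) === 0) {
            $gzipped = false;
            continue;
        }

        // Split "Name: value" and compare the header name exactly, ignoring case.
        $parts = explode(':', $line, 2);
        if (count($parts) === 2 && strcasecmp(trim($parts[0]), 'Content-Encoding') === 0) {
            $gzipped = stripos($parts[1], 'gzip') !== false;
        }
    }

    return $gzipped;
}

// Made-up sample headers for a direct response and for a redirect chain.
$direct   = array('HTTP/1.1 200 OK', 'Content-Encoding: gzip');
$redirect = array('HTTP/1.1 301 Moved Permanently', 'Content-Encoding: gzip',
                  'HTTP/1.1 200 OK', 'Content-Type: text/html');

var_dump(response_is_gzipped($direct), response_is_gzipped($redirect));
```

Inside get_url() this check could replace the stristr() loop, with $http_response_header passed in as the argument.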

php

httpresponse

file-get-contents

inflate