Symfony dom-crawler: convert string in script tag to UTF-8
I have this HTML content:
<div>测试</div>
<script charset="utf-8" type="text/javascript">
function drawCharts(){
console.log('测试');
}
</script>
When I use Symfony's dom-crawler, the text gets HTML-encoded. How can I prevent this? $crawler->html() results in:
<div>测试</div>
<script>
function drawCharts(){
console.log('&#27979;&#35797;');
}
Let's look at how symfony/dom-crawler works. Here is an example to start with:
<?php
require 'vendor/autoload.php';
use Symfony\Component\DomCrawler\Crawler;
$html = <<<HTML
<div>测试</div>
<script charset="utf-8" type="text/javascript">
function drawCharts(){
console.log('测试');
}
</script>
HTML;
$crawler = new Crawler($html);
print $crawler->html();
It outputs:
<div>æµè¯</div>
<script charset="utf-8" type="text/javascript">
function drawCharts(){
console.log('测试');
}
</script>
When you pass the content through the constructor, the Crawler class does its best to figure out the encoding. If it fails to figure anything out, it falls back to ISO-8859-1, which is the default charset defined by the HTTP 1.1 specification.
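To make the fallback concrete, here is a minimal sketch in plain PHP (mbstring extension, no Crawler involved) of what happens when UTF-8 bytes are read as ISO-8859-1. The variable names are only for illustration, and this mimics the effect rather than the Crawler's actual code path:

```php
<?php
// Assumes this file is saved as UTF-8 and ext-mbstring is loaded.
$utf8 = '测试'; // 6 bytes: E6 B5 8B E8 AF 95

// Treat each byte as one ISO-8859-1 character and re-encode it as UTF-8.
$mojibake = mb_convert_encoding($utf8, 'UTF-8', 'ISO-8859-1');

// Starts with "æµ", like the garbled div content in the output above.
print $mojibake . PHP_EOL;

// The damage is reversible as long as no bytes were dropped along the way.
$restored = mb_convert_encoding($mojibake, 'ISO-8859-1', 'UTF-8');
var_dump($restored === $utf8);
```

So the garbled div is not data loss, just the right bytes read with the wrong charset.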
If your HTML content properly includes a charset meta tag, the Crawler class will read the charset from it, set it, and convert from it. Here is the same example as above, with a charset meta tag prepended to the HTML content:
<?php
require 'vendor/autoload.php';
use Symfony\Component\DomCrawler\Crawler;
$html = <<<HTML
<meta charset="utf-8">
<div>测试</div>
<script charset="utf-8" type="text/javascript">
function drawCharts(){
console.log('测试');
}
</script>
HTML;
$crawler = new Crawler($html);
print $crawler->html();
Now it prints:
<div>测试</div>
<script charset="utf-8" type="text/javascript">
function drawCharts(){
console.log('&#27979;&#35797;');
}
</script>
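If you don't want to add a charset meta tag, there is another way: the addHTMLContent() method accepts a charset as its second argument, which defaults to UTF-8. Instead of passing the HTML content through the constructor, first instantiate the class, then use this method to add the content: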
<?php
require 'vendor/autoload.php';
use Symfony\Component\DomCrawler\Crawler;
$html = <<<HTML
<div>测试</div>
<script charset="utf-8" type="text/javascript">
function drawCharts(){
console.log('测试');
}
</script>
HTML;
$crawler = new Crawler;
// The 2nd argument defaults to 'UTF-8', so you can safely drop it
$crawler->addHTMLContent($html, 'UTF-8');
print $crawler->html();
Now, without a charset meta tag, it prints:
<div>测试</div>
<script charset="utf-8" type="text/javascript">
function drawCharts(){
console.log('&#27979;&#35797;');
}
</script>
OK, you probably knew all of this already. So what's with &#27979;&#35797;? Why is the div content shown as is, while the same content in the script tag gets HTML-encoded?
Symfony's Crawler class, as it explains itself, converts the content to HTML entities due to a bug in DOMDocument::loadHTML():
When using loadHTML() to process UTF-8 pages, you may meet the problem that the output of DOM functions are not like the input. For example, if you want to get "Cạnh tranh", you will receive "Cáº¡nh tranh". I suggest we use mb_convert_encoding before loading UTF-8 page.
– https://php.net/manual/en/domdocument.loadhtml.php#74777
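Here is a small sketch of that behaviour with plain DOMDocument. It assumes ext-dom and ext-mbstring are available, and mb_encode_numericentity() stands in for whatever conversion your installed Crawler version performs, so treat it as an illustration rather than the Crawler's exact code. The parser decodes entities in ordinary text, but it reads script content as raw text and leaves them alone:

```php
<?php
// Assumes this file is saved as UTF-8.
$html = "<div>测试</div><script>console.log('测试');</script>";

// Turn every non-ASCII character into a decimal numeric entity.
$encoded = mb_encode_numericentity($html, [0x80, 0x10FFFF, 0, 0x1FFFFF], 'UTF-8');

libxml_use_internal_errors(true); // keep parser warnings out of the output
$dom = new DOMDocument();
$dom->loadHTML($encoded);

// Entities in normal text are decoded back to the original characters.
$div = $dom->getElementsByTagName('div')->item(0)->textContent;

// Script content is not entity-decoded, so the entities survive verbatim.
$script = $dom->getElementsByTagName('script')->item(0)->textContent;

print $div . PHP_EOL;
print $script . PHP_EOL;
```

That is why only the script tag ends up with entities: the encoding step touches the whole document, and only the script content never gets decoded again.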
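Some suggest adding an HTML4 Content-Type meta tag to the head element. Others suggest prepending <?xml encoding="UTF-8"> to the HTML content before passing it to loadHTML(). Since your HTML structure is incomplete (missing head, body, etc.), I suggest you simply pass the output to html_entity_decode():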
<?php
require 'vendor/autoload.php';
use Symfony\Component\DomCrawler\Crawler;
$html = <<<HTML
<div>测试</div>
<script charset="utf-8" type="text/javascript">
function drawCharts(){
console.log('测试');
}
</script>
HTML;
$crawler = new Crawler();
$crawler->addHTMLContent($html, 'UTF-8');
print html_entity_decode($crawler->html());
Output:
<div>测试</div>
<script charset="utf-8" type="text/javascript">
function drawCharts(){
console.log('测试');
}
</script>
That's what you wanted.
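One caveat, sketched below with a made-up $output string rather than real Crawler output: html_entity_decode() on the whole document also decodes entities that are meant to stay escaped, such as &lt;. If that matters for your content, one option (my own assumption, not something the Crawler provides) is to decode only inside script elements with a simple regex:

```php
<?php
// Hypothetical output: escaped markup in the div, numeric entities in the script.
$output = "<div>测试 &lt;b&gt;</div>\n<script>console.log('&#27979;&#35797;');</script>";

// Decoding everything also turns &lt;b&gt; into a real <b> tag.
$all = html_entity_decode($output, ENT_QUOTES | ENT_HTML5, 'UTF-8');

// Decode only the text between <script ...> and </script>.
$scriptsOnly = preg_replace_callback(
    '~(<script\b[^>]*>)(.*?)(</script>)~is',
    static function (array $m): string {
        return $m[1] . html_entity_decode($m[2], ENT_QUOTES | ENT_HTML5, 'UTF-8') . $m[3];
    },
    $output
);

print $scriptsOnly . PHP_EOL;
```

A regex is good enough for simple, well-formed script blocks like the one in the question; for arbitrary pages, walking the script nodes of the parsed document would be more robust.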
You may also want to read:
PHP DOMDocument loadHTML not encoding UTF-8 correctly