PHP DomDocument - Chinese characters inside script tag malformed
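I'm trying to parse a simple HTML document that contains Chinese characters inside a script tag. However, after PHP DOMDocument processes it, those characters are converted into strange characters.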
<?php
$html = <<<EOD
<!DOCTYPE html>
<html>
<head>
<script>
const str = "訂閱最新指南";
</script>
</head>
<body>
</body>
</html>
EOD;
$dom = new DOMDocument();
$html = mb_convert_encoding($html, 'HTML-ENTITIES', 'UTF-8');
$dom->loadHTML($html);
// Trying different approaches to get correct output
echo $dom->saveHTML();
echo $dom->saveHTML($dom->documentElement);
echo utf8_decode($dom->saveHTML($dom->documentElement));
echo utf8_decode($dom->saveHTML());
Output:
<!DOCTYPE html>
<html>
<head>
<script>
const str = "訂閱最新指南";
</script>
</head>
<body>
</body>
</html>
<html>
<head>
<script>
const str = "訂閱最新指南";
</script>
</head>
<body>
</body>
</html><html>
<head>
<script>
const str = "訂閱最新指南";
</script>
</head>
<body>
</body>
</html><!DOCTYPE html>
<html>
<head>
<script>
const str = "訂閱最新指南";
</script>
</head>
<body>
</body>
</html>
Without mb_convert_encoding:
<?php
$html = <<<EOD
<!DOCTYPE html>
<html>
<head>
<script>
const str = "訂閱最新指南";
</script>
</head>
<body>
</body>
</html>
EOD;
$dom = new DOMDocument();
$dom->loadHTML($html);
echo utf8_decode($dom->saveHTML($dom->documentElement));
Result:
<html>
<head><script>
const str = "訂閱最新指南";
</script></head>
<body>
</body>
</html>
With mb_convert_encoding:
<?php
$html = <<<EOD
<!DOCTYPE html>
<html>
<head>
<script>
const str = "訂閱最新指南";
</script>
</head>
<body>
</body>
</html>
EOD;
$dom = new DOMDocument();
$html = mb_convert_encoding($html, 'HTML-ENTITIES', 'UTF-8');
$dom->loadHTML($html);
echo html_entity_decode($dom->saveHTML($dom->documentElement));
Result:
<html><head><script>
const str = "訂閱最新指南";
</script></head><body>
</body></html>