XML encoding error, but both XML and input text encoding are UTF-8 in PHP

Question

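I'm generating an XML document with PHP's DOMDocument. It contains news items, each with a title, date, link and description. The problem occurs in the descriptions of some news items but not others, even though all of them contain accented characters and other diacritics.

I create the XML DOM document as UTF-8: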

$dom = new \DOMDocument("1.0", "UTF-8");

Then I retrieve my text from a MySQL database that is encoded in latin-1. When I test the encoding with mb_detect_encoding, it returns UTF-8.

I tried the following:

iconv('UTF-8', 'ISO-8859-1', $descricao);
iconv('UTF-8', 'ISO-8859-1//TRANSLIT', $descricao);
iconv('ISO-8859-1', 'UTF-8', $descricao);
iconv('ISO-8859-1//TRANSLIT', 'UTF-8', $descricao);
mb_convert_encoding($descricao, 'ISO-8859-1', 'UTF-8');
mb_convert_encoding($descricao, 'UTF-8', 'ISO-8859-1');
mb_convert_encoding($descricao, 'UTF-8', 'UTF-8'); //that makes no sense, but who knows

I also tried changing the database encoding to UTF-8 and the XML encoding to ISO-8859-1.

Here is the full method that generates the XML:

$informativos = Informativo::where('inf_ativo','S')->orderBy('inf_data','DESC')->take(20)->get();
$dom = new \DOMDocument("1.0", "UTF-8");
$dom->preserveWhiteSpace = false;
$dom->formatOutput = true;
$rss = $dom->createElement("rss");

$channel = $dom->createElement("channel");
$title = $dom->createElement("title", "Informativos");
$link = $dom->createElement("link", "http://example.com/informativos");

$channel->appendChild($title);
$channel->appendChild($link);

foreach ($informativos as $informativo) {
    $item = $dom->createElement("item");

    $itemTitle = $dom->createElement("title", $informativo->inf_titulo);
    $itemImage = $dom->createElement("image", "http://example.com/".$informativo->inf_ilustracao);
    $itemLink = $dom->createElement("link", "http://example.com/informativo/".$informativo->informativo_id);
    $descricao = strip_tags($informativo->inf_descricao);
    $descricao = str_replace("&nbsp;", " ", $descricao);
    $descricao = str_replace("&#13;", " ", $descricao);
    $descricao = substr($descricao, 0, 150).'...';
    $itemDesc = $dom->createElement("description", $descricao);
    $itemDate = $dom->createElement("pubDate", $informativo->inf_data);

    $item->appendChild($itemTitle);
    $item->appendChild($itemImage);
    $item->appendChild($itemLink);
    $item->appendChild($itemDesc);
    $item->appendChild($itemDate);

    $channel->appendChild($item);
}

$rss->appendChild($channel);

$dom->appendChild($rss);

return $dom->saveXML();

Here is an example of a text that is output successfully:

Segundo a instituição, número de pessoas que vivem na pobreza subiu 7,3 milhões desde 2014, atingindo 21% da população, ou 43,5 milhões de br

And an example that gives the encoding error:

procuradores da Lava Jato em Curitiba, que estão sendo investigados por um&#13;
suposto acordo fraudulento com a Petrobras e o Departamento de Justi�...

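Everything renders fine until the problematic description text above, which gives me:

"This page contains the following errors: error on line 118 at column 20: Encoding error. Below is a rendering of the page up to the first error."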

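Probably a special character in that text is the problem. Since I have no control over whether the text contains it, I need to render these special characters correctly.

Update 2019-04-12: Found the error in the problematic text and changed the example.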

Answer 1

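The encoding of the database connection matters. Make sure it is set to UTF-8. Most of the time it is a good idea to use UTF-8 (for your fields, too). Character sets like ISO-8859-1 contain only a very limited set of characters, so encoding a Unicode string into one of them can lose data.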
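As a minimal sketch of what "set the connection to UTF-8" can look like with plain PDO: the `charset` part of the MySQL DSN selects the connection character set. The helper name `mysqlUtf8Dsn`, the host and the database name are placeholders, not taken from the question. In a Laravel application the equivalent setting is normally the `charset` key of the MySQL connection in `config/database.php`.

```php
<?php
// Hypothetical helper: builds a PDO DSN that requests a UTF-8 connection to MySQL.
function mysqlUtf8Dsn(string $host, string $database): string
{
    // utf8mb4 is MySQL's full UTF-8 character set; the host and database
    // passed in below are placeholder values.
    return sprintf('mysql:host=%s;dbname=%s;charset=utf8mb4', $host, $database);
}

$dsn = mysqlUtf8Dsn('localhost', 'news');
echo $dsn, "\n";

// Connecting needs a real server, so it is left commented out here:
// $pdo = new PDO($dsn, 'user', 'secret', [PDO::ATTR_ERRMODE => PDO::ERRMODE_EXCEPTION]);
```

With the connection character set declared, MySQL should convert text from a latin-1 column to UTF-8 when it sends the rows, so no iconv() or mb_convert_encoding() call should be needed on the PHP side.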

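The second argument of DOMDocument::createElement() is broken. It only encodes some special characters, but not &. To avoid problems, create the content as a separate text node and append it. DOMNode::appendChild() returns the appended node, so the DOMDocument::create* methods can be nested and chained.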

$data = [
  [
    'inf_titulo' => 'Foo',
    'inf_ilustracao' => 'foo.jpg',
    'informativo_id' => 42,
    'inf_descricao' => 'Some content',
    'inf_data' => 'a-date'
  ]  
];
$informativos = json_decode(json_encode($data));

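// Strip markup, replace two known entities with spaces, then truncate.
function stripTagsAndTruncate($text) {
    $text = strip_tags($text);
    $text = str_replace(["&nbsp;", "&#13;"], " ", $text);
    // mb_substr() counts characters, so a multi-byte UTF-8 character is never cut in half
    return mb_substr($text, 0, 150, 'UTF-8').'...';
}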

$dom = new DOMDocument('1.0', 'UTF-8');
$rss = $dom->appendChild($dom->createElement('rss'));
$channel = $rss->appendChild($dom->createElement("channel"));
$channel
  ->appendChild($dom->createElement("title"))
  ->appendChild($dom->createTextNode("Informativos"));
$channel
  ->appendChild($dom->createElement("link"))
  ->appendChild($dom->createTextNode("http://example.com/informativos"));

foreach ($informativos as $informativo) {
    $item = $channel->appendChild($dom->createElement("item"));

    $item
      ->appendChild($dom->createElement("title"))
      ->appendChild($dom->createTextNode($informativo->inf_titulo));
    $item
      ->appendChild($dom->createElement("image"))
      ->appendChild($dom->createTextNode("http://example.com/".$informativo->inf_ilustracao));
    $item
      ->appendChild($dom->createElement("link"))
      ->appendChild($dom->createTextNode("http://example.com/informativo/".$informativo->informativo_id));
    $item
      ->appendChild($dom->createElement("description"))
      ->appendChild($dom->createTextNode(stripTagsAndTruncate($informativo->inf_descricao)));
    $item
      ->appendChild($dom->createElement("pubDate"))
      ->appendChild($dom->createTextNode($informativo->inf_data));
}
$dom->formatOutput = TRUE;
echo $dom->saveXML();

Output:

<?xml version="1.0" encoding="UTF-8"?> 
<rss>
  <channel>
    <title>Informativos</title> 
    <link>http://example.com/informativos</link> 
    <item> 
      <title>Foo</title> 
      <image>http://example.com/foo.jpg</image> 
      <link>http://example.com/informativo/42</link> 
      <description>Some content...</description> 
      <pubDate>a-date</pubDate> 
    </item> 
  </channel> 
</rss>
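The element-plus-text-node pattern used above can be wrapped in a small helper. This is only a sketch: `appendTextElement` is a hypothetical name, not part of the DOM extension, and the sample title is made up. It shows that a text node escapes `&` on output, whereas passing the same string as the second argument of createElement() typically triggers an "unterminated entity reference" warning.

```php
<?php
// Hypothetical helper: appends <$name> to $parent with $text as a real text node.
function appendTextElement(DOMNode $parent, string $name, string $text): DOMNode
{
    // A DOMDocument has no ownerDocument, so handle the document itself as parent.
    $document = $parent instanceof DOMDocument ? $parent : $parent->ownerDocument;
    $element = $parent->appendChild($document->createElement($name));
    // createTextNode() takes the raw string; escaping happens on serialization.
    $element->appendChild($document->createTextNode($text));
    return $element;
}

$dom = new DOMDocument('1.0', 'UTF-8');
$channel = $dom->appendChild($dom->createElement('channel'));
$title = appendTextElement($channel, 'title', 'Lava Jato & Petrobras');

// Serialize only the new element.
echo $dom->saveXML($title), "\n";
// <title>Lava Jato &amp; Petrobras</title>
```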

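Truncating an HTML fragment can result in broken entities and broken code points (if you don't use UTF-8-aware string functions). Here are two ways to avoid that.

You can use PCRE in UTF-8 mode and match n entities/code points: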

// have some string with HTML and entities
$text = 'Hello<b>äöü</b>&nbsp;&auml;&#13; foobar';

// strip tags and replace some specific entities with spaces
$stripped = str_replace(['&nbsp;', '&#13;'], ' ', strip_tags($text));
// match 0-10 entities or unicode codepoints
preg_match('(^(?:&[^;]+;|\X){0,10})u', $stripped, $match);
var_dump($match[0]);

Output:

string(18) "Helloäöü &auml;"
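For contrast, here is a sketch of what byte-based truncation does to text like the failing description in the question. `truncateUnits` is a hypothetical wrapper around the pattern above, and the cut position 22 is chosen so that it falls inside the two-byte ç of "Justiça". The pattern treats anything between `&` and the next `;` as one entity, which is an assumption that holds for the sample text but not for arbitrary input.

```php
<?php
// Hypothetical helper: keeps at most $max units, where a unit is an
// entity like &auml; or one grapheme cluster (\X).
function truncateUnits(string $text, int $max): string
{
    if (preg_match('(^(?:&[^;]+;|\X){0,' . $max . '})u', $text, $match) !== 1) {
        return ''; // preg_match() fails in UTF-8 mode if the input is not valid UTF-8
    }
    return $match[0];
}

// "Departamento de Justiça" with ç written as its two UTF-8 bytes
$text = "Departamento de Justi\xC3\xA7a";

$byBytes = substr($text, 0, 22);     // keeps only the first byte of ç
$byUnits = truncateUnits($text, 22); // keeps ç as a whole character

var_dump(mb_check_encoding($byBytes, 'UTF-8')); // bool(false)
var_dump(mb_check_encoding($byUnits, 'UTF-8')); // bool(true)
```

A string that fails mb_check_encoding() like `$byBytes` is the kind of input that produces an encoding error once it is written into a UTF-8 XML document.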

However, I suggest using DOM. It can load HTML and lets you run XPath expressions on it.

// have some string with HTML and entities
$text = 'Hello<b>äöü</b>&nbsp;&auml;&#13; foobar';

$document = new DOMDocument();
// force UTF-8 and load
$document->loadHTML('<?xml encoding="UTF-8"?>'.$text);
$xpath = new DOMXpath($document);
// use xpath to fetch the first 10 characters of the text content
var_dump($xpath->evaluate('substring(//body, 1, 10)'));

Output:

string(15) "Helloäöü ä"

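DOM generally treats all strings as UTF-8, so code points are not a problem. XPath's substring() works on the text content of the first matched node. Its arguments are character positions (not indexes), so they start at 1.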

DOMDocument::loadHTML() adds html and body tags and decodes entities. The result is a bit cleaner than with the PCRE approach.
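The DOM approach can be folded into one function for use in the feed loop. This is a sketch under assumptions: `htmlExcerpt` is a hypothetical name, and silencing parser warnings with libxml_use_internal_errors() is an addition of this example, not something the code above does.

```php
<?php
// Hypothetical helper: plain-text excerpt of an HTML fragment, $length characters long.
function htmlExcerpt(string $html, int $length): string
{
    // Collect parser warnings about sloppy markup instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $document = new DOMDocument();
    // The leading processing instruction makes the HTML parser read the input as UTF-8.
    $document->loadHTML('<?xml encoding="UTF-8"?>' . $html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($document);
    // XPath positions start at 1 and count characters, not bytes.
    return $xpath->evaluate(sprintf('substring(//body, 1, %d)', $length));
}

$excerpt = htmlExcerpt('Hello<b>äöü</b>&nbsp;&auml;&#13; foobar', 10);
var_dump(mb_strlen($excerpt, 'UTF-8'));         // int(10)
var_dump(mb_check_encoding($excerpt, 'UTF-8')); // bool(true)
```

The returned string contains decoded characters rather than entities, so it can be passed to createTextNode() directly, with '...' appended if needed.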
