PHP encoding Windows-1257 to UTF-8 error

I'm having trouble converting a Windows-1257 file to UTF-8. The original file has <?xml version="1.0" encoding="windows-1257"?> at the top, and I tried to convert it with this code:

iconv_set_encoding("internal_encoding", "UTF-8");
iconv_set_encoding("output_encoding", "ISO-8859-1");

$baltic_xml = file_get_contents($remote_file);
$unicode_xml = iconv("UTF-8", "UTF-8//IGNORE", $baltic_xml);
file_put_contents('data/rmtools/import/utf8/'.$files_single, $unicode_xml);

It saves the file as UTF-8, but when I open that file I still get this error:

XML parsing error: Input is not proper UTF-8, indicate encoding ! Bytes: 0x04 0x50 0x72 0x65

Is there a proper way to convert it to readable UTF-8, or does this mean the file still contains some symbols that are not valid UTF-8?

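You are converting from UTF-8 to UTF-8//IGNORE, which is why you get that error: the input is never treated as Windows-1257. The first parameter of iconv is in_charset, the encoding of the input (see iconv on PHP.net). Please change

$unicode_xml = iconv("UTF-8", "UTF-8//IGNORE", $baltic_xml);

to

$unicode_xml = iconv("CP1257", "UTF-8//IGNORE", $baltic_xml);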
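
To make the fix above concrete, here is a minimal sketch rather than a definitive implementation. The helper name baltic_to_utf8 and the sample string are hypothetical and not from the question. It also assumes that you want the XML declaration rewritten to say UTF-8, since a parser may otherwise still trust the old windows-1257 label, and that your system's iconv knows the name CP1257.

```php
<?php
// Hypothetical helper (not from the question): Windows-1257 bytes in, UTF-8 out.
function baltic_to_utf8(string $bytes): string
{
    // iconv cannot detect the source charset, so it has to be named explicitly.
    $utf8 = iconv('CP1257', 'UTF-8//IGNORE', $bytes);
    if ($utf8 === false) {
        throw new RuntimeException('iconv could not convert the input');
    }

    // Assumption: the XML declaration should advertise the new encoding too.
    $fixed = preg_replace(
        '/^(<\?xml[^>]*encoding=["\'])windows-1257(["\'])/i',
        '${1}UTF-8${2}',
        $utf8,
        1
    );

    // preg_replace returns null on failure; fall back to the converted text.
    return $fixed ?? $utf8;
}

// "\xE0" is the letter a-with-ogonek in Windows-1257.
$sample = "<?xml version=\"1.0\" encoding=\"windows-1257\"?><a>\xE0</a>";
$result = baltic_to_utf8($sample);

// True when every byte sequence in the result is well-formed UTF-8.
var_dump(mb_check_encoding($result, 'UTF-8'));
```

In the question's code you would pass $baltic_xml through such a helper before file_put_contents.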

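However, I would personally suggest using mb_*, because iconv depends heavily on your OS's iconv implementation and can behave differently from one OS to another. mb_*, on the other hand, is a pure PHP extension and is consistent. To make your code use mb_*, change the whole thing to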

ini_set('mbstring.substitute_character', 'none'); // drop unknown characters, the mbstring counterpart of //IGNORE in iconv
$baltic_xml = file_get_contents($remote_file);
// mbstring has no CP1257, so ISO-8859-13 is used as the closest supported source encoding
$unicode_xml = mb_convert_encoding($baltic_xml, 'UTF-8', 'ISO-8859-13');
// no utf8_encode() here: it assumes ISO-8859-1 input and would double-encode the UTF-8 string
$unicode_xml = preg_replace('/[^\PC\s]/u', '', $unicode_xml); // remove control characters, if there are any
file_put_contents('data/rmtools/import/utf8/' . $files_single, $unicode_xml);

According to the list of encodings supported by mbstring, CP1257 is not one of them, so you may use ISO-8859-13 instead. Please note, however, that the two encodings differ in some graphical characters (the language characters seem to be consistent, according to Wikipedia).
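
If you would rather not hard-code the fallback, you can ask mbstring at runtime which encodings it offers. This is only a sketch: pick_baltic_encoding is a hypothetical name, and it assumes that ISO-8859-13 is an acceptable stand-in for your data whenever Windows-1257 is not listed.

```php
<?php
// Hypothetical helper: choose the first Baltic encoding this mbstring build supports.
function pick_baltic_encoding(): string
{
    // mb_list_encodings() returns the canonical names of all supported encodings.
    $supported = array_map('strtolower', mb_list_encodings());

    foreach (['Windows-1257', 'CP1257', 'ISO-8859-13'] as $candidate) {
        if (in_array(strtolower($candidate), $supported, true)) {
            return $candidate;
        }
    }

    throw new RuntimeException('mbstring offers no Baltic single-byte encoding');
}

$encoding = pick_baltic_encoding();

// "\xF0" is s-with-caron in both Windows-1257 and ISO-8859-13.
$utf8 = mb_convert_encoding("\xF0", 'UTF-8', $encoding);

echo $encoding, ': ', bin2hex($utf8), PHP_EOL;
```

Checking the list first means the code picks up Windows-1257 by itself if a build of mbstring ever offers it, and fails loudly instead of silently mis-converting when neither encoding is available.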