Writing UTF-8 strings to Word using PHP/COM
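I am trying to generate a Word document with PHP/COM, using data from a MySQL database. If the data in the database is plain ASCII text (e.g. "hello"), it shows up correctly in the Word document. If the data contains non-ASCII (multi-byte) characters (e.g. "Māori"), they show up correctly but are followed by "funny" characters at the end (e.g. NULLs, spaces, or Chinese symbols).
Environment: I am using Windows 7 Enterprise, Apache, MySQL, PHP 5.2.17 and Microsoft Office 2010.
Here is a simplified example. I don't even use the database or write to a Word document; I simply call Word's CleanString method to reproduce the problem: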
private function _cleanString($wordApp, $str)
{
    // Wrap the input as a BSTR variant, declaring its encoding as UTF-8.
    $vStr = new VARIANT($str, VT_BSTR, CP_UTF8);
    $bytes = strlen($vStr);
    $chars = mb_strlen($vStr, "UTF-8");
    echo "Test string: $vStr (bytes=$bytes, chars=$chars)<br/>";

    // Round-trip the string through Word and measure it again.
    $vStr = $wordApp->CleanString($vStr);
    $bytes = strlen($vStr);
    $chars = mb_strlen($vStr, "UTF-8");
    echo "Test string (after cleaning): $vStr (bytes=$bytes, chars=$chars)<br/>";
    echo "<br/>";
}

public function testUtf8Strings()
{
    com_load_typelib('Word.Application');
    // Specifying codepage as CP_UTF8 to let COM/Word know strings I pass in will be in UTF-8 format.
    $wordApp = new COM("word.application", null, CP_UTF8) or die ("couldn't create an instance of word");
    echo "Loaded Word, version {$wordApp->Version} <br/>";
    $wordApp->visible = false;
    echo "<br/>";

    $this->_cleanString($wordApp, 'No multi-byte characters.');
    $this->_cleanString($wordApp, 'Multi-byte chars: Māori 楠 test.');
    $this->_cleanString($wordApp, 'Multi-byte chars: Ā ā Ē ē Ī.');

    $wordApp->Quit(false); // Important: must say 'false', otherwise Word does not close
    $wordApp = null;
    echo "Quit Word.";
    return;
}
The HTML output is:
Loaded Word, version 14.0
Test string: No multi-byte characters. (bytes=25, chars=25)
Test string (after cleaning): No multi-byte characters. (bytes=25, chars=25)
Test string: Multi-byte chars: Māori 楠 test. (bytes=34, chars=31)
Test string (after cleaning): Multi-byte chars: Māori 楠 test. 5 (bytes=39, chars=34)
Test string: Multi-byte chars: Ā ā Ē ē Ī. (bytes=33, chars=28)
Test string (after cleaning): Multi-byte chars: Ā ā Ē ē Ī. 琠獥㔠 (bytes=46, chars=33)
Quit Word.
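The CleanString method strips non-printing characters from the given string, changing them to spaces. Since my strings are already "clean", I expect to get the same string back. That is not what happens when the string contains multi-byte characters. It looks as if the byte count of the original string is being used as the character count of the returned string: the 31-character, 34-byte string comes back with 34 characters, and the 28-character, 33-byte string comes back with 33.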
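The byte-versus-character arithmetic behind that observation can be checked without Word or COM at all. The following is a minimal sketch, assuming the mbstring extension is loaded and the file is saved as UTF-8; `byteAndCharCounts` is a hypothetical helper name introduced only for this illustration. The "surplus" it prints is the number of extra trailing characters one would expect if a byte count were mistaken for a character count.

```php
<?php
// Illustration only: no COM or Word involved, just the length arithmetic.
// Assumes the mbstring extension and a UTF-8 encoded source file.

// Hypothetical helper: byte count and character count of a UTF-8 string.
function byteAndCharCounts($str)
{
    return array(
        'bytes' => strlen($str),             // strlen() counts bytes
        'chars' => mb_strlen($str, 'UTF-8'), // mb_strlen() counts characters
    );
}

$samples = array(
    'No multi-byte characters.',
    'Multi-byte chars: Māori 楠 test.',
    'Multi-byte chars: Ā ā Ē ē Ī.',
);

foreach ($samples as $s) {
    $c = byteAndCharCounts($s);
    // Extra characters expected if bytes were treated as the character count.
    $surplus = $c['bytes'] - $c['chars'];
    echo "$s (bytes={$c['bytes']}, chars={$c['chars']}, surplus=$surplus)\n";
}
```

For the two multi-byte test strings the surplus matches the number of stray characters in the "after cleaning" lines of the output above (34 versus 31 characters, and 33 versus 28).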
It turns out this is a PHP bug (https://bugs.php.net/bug.php?id=66431), which was fixed in PHP 5.4.29. I tested with PHP 5.5.19 and the problem no longer occurs. The HTML output is:
Loaded Word, version 14.0
Test string: No multi-byte characters. (bytes=25, chars=25)
Test string (after cleaning): No multi-byte characters. (bytes=25, chars=25)
Test string: Multi-byte chars: Māori 楠 test. (bytes=34, chars=31)
Test string (after cleaning): Multi-byte chars: Māori 楠 test. (bytes=34, chars=31)
Test string: Multi-byte chars: Ā ā Ē ē Ī. (bytes=33, chars=28)
Test string (after cleaning): Multi-byte chars: Ā ā Ē ē Ī. (bytes=33, chars=28)
Quit Word.
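If the script might still be deployed on an older interpreter, a version guard can make the failure mode visible instead of silently corrupting documents. This is only a sketch: `comUtf8FixExpected` is a hypothetical helper, not part of PHP or its COM extension, and the 5.4.29 threshold is taken from the bug report mentioned above rather than verified for every release branch.

```php
<?php
// Hypothetical helper (not a PHP or COM API).
// Assumption: 5.4.29 is the first release containing the fix, and 5.5.19 is
// known good from the test above. Early 5.5.x releases compare as newer than
// 5.4.29 but may predate the fix; this simple check does not account for that.
function comUtf8FixExpected($phpVersion)
{
    return version_compare($phpVersion, '5.4.29', '>=');
}

if (!comUtf8FixExpected(PHP_VERSION)) {
    trigger_error(
        'PHP ' . PHP_VERSION . ' may return over-long strings from COM calls made with CP_UTF8',
        E_USER_WARNING
    );
}
```

A stricter alternative is to run the CleanString round trip from the example once at startup and compare character counts before and after, since that tests the actual behaviour rather than a version number.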