How to achieve fool proof unicode handling in Perl CGI?

Question

So I have a MySQL database serving an old WordPress database that I backed up. I am writing some simple Perl scripts to serve the WordPress articles (I don't want a WordPress install).

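WordPress, for whatever reason, stored all quotes as Unicode characters. Every "..." is stored as a Unicode character, as are all double dashes and all apostrophes, and there are Unicode nbsp characters all over the place. It is a mess (which is why I don't want a WordPress install).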

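In my test environment (Linux Mint 17.1, Perl 5.18.2, MySQL 5.5), things mostly work fine when the Content-type line is served with "charset=utf-8" (except apostrophes, which never decode properly no matter what combination of things I try). Omitting the charset causes all Unicode characters to break (except that the apostrophes now work). This is OK; apart from the apostrophes, I understand what is going on and I have a handle on the data.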

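My production environment is a VM running Linux CentOS 6.5, Perl 5.10.1, and MySQL 5.6.22, and there things do not work at all. Whether or not I include "charset=utf-8" in the Content-type makes no difference: no Unicode characters display correctly (including apostrophes). Maybe it has to do with the older version of Perl? Does anyone have any insight?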

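Beyond this very specific case, does anyone know of a simple, fool-proof Perl idiom for handling Unicode that comes from a database? (I am not sure where in the pipeline things are going wrong, but I suspect it is at the database driver level.)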

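One of the problems is that my data is very inconsistent and dirty. I could parse the whole database, scrub all the Unicode, and re-import it, but the point is that I want to avoid that. I want a one-size-fits-all collection of Perl scripts for reading WordPress databases.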

Answer 1

Dealing with Perl and UTF-8 has been a pain for me. After a long time I learned that there is no "fool proof unicode handling" in Perl... but there is one approach to Unicode handling that helps:

The Encode module.

As perlunifaq says (http://perldoc.perl.org/perlunifaq.html):

When should I decode or encode?

Whenever you're communicating text with anything that is external to your perl process, like a database, a text file, a socket, or another program. Even if the thing you're communicating with is also written in Perl.

So we do this for every UTF-8 text string that comes into our Perl process:

use Encode qw(decode encode);  # provides decode() and encode()
my $perl_str = decode('utf8', $myExt_str);

And we do this for any text string sent from Perl to anything outside our Perl process:

my $ext_str = encode('utf8',$perl_str);
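As a minimal sketch of that round trip (decode on the way in, encode on the way out), assuming a sample input I made up for illustration: the three UTF-8 bytes of U+2019, the kind of curly apostrophe the question describes.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Assumed sample input: the UTF-8 bytes of U+2019 (right single quotation mark).
my $ext_in = "\xE2\x80\x99";

# Decode at the boundary: bytes in, Perl characters out.
my $perl_str = decode('UTF-8', $ext_in);

# Encode again before the text leaves the process.
my $ext_out = encode('UTF-8', $perl_str);

# Prints: bytes in: 3, characters: 1, bytes out: 3
printf "bytes in: %d, characters: %d, bytes out: %d\n",
    length($ext_in), length($perl_str), length($ext_out);
```

The sketch uses the strict 'UTF-8' encoding name rather than the lax 'utf8' shown above; for well-formed input both give the same result.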

...

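Now, when we retrieve or send data from/to a MySQL or PostgreSQL database, that means a lot of encoding/decoding. But fear not: there is a way to tell Perl that every text string going from/to the database is UTF-8, and to tell the database that every text string should be treated as UTF-8. The only downside is that you need to be sure every text string really is UTF-8 encoded... but that's another story: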

# For MySQL:
# This requires DBD::mysql version 4 or greater
use DBI;
my $dbh = DBI->connect ('dbi:mysql:test_db',
    $username,
    $password,
    {mysql_enable_utf8 => 1}
);
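On the caveat that every string must really be UTF-8: with dirty data like the question describes, one possible approach is to fetch raw bytes and decode them yourself, falling back to a legacy encoding when strict UTF-8 decoding fails. The sketch below is only an illustration under stated assumptions: decode_lenient is a hypothetical helper name, Windows-1252 is a guess at the legacy encoding, and the handle is assumed to return undecoded bytes (that is, mysql_enable_utf8 is not set, otherwise the data would be decoded twice).

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: strict UTF-8 first, Windows-1252 as an assumed fallback.
sub decode_lenient {
    my ($bytes) = @_;
    return if !defined $bytes;

    # FB_CROAK makes decode() die on malformed input; LEAVE_SRC keeps $bytes intact.
    my $text = eval {
        decode('UTF-8', $bytes, Encode::FB_CROAK() | Encode::LEAVE_SRC());
    };
    return $text if defined $text;

    # Not valid UTF-8: assume a legacy single-byte encoding instead.
    return decode('cp1252', $bytes);
}

my $from_utf8   = decode_lenient("\xE2\x80\x99");  # apostrophe stored as UTF-8
my $from_legacy = decode_lenient("\x92");          # same apostrophe in Windows-1252
print "same character\n" if $from_utf8 eq $from_legacy;
```

Both calls should yield U+2019. A fallback like this can guess wrong, since some byte sequences are valid in both encodings, so it is a workaround for inconsistent data rather than a substitute for cleaning it.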

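OK, now the text strings from our database are UTF-8, and the database knows all our text strings should be treated as UTF-8... but what about everything else? We need to tell Perl (and CGI) that every text string we write in our process is UTF-8 (use utf8 declares that the script's own source, including string literals, is UTF-8; the -utf8 pragma makes CGI decode incoming parameters), and tell other processes to treat our text strings as UTF-8 too: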

use utf8;
use CGI '-utf8';

my $cgi = CGI->new;
$cgi->charset('UTF-8');
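To make the effect of use utf8 concrete, here is a small sketch that compiles the same string literal with and without the pragma. It assumes the script file itself is saved as UTF-8; the variable names are only for illustration.

```perl
use strict;
use warnings;

# The same literal, compiled with and without the (lexically scoped) utf8 pragma.
my $as_chars = do { use utf8; "’" };   # one character, U+2019
my $as_bytes = do { no utf8;  "’" };   # the raw UTF-8 bytes of the source file

printf "with use utf8: %d, without: %d\n",
    length($as_chars), length($as_bytes);
```

With a UTF-8 source file this should report 1 and 3. Note that use utf8 only affects how the source code is read; it does not decode input or encode output.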

Updated!

What is a "wide character"?

This is a term used both for characters with an ordinal value greater than 127, characters with an ordinal value greater than 255, or any character occupying more than one byte, depending on the context. The Perl warning "Wide character in ..." is caused by a character with an ordinal value greater than 255.

With no specified encoding layer, Perl tries to fit things in ISO-8859-1 for backward compatibility reasons. When it can't, it emits this warning (if warnings are enabled), and outputs UTF-8 encoded data instead. To avoid this warning and to avoid having different output encodings in a single stream, always specify an encoding explicitly, for example with a PerlIO layer:

# The next line is required to avoid the "Wide character in print" warning
# AND to avoid having different output encodings in a single stream.
binmode STDOUT, ":encoding(UTF-8)";
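Putting the output side together, a minimal sketch of a response that declares UTF-8 in the header and also encodes the body as UTF-8 might look like this. It uses only core Perl and prints the header by hand; write_response is a hypothetical helper name, and the sample text is made up.

```perl
use strict;
use warnings;

# Hypothetical helper: writes a minimal CGI response to the given handle.
sub write_response {
    my ($out, $text) = @_;

    # Encode everything printed to this handle as UTF-8.
    binmode $out, ':encoding(UTF-8)';

    # The declared charset must match the bytes actually sent.
    print {$out} "Content-Type: text/html; charset=UTF-8\r\n\r\n";
    print {$out} "<p>$text</p>\n";
    return;
}

# $text is a decoded character string; \x{2019} is a curly apostrophe.
write_response(\*STDOUT, "It\x{2019}s working");
```

The point of the design is that the charset in the header and the encoding layer on the handle agree; a mismatch between the two is one way to get the broken characters described in the question.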

...

Even with all of this, you will sometimes still need encode('utf8', $perl_str). That is why I learned that there is no fool proof unicode handling in Perl. Please read perlunifaq (http://perldoc.perl.org/perlunifaq.html).

Hope this helps.
