UTF-8 decoding when reading a MySQL column of type JSON (vs. TEXT), using Perl DBI + DBD::mysql
Here is the problem, as a working unit test. I think it is either a bug in DBI + DBD::mysql, in how it handles MySQL JSON columns, or a bug in my brain.
use strict;
use warnings;
use utf8;
use Test2::V0;
use Test2::Plugin::UTF8;
use Test2::Plugin::NoWarnings echo => 1;
use DBI;
use DBD::mysql 4.041; # 4.041+ required for utf8mb4
use JSON 4.01 qw//;
use Encode;
#
# setup $dbh, create test table
#
my $dbname = '';
my $host = 'localhost';
my $user = '';
my $pass = '';
my $dbh = DBI->connect(
"DBI:mysql:" . ($dbname || '') . ";host=" . $host,
$user,
$pass || undef,
{ RaiseError => 1, PrintError => 0, AutoCommit=> 1 }
);
$dbh->{'mysql_enable_utf8mb4'} = 1;
$dbh->{'charset'} = 'utf8';
$dbh->do(
"CREATE TABLE IF NOT EXISTS `test` ("
. "id int unsigned, "
. "`my_json` json NOT NULL, "
. "`my_text` mediumtext NOT NULL "
. ") ENGINE=InnoDB "
. "DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_unicode_ci"
);
#
# create and insert test data
#
# A use case for spam! Got this junk from spam inbox
my $utf8str = "ion?• été e😍💋🔥ulière vs• Ch• 😊⭐👉🏻🔞🍆Sho是ab 期待您x";
my $hash = { test => $utf8str };
my $json = JSON->new->encode( $hash );
my $id = time;
$dbh->do("INSERT INTO test SET id=?, my_json=?, my_text=?", undef, $id, $json, $json);
#
# retrieve test data and check it
#
my ( $my_json, $my_text ) = $dbh->selectrow_array("SELECT my_json, my_text FROM test WHERE id=$id");
is( $my_text, $json ); # ok
is( $my_json, $json ); # fails. got {"test": "ion?\nâ\N{U+80}¢ ét ....
is( decode('UTF-8', $my_json), $json ); # ok'ish. mysql adds a space between "test":"..." but value looks ok
#
# another test, independent of JSON encoder, using hand-built json
#
$id++;
$json = '{"test":"' . $utf8str . '"}';
$dbh->do("INSERT INTO test SET id=?, my_json=?, my_text=?", undef, $id, $json, $json);
( $my_json, $my_text ) = $dbh->selectrow_array("SELECT my_json, my_text FROM test WHERE id=$id");
is( $my_text, $json ); # ok
is( $my_json, $json ); # fails. got {"test": "ion?\nâ\N{U+80}¢ ét ....
printf "%vX", $my_json; # 7B.22.74.65.73.74.22.3A.20.22.69.6F.6E.3F.E2.80.A2.20.C3.A9.74.C3.A9.20.65.F0.9F.98.8D.F0.9F.92.8B.F0.9F.94.A5.75.6C.69.C3.A8.72.65.20.76.73.E2.80.A2.20.43.68.E2.80.A2.20.F0.9F.98.8A.E2.AD.90.F0.9F.91.89.F0.9F.8F.BB.F0.9F.94.9E.F0.9F.8D.86.53.68.6F.E6.98.AF.61.62.20.E6.9C.9F.E5.BE.85.E6.82.A8.78.22.7D
printf "%vX", $json; # 7B.22.74.65.73.74.22.3A.22.69.6F.6E.3F.2022.20.E9.74.E9.20.65.1F60D.1F48B.1F525.75.6C.69.E8.72.65.20.76.73.2022.20.43.68.2022.20.1F60A.2B50.1F449.1F3FB.1F51E.1F346.53.68.6F.662F.61.62.20.671F.5F85.60A8.78.22.7D
is( decode('UTF-8', $my_json), $json ); # ok'ish. mysql adds a space between "test":"..." but value looks ok
#
# cleanup
#
$dbh->do("DROP TABLE `test`");
$dbh->disconnect();
done_testing();
My understanding is that the JSON standard requires UTF-8. In addition, MySQL also requires/uses UTF-8 for JSON columns, as stated here:
MySQL handles strings used in JSON context using the utf8mb4 character
set and utf8mb4_bin collation. Strings in other character sets are
converted to utf8mb4 as necessary.
(https://dev.mysql.com/doc/refman/8.0/en/json.html)
My understanding is also that DBI handles the UTF-8 encoding/decoding and should return decoded UTF-8, as it does for the mediumtext column, as stated here:
This attribute determines whether DBD::mysql should assume strings stored
in the database are utf8. This feature defaults to off.
When set, a data retrieved from a textual column type (char, varchar,
etc) will have the UTF-8 flag turned on if necessary. This enables
character semantics on that string.
https://metacpan.org/pod/DBD::mysql#mysql_enable_utf8
However, this does not seem to apply to JSON columns. Explicit decoding appears to be required after retrieving data from a JSON column.
So which is it... a bug in DBI/DBD::mysql, or a bug in my brain?
EDIT: Good news, it's not my brain. Bad news, it appears to be a known bug. https://github.com/perl5-dbi/DBD-mysql/issues/309
So, the answer I'm seeking now is a backward-compatible workaround,
i.e., a workaround that won't break if/when DBD::mysql is fixed.
Double-decoding would not be good.
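You can try to detect whether the JSON decoding bug is present by creating a test table and inserting a non-ASCII character whose UTF-8 encoding is known to be longer than one byte. For example: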
$dbh->do("DROP TABLE IF EXISTS json_decode_test");
$dbh->do("CREATE TABLE json_decode_test (id int unsigned, `my_json` json NOT NULL)");
my $unicode_str = "是"; # This character will always have a UTF-8 encoding with
# byte length > 1
my $hash = { test_str => $unicode_str };
my $json = JSON->new;
my $json_str = $json->encode( $hash );
my $id = time;
my $attrs = undef;
$dbh->do("INSERT INTO json_decode_test SET id=?, my_json=?", $attrs, $id, $json_str);
my ( $json_str2 ) = $dbh->selectrow_array(
"SELECT my_json FROM json_decode_test WHERE id=$id");
my $hash2 = $json->decode( $json_str2 );
my $unicode_str2 = $hash2->{test_str};
# If the json unicode bug is present, $unicode_str2 will not be decoded. Instead
# it will be a string of length 3 representing the UTF-8 encoding of $unicode_str
# (printf "%vX\n", $unicode_str2) gives output : E6.98.AF
my $json_unicode_bug = (length $unicode_str2) > 1;
if ( $json_unicode_bug ) {
print "unicode bug is present..\n"; # print rather than say, which would need "use feature 'say'"
# need to run decode_utf8() on every returned json object from DBI
}
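One way to make the explicit decoding safe against a future driver fix is to decode only while Perl's UTF-8 flag is still off. The sketch below assumes that a fixed DBD::mysql would return JSON columns as flagged character strings, the way it already does for text columns; `decode_json_column` is a hypothetical helper name, not part of DBI or DBD::mysql.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: decode a value fetched from a JSON column only if it
# still looks like raw UTF-8 bytes (UTF-8 flag off). Assumption: a fixed
# driver returns non-ASCII JSON data with the flag already on.
sub decode_json_column {
    my ($value) = @_;
    return $value if !defined $value;
    return $value if utf8::is_utf8($value);   # already characters: leave alone
    return decode('UTF-8', $value);           # raw bytes: decode exactly once
}

my $raw   = "\xE6\x98\xAF";                   # UTF-8 bytes of U+662F
my $once  = decode_json_column($raw);
my $twice = decode_json_column($once);        # second call is a no-op

printf "%vX\n", $once;                        # 662F
printf "%vX\n", $twice;                       # 662F
```

Pure-ASCII values come back unchanged in either case, because decoding them is a no-op.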
Change the SQL statement that creates the test table as follows:
my $query = "
CREATE TABLE IF NOT EXISTS `test` (
`id` INT UNSIGNED,
`my_json` JSON NOT NULL,
`my_text` MEDIUMTEXT NOT NULL
) ENGINE=InnoDB DEFAULT
CHARSET=utf8mb4
COLLATE=utf8mb4_unicode_ci
";
$dbh->do($query);
F0.9F.98.8D -- UTF-8 encoding -- this is good
1F60D -- Unicode codepoint -- not useful in `MEDIUMTEXT`
Run
SELECT HEX(my_json) FROM test WHERE id = ...
to see what MySQL has actually stored; it should contain F09F988D.
You should not encode or decode any strings written to or read from MySQL.
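The two hex dumps quoted above are the same character seen at different layers. A minimal sketch, using only the core Encode module, that reproduces both forms and the string MySQL's HEX() would show for U+1F60D (the variable names are illustrative):

```perl
use strict;
use warnings;
use Encode qw(encode);

my $chars = "\x{1F60D}";                 # one character (decoded string)
my $bytes = encode('UTF-8', $chars);     # four bytes (encoded string)

my $as_codepoints = sprintf '%vX', $chars;      # 1F60D
my $as_utf8_bytes = sprintf '%vX', $bytes;      # F0.9F.98.8D
my $as_mysql_hex  = uc unpack 'H*', $bytes;     # F09F988D

print "$as_codepoints\n$as_utf8_bytes\n$as_mysql_hex\n";
```

A `%vX` dump with values above FF therefore means Perl holds decoded characters, while a dump made only of two-digit values is either ASCII or still-encoded bytes.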