How to make MySQL handle strings like SQLite does, with regard to Unicode and collation?

Question

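I have been researching this for hours on SO, in the MySQL docs, and elsewhere, but I still cannot find a satisfactory solution. The problem is:

What is the simplest way to make MySQL handle strings the way SQLite does, without any extra "smart" conversions?

For example, the following works perfectly in SQLite: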

CREATE TABLE `dummy` (`key` VARCHAR(255) NOT NULL UNIQUE);

INSERT INTO `dummy` (`key`) VALUES ('one');
INSERT INTO `dummy` (`key`) VALUES ('one ');
INSERT INTO `dummy` (`key`) VALUES ('One');
INSERT INTO `dummy` (`key`) VALUES ('öne');

SELECT * FROM `dummy`;
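
The SQLite behaviour described above can be reproduced with a minimal sketch. Using Python's built-in sqlite3 module and an in-memory database is an assumption made for illustration and is not part of the original question. SQLite compares text with its default BINARY collation, so all four values count as distinct.

```python
import sqlite3

# In-memory database, used here only for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE `dummy` (`key` VARCHAR(255) NOT NULL UNIQUE)")

values = ["one", "one ", "One", "\u00f6ne"]  # "\u00f6ne" is "öne"
for value in values:
    # Each insert succeeds: SQLite's default BINARY collation treats
    # trailing spaces, case, and accents as significant.
    conn.execute("INSERT INTO `dummy` (`key`) VALUES (?)", (value,))

rows = [row[0] for row in conn.execute("SELECT `key` FROM `dummy`")]
print(len(rows))
```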

In MySQL, however, with the following settings:

[client]
default-character-set = utf8mb4

[mysql]
default-character-set = utf8mb4

[mysqld]
character-set-client-handshake = FALSE
character-set-server = utf8mb4
collation-server = utf8mb4_bin

and the following CREATE DATABASE statement:

CREATE DATABASE `dummydb` DEFAULT CHARACTER SET utf8mb4 DEFAULT COLLATE utf8mb4_bin;

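the second INSERT still fails.

I would rather keep string column declarations as simple as possible, with SQLite's TEXT being the ideal. It looks like VARBINARY is the way to go, but I would still like to hear your opinion on any other, possibly better, options.

Addendum: the output of SHOW CREATE TABLE dummy is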

mysql> SHOW CREATE TABLE dummy;
+-------+-----------------------------------------------------
| Table | Create Table                                        
+-------+-----------------------------------------------------
| dummy | CREATE TABLE `dummy` (
  `key` varchar(255) COLLATE utf8mb4_bin NOT NULL,
  UNIQUE KEY `key` (`key`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_bin |
+-------+-----------------------------------------------------
1 row in set (0.00 sec)

Answer 1

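MySQL wants to convert strings when doing INSERT and SELECT. The conversion is between what you declare the client to have and what the column is declared to store.

The only way to avoid that is to use VARBINARY and BLOB instead of VARCHAR and TEXT.

Using COLLATION utf8mb4_bin does not avoid conversion to/from CHARACTER SET utf8mb4; it only says that WHERE and ORDER BY should compare the bits rather than apply accent and case folding.

Keep in mind that CHARACTER SET utf8mb4 is a way of encoding text; COLLATION utf8mb4_* is the set of rules for comparing text in that encoding. _bin is simpleminded.

UNIQUE involves comparing for equality, hence the COLLATION. In most utf8mb4 collations, the three values without trailing spaces (one, One, öne) compare equal. utf8mb4_bin treats the three as different. utf8mb4_hungarian_ci treats one=One>öne.

Trailing spaces are controlled by the column's data type (VARCHAR or otherwise). Recent versions even have a setting for whether trailing spaces are considered.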
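
To make the link between collation and UNIQUE concrete, here is a minimal sketch that uses SQLite (through Python's sqlite3 module) as a stand-in. It does not reproduce any real MySQL collation: the FOLDED collation and the fold helper are hypothetical names introduced only for this illustration, and the folding rule (drop combining accents, then case-fold) is a deliberately crude assumption.

```python
import sqlite3
import unicodedata


def fold(text):
    """Hypothetical folding rule: drop combining accents, then case-fold."""
    decomposed = unicodedata.normalize("NFD", text)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return stripped.casefold()


def folded_compare(a, b):
    """Collation callback: returns -1, 0, or 1 like a classic comparator."""
    fa, fb = fold(a), fold(b)
    return (fa > fb) - (fa < fb)


def count_accepted(collation):
    """Insert three values into a UNIQUE column and count the accepted ones."""
    conn = sqlite3.connect(":memory:")
    conn.create_collation("FOLDED", folded_compare)
    conn.execute(f"CREATE TABLE dummy (k TEXT COLLATE {collation} UNIQUE)")
    accepted = 0
    for value in ["one", "One", "\u00f6ne"]:  # "\u00f6ne" is "öne"
        try:
            conn.execute("INSERT INTO dummy (k) VALUES (?)", (value,))
            accepted += 1
        except sqlite3.IntegrityError:
            pass  # rejected as a duplicate under this collation
    conn.close()
    return accepted


print(count_accepted("BINARY"))  # byte-wise comparison
print(count_accepted("FOLDED"))  # accent- and case-insensitive comparison
```

Under these assumptions the binary collation accepts all three values, while the folding collation accepts only the first and rejects the other two as duplicates.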

Answer 2

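The approach shown in the question should (mostly) work fine in MySQL, for the following reasons:

Collation (not to be confused with encoding) is the set of rules that define how characters are sorted and compared, typically used to replicate user expectations at database level from a cultural point of view (if I search for cafe I expect to find café as well).
Collation plays an important role in unique constraints because it establishes the definition of unique.
Binary collations are specifically designed to ignore cultural rules and work at byte level, so utf8mb4_bin is the right choice here.
MySQL allows the combination of encoding and collation to be set with column-level granularity.
If the column definition lacks a collation, it uses the table-level one.
If the table definition lacks a collation, it uses the database-level one.
If the database definition lacks a collation, it uses the server-level one.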
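
The fallback chain above can be modelled in a few lines. This is a simplified sketch rather than MySQL code: effective_collation is a hypothetical helper, and it ignores the fact that MySQL resolves the default when an object is created and then records it in the definition (as the SHOW CREATE TABLE output in the question shows).

```python
def effective_collation(column=None, table=None, database=None,
                        server="utf8mb4_bin"):
    """Return the first collation defined, walking
    column -> table -> database -> server."""
    for level in (column, table, database, server):
        if level is not None:
            return level
    raise ValueError("no collation defined at any level")


# Nothing set below the server level: the server default applies.
print(effective_collation())
# A column-level collation overrides everything above it.
print(effective_collation(column="utf8mb4_hungarian_ci",
                          database="utf8mb4_bin"))
```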

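It is also worth noting that MySQL will transparently convert between encodings as long as:

The connection encoding is set correctly
The conversion is physically possible (e.g. all source characters also belong to the target encoding)

For this last reason, VARBINARY is possibly not the best choice for a column that still holds text, because it opens the door to getting café stored from a connection configured to use ISO-8859-1 and then not being able to retrieve it correctly from a connection configured to use UTF-8.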
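
The risk can be seen at the byte level without a database at all. The following sketch only assumes that a binary column hands back exactly the bytes it was given, which is the point of using VARBINARY in the first place.

```python
text = "caf\u00e9"  # "café"

# Bytes a client would send over a connection using ISO-8859-1.
latin1_bytes = text.encode("iso-8859-1")
# Bytes a client would send over a connection using UTF-8.
utf8_bytes = text.encode("utf-8")

print(latin1_bytes)  # b'caf\xe9'
print(utf8_bytes)    # b'caf\xc3\xa9'

# A binary column stores the bytes untouched, so a UTF-8 reader later
# receives the ISO-8859-1 bytes and cannot decode them.
try:
    latin1_bytes.decode("utf-8")
    readable = True
except UnicodeDecodeError:
    readable = False
print(readable)
```

With a VARCHAR column and correctly declared connection encodings, MySQL would perform this conversion itself; with a binary column the application has to keep track of which encoding each value was written in.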

Side note: the table definition shown may trigger the following error:

ERROR 1071 (42000): Specified key was too long; max key length is 767 bytes

The maximum size of an index may be relatively small. From the docs:

If innodb_large_prefix is enabled (the default), the index key prefix limit is 3072 bytes for InnoDB tables that use DYNAMIC or COMPRESSED row format. If innodb_large_prefix is disabled, the index key prefix limit is 767 bytes for tables of any row format.

innodb_large_prefix is deprecated and will be removed in a future release. innodb_large_prefix was introduced in MySQL 5.5 to disable large index key prefixes for compatibility with earlier versions of InnoDB that do not support large index key prefixes.

The index key prefix length limit is 767 bytes for InnoDB tables that use the REDUNDANT or COMPACT row format. For example, you might hit this limit with a column prefix index of more than 255 characters on a TEXT or VARCHAR column, assuming a utf8mb3 character set and the maximum of 3 bytes for each character.

Attempting to use an index key prefix length that exceeds the limit returns an error. To avoid such errors in replication configurations, avoid enabling innodb_large_prefix on the master if it cannot also be enabled on slaves.

Since utf8mb4 allocates up to 4 bytes per character, the 767-byte limit is exceeded with only 192 characters (192 × 4 = 768 bytes).
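
The arithmetic behind the limit is short enough to spell out. In this sketch max_indexable_chars is a hypothetical helper name; the 767-byte figure comes from the error message and documentation quoted above.

```python
INDEX_PREFIX_LIMIT = 767  # bytes, as reported in ERROR 1071


def max_indexable_chars(limit_bytes, max_bytes_per_char):
    """Largest character count whose worst-case size fits in the limit."""
    return limit_bytes // max_bytes_per_char


print(max_indexable_chars(INDEX_PREFIX_LIMIT, 4))  # utf8mb4: 191
print(max_indexable_chars(INDEX_PREFIX_LIMIT, 3))  # utf8mb3: 255
print(192 * 4 > INDEX_PREFIX_LIMIT)                # True: 768 bytes is too long
```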

We still have an issue:

mysql> CREATE TABLE `dummy` (
    -> `key` varchar(191) COLLATE utf8mb4_bin NOT NULL,
    -> UNIQUE KEY `key` (`key`)
    -> )
    -> ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_bin;
Query OK, 0 rows affected (0.01 sec)

mysql> INSERT INTO `dummy` (`key`) VALUES ('one');
Query OK, 1 row affected (0.00 sec)

mysql> INSERT INTO `dummy` (`key`) VALUES ('one ');
ERROR 1062 (23000): Duplicate entry 'one ' for key 'key'

Pardon?

mysql> INSERT INTO `dummy` (`key`) VALUES ('One');
Query OK, 1 row affected (0.00 sec)

mysql> INSERT INTO `dummy` (`key`) VALUES ('öne');
Query OK, 1 row affected (0.00 sec)

mysql> SELECT * FROM `dummy`;
+-----+
| key |
+-----+
| One |
| one |
| öne |
+-----+
3 rows in set (0.00 sec)

This last issue is an interesting subtlety of MySQL collations. From the docs:

All MySQL collations are of type PADSPACE. This means that all CHAR, VARCHAR, and TEXT values in MySQL are compared without regard to any trailing spaces. “Comparison” in this context does not include the LIKE pattern-matching operator, for which trailing spaces are significant

[...] For those cases where trailing pad characters are stripped or comparisons ignore them, if a column has an index that requires unique values, inserting into the column values that differ only in number of trailing pad characters will result in a duplicate-key error.

I dare say that the VARBINARY type is the only way to overcome this...
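
The effect of ignoring trailing pad characters on a unique index can be imitated in SQLite, whose built-in RTRIM collation also ignores trailing spaces. This is only an analogue of PAD SPACE, not MySQL's implementation, and second_insert_accepted is a hypothetical helper written for this sketch.

```python
import sqlite3


def second_insert_accepted(column_decl, first, second):
    """Insert two values into a UNIQUE column; report whether the second fits."""
    conn = sqlite3.connect(":memory:")
    conn.execute(f"CREATE TABLE dummy (k {column_decl} UNIQUE)")
    conn.execute("INSERT INTO dummy (k) VALUES (?)", (first,))
    try:
        conn.execute("INSERT INTO dummy (k) VALUES (?)", (second,))
        return True
    except sqlite3.IntegrityError:
        return False  # treated as a duplicate key
    finally:
        conn.close()


# RTRIM ignores trailing spaces, similar in effect to PAD SPACE.
print(second_insert_accepted("TEXT COLLATE RTRIM", "one", "one "))
# Raw bytes are compared byte by byte, as with a binary column type.
print(second_insert_accepted("BLOB", b"one", b"one "))
```

The text column rejects 'one ' as a duplicate of 'one', while the byte-valued column keeps both, which mirrors the reasoning for falling back to VARBINARY.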
