PHP/MySQL: Fixing utf8 text corrupted by implicit mysqli::set_charset('latin1') connection

Question

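For years, my PHP application has been connecting to MySQL using the default latin1 character set. Even though some of my fields are collated as utf8_general_ci, the actual data stored in them is in a mangled character set. For example:

Input: ♠ »

Stored as: â™ Â»

Now, when that data is retrieved over the same latin1 connection and displayed on a page whose encoding is set to utf8, it displays exactly as it was entered: ♠ ». I'm not 100% sure why, but I'm guessing that whatever charset conversion mangled the text on the way in reverses itself on the way out.

I want to fix my data. If I switch my connection charset using mysqli::set_charset('utf8'), the output is displayed the way it is stored, i.e. â™ Â»

So, apparently I need to fix my existing data first and then switch my connection charset.

How can I fix the existing bastardized data?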

EDIT:

I've discovered a way to emulate the corruption process that is happening in a MySQL query: SELECT CAST(BINARY '♠ »' AS CHAR CHARACTER SET latin1) outputs â™ Â»

Perhaps if I could figure out how to perform the reverse function I could use that query to fix the existing data.
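A minimal sketch of the same emulation written with a hex literal, so that the result does not depend on the client's connection charset. It assumes MySQL's latin1 (which follows cp1252 and maps byte 0x99 to ™); E299A0 is the UTF-8 encoding of ♠.

```sql
-- Label the three UTF-8 bytes of ♠ as latin1 text,
-- then transcode that latin1 text to utf8, as the latin1 connection did on insert.
SELECT HEX(CONVERT(CONVERT(X'E299A0' USING latin1) USING utf8)) AS stored_bytes;
-- expected: C3A2E284A2C2A0  (â = C3A2, ™ = E284A2, no-break space = C2A0)
```

One character that should occupy three bytes ends up stored as three characters occupying seven bytes, which is the pattern to look for in the existing data.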

EDIT 2:

I've discovered such a function: SELECT CAST(BINARY CAST('â™ Â»' AS CHAR CHARACTER SET latin1) AS CHAR CHARACTER SET utf8) outputs ♠ »

My only concern now is what this will do to any data that already happens to be actual utf8 data, which, for some reason, I do have in my database. For example, SELECT CAST(BINARY CAST('♠ »' AS CHAR CHARACTER SET latin1) AS CHAR CHARACTER SET utf8) outputs (nothing)
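One way to gauge that risk before changing anything is a read-only dry run that shows each stored value next to its candidate conversion. This is only a sketch: mytable and mycolumn are placeholder names, and the WHERE clause assumes the column is declared as utf8, so that byte length and character length differ only for rows containing non-ASCII characters.

```sql
-- Read-only preview: nothing is modified.
SELECT mycolumn      AS stored,
       HEX(mycolumn) AS stored_hex,
       -- Same expression as in EDIT 2, applied to the column instead of a literal.
       CONVERT(BINARY CONVERT(mycolumn USING latin1) USING utf8) AS candidate
FROM mytable
-- Skip pure-ASCII rows; they are identical in latin1 and utf8.
WHERE LENGTH(mycolumn) <> CHAR_LENGTH(mycolumn)
LIMIT 50;
```

Rows whose candidate looks right are the corrupted ones. Rows whose candidate comes back empty, NULL, or peppered with ? are probably genuine utf8 already and should be left alone.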

Answer 1

From http://jonisalonen.com/2012/fixing-doubly-utf-8-encoded-text-in-mysql/:

A function that auto-detects possibly corrupted latin1 text data and converts it to utf8:

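DELIMITER $$

-- Returns the repaired text when str looks double-encoded; otherwise returns str unchanged.
CREATE FUNCTION maybe_utf8_decode(str text charset utf8)
RETURNS text CHARSET utf8 DETERMINISTIC
BEGIN
declare str_converted text charset utf8;
-- Remember the session's max_error_count so it can be restored below.
declare max_error_count int default @@max_error_count;
-- With max_error_count = 0, warnings are still counted but their messages are not stored.
set @@max_error_count = 0;
-- Take the latin1 bytes of str and reinterpret them as utf8.
set str_converted = convert(binary convert(str using latin1) using utf8);
set @@max_error_count = max_error_count;
-- A warning means the round trip was not clean, so keep the original value.
if @@warning_count > 0 then
    return str;
else
    return str_converted;
end if;
END$$

DELIMITER ;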

Usage:

update mytable set mycolumn = maybe_utf8_decode(mycolumn);
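Because the UPDATE rewrites the column in place, it may be worth taking a copy and previewing the changes first. A hedged sketch, assuming the function above has been created; mytable_backup is a hypothetical name, and mytable and mycolumn are the placeholders from the usage line.

```sql
-- Keep a copy of the table so the UPDATE can be undone.
CREATE TABLE mytable_backup LIKE mytable;
INSERT INTO mytable_backup SELECT * FROM mytable;

-- Preview only the rows the function would actually change.
-- BINARY forces a byte-wise comparison, so the collation cannot hide differences.
SELECT mycolumn                    AS before_fix,
       maybe_utf8_decode(mycolumn) AS after_fix
FROM mytable
WHERE BINARY mycolumn <> BINARY maybe_utf8_decode(mycolumn)
LIMIT 50;
```

If the preview looks right, run the UPDATE, and only then switch the application's connection to utf8.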

Answer 2

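Before trying to "fix" the data, make sure you know what you actually have. Run SELECT col, HEX(col) ... and look at the bytes: ♠ may be stored as 3 bytes, E299A0, or as more, C3A2 E284A2 C2A0. The former is Mojibake; the latter is "double encoding". The two are repaired differently.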
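The inspection could look something like the following sketch. mytable and mycolumn are placeholder names; the first two statements are included on the assumption that the declared column charset and the connection charset matter for telling the two cases apart.

```sql
-- How the column is declared (look at its CHARACTER SET / COLLATE clause).
SHOW CREATE TABLE mytable;

-- What the current connection and server are using.
SHOW VARIABLES LIKE 'character_set%';

-- The bytes that are really stored, regardless of how the client renders them.
SELECT mycolumn,
       HEX(mycolumn)         AS stored_hex,
       LENGTH(mycolumn)      AS byte_len,
       CHAR_LENGTH(mycolumn) AS char_len
FROM mytable
LIMIT 20;
```

Compare stored_hex for a row known to contain ♠ against the two byte patterns above to see which case applies.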

Tags: php, mysql, utf-8, iso-8859-1, character-encoding