Invalid byte sequence for encoding "UTF8": 0xed 0xa0 0xbd

Question

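I've been importing some data from MySQL into Postgres. The plan should have been simple: manually re-create the tables with equivalent data types, devise a way to output them as CSV, transfer the data over, and copy it into Postgres. Done.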

mysql -u whatever -p -D the_database

SELECT * INTO OUTFILE '/tmp/the_table.csv' FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '"' ESCAPED BY '\\' FROM the_table;

Send it over and import into Postgres:

psql -etcetc -d other_database

COPY the_table FROM '/csv/file/location/the_table.csv' WITH( FORMAT CSV, DELIMITER ',', QUOTE '"', ESCAPE '\', NULL '\N' );

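It had been so long that I'd forgotten '0000-00-00' was a thing... So first I had to come up with some way of dealing with the odd data types, preferably at the MySQL end. I wrote this script for the 20 or so tables I planned to import; it works around any incompatibilities and lists the columns accordingly: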

with a as (
    select
        'the_table'::text as tblname,
        'public'::text as schname
), b as (
    select array_to_string( array_agg( x.column_name ), ',' ) as the_cols from (
        select
            case
                when udt_name = 'timestamp'
                then 'NULLIF('|| column_name::text || ',''0000-00-00 00:00:00'')'
                when udt_name = 'date'
                then 'NULLIF('|| column_name::text || ',''0000-00-00'')'
                else column_name::text
            end as column_name
        from information_schema.columns, a
        where table_schema = a.schname
        and table_name = a.tblname
        order by ordinal_position
    ) x
)
select 'SELECT '|| b.the_cols ||' INTO OUTFILE ''/tmp/'|| a.tblname ||'.csv'' FIELDS TERMINATED BY '','' OPTIONALLY ENCLOSED BY ''"'' ESCAPED BY ''\\'' FROM '|| a.tblname ||';' from a,b;

Generate the CSV, fine. Transfer it across, OK. Once it's over...

BEGIN;
ALTER TABLE the_table SET( autovacuum_enabled = false, toast.autovacuum_enabled = false );
COPY the_table FROM '/csv/file/location/the_table.csv' WITH( FORMAT CSV, DELIMITER ',', QUOTE '"', ESCAPE '\', NULL '\N' ); -- '
ALTER TABLE the_table SET( autovacuum_enabled = true, toast.autovacuum_enabled = true );
COMMIT;

Everything was going fine until I hit this message:

ERROR:  invalid byte sequence for encoding "UTF8": 0xed 0xa0 0xbd
CONTEXT:  COPY new_table, line 12345678
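
For reference, the three bytes in the error can be decoded by hand. They follow the UTF-8 three-byte bit pattern, but the resulting code point, U+D83D, is a UTF-16 high surrogate, which strict UTF-8 does not allow; that is consistent with Postgres rejecting it. A minimal Python sketch (the helper name decode_three_byte_sequence is made up for illustration):

```python
def decode_three_byte_sequence(b: bytes) -> int:
    """Decode a 3-byte UTF-8-style sequence to a code point, with no range checks."""
    if len(b) != 3 or (b[0] & 0xF0) != 0xE0 or any((x & 0xC0) != 0x80 for x in b[1:]):
        raise ValueError("not a three-byte UTF-8 bit pattern")
    # 4 payload bits from the lead byte, 6 from each continuation byte.
    return ((b[0] & 0x0F) << 12) | ((b[1] & 0x3F) << 6) | (b[2] & 0x3F)


bad = bytes([0xED, 0xA0, 0xBD])          # the bytes from the error message
code_point = decode_three_byte_sequence(bad)
is_surrogate = 0xD800 <= code_point <= 0xDFFF

# Python's strict UTF-8 decoder refuses the same bytes.
try:
    bad.decode("utf-8")
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False

print(hex(code_point), is_surrogate, strict_ok)
```

High surrogates in this range are the first half of the UTF-16 pairs used for many emoji, so one plausible (unconfirmed) source is 4-byte characters that were stored as two 3-byte halves in a MySQL utf8 column, which holds at most 3 bytes per character.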

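A second table hit the same error, but all the others imported successfully. All of the columns and tables in the MySQL database are set to utf8, and the first offending table, which contains messages, was along the lines of: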

CREATE TABLE whatever(
col1 int(11) NOT NULL AUTO_INCREMENT,
col2 date,
col3 int(11),
col4 int(11),
col5 int(11),
col6 int(11),
col7 varchar(64),
PRIMARY KEY(col1)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;

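So presumably the data should be UTF-8... right? To make sure there were no major mistakes, I edited my.cnf so that everything I could think of had the encoding set: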

[character sets]
default-character-set=utf8
default-character-set=utf8
character-set-server = utf8
collation-server = utf8_unicode_ci
init-connect='SET NAMES utf8'

To force a conversion, I also changed the CASE expression in my initial "query generating query" to convert the columns:

        case
            when udt_name = 'timestamp'
            then 'NULLIF('|| column_name::text || ',''0000-00-00 00:00:00'')'
            when udt_name = 'date'
            then 'NULLIF('|| column_name::text || ',''0000-00-00'')'
            when udt_name = 'text'
            then 'CONVERT('|| column_name::text || ' USING utf8)'
            else column_name::text
        end as column_name

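Still no luck. After googling "0xed 0xa0 0xbd" I'm still none the wiser; character sets are not my thing. I even opened the 3 GB CSV file at the line it mentions and nothing seemed out of place, and looking with a hex editor I couldn't see those byte values (edit: maybe I didn't look hard enough), so I'm starting to run out of ideas. Am I missing something really simple and, more worryingly, could some of the other tables have been corrupted "silently" as well?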
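
Rather than hunting in a hex editor, the offending rows can be located by trying to decode each line strictly. The following is a rough Python sketch, not something from the original post; find_invalid_utf8_lines and the sample file are made up for illustration, and it assumes one record per physical line (a quoted field containing a newline would shift the line numbers).

```python
import os
import tempfile


def find_invalid_utf8_lines(path, limit=10):
    """Return up to `limit` tuples of (line_number, byte_offset_in_line, bad_bytes)."""
    hits = []
    with open(path, "rb") as fh:  # binary mode: no implicit decoding
        for lineno, raw in enumerate(fh, start=1):
            try:
                raw.decode("utf-8")
            except UnicodeDecodeError as exc:
                # exc.start is the index of the first byte that failed to decode.
                hits.append((lineno, exc.start, raw[exc.start:exc.start + 3]))
                if len(hits) >= limit:
                    break
    return hits


# Tiny stand-in for the real CSV: line 2 contains surrogate-style bytes.
sample = b'1,"fine"\n2,"hi \xed\xa0\xbd\xed\xb8\x80"\n3,"also fine"\n'
with tempfile.NamedTemporaryFile(delete=False, suffix=".csv") as tmp:
    tmp.write(sample)
    sample_path = tmp.name

hits = find_invalid_utf8_lines(sample_path)
os.remove(sample_path)
print(hits)
```

Running the same function over each exported file before COPY would also show whether the tables that imported without error are clean, or merely happened not to contain such bytes.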

The MySQL version is 5.5.44 on Ubuntu 14.04, and Postgres is 9.4.

Answer 1

With nothing further to try, I went with the simplest solution and just changed the file:

iconv -f utf-8 -t utf-8 -c the_file.csv > the_file_iconv.csv

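The new file and the original differ by about 100 bytes, so there must have been invalid bytes in there that I couldn't see. The files imported "properly", so I suppose that's good, but it would be nice to know whether there is some way to enforce the correct encoding when the file is created, before importing it.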
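
Note that iconv -c discards whatever it cannot convert, so the affected characters are lost. The bytes 0xed 0xa0 0xbd are the three-byte encoding of U+D83D, a UTF-16 high surrogate. If the dropped bytes were surrogate pairs written as two 3-byte halves (CESU-8 style), which the post does not confirm, an alternative is to re-encode them as proper 4-byte UTF-8 instead. A hedged Python sketch, with repair_cesu8 as a hypothetical helper name:

```python
def repair_cesu8(raw: bytes) -> bytes:
    """Re-encode surrogate pairs stored as two 3-byte sequences into valid UTF-8.

    Raises UnicodeDecodeError on an unpaired surrogate instead of guessing.
    """
    # "surrogatepass" lets the surrogate halves through as lone code points.
    text = raw.decode("utf-8", errors="surrogatepass")
    # A UTF-16 round trip joins each high/low pair into one real character.
    merged = text.encode("utf-16-le", errors="surrogatepass").decode("utf-16-le")
    return merged.encode("utf-8")


broken = b'hi \xed\xa0\xbd\xed\xb8\x80'   # U+D83D U+DE00 as two 3-byte halves
fixed = repair_cesu8(broken)
print(fixed)

# Already-valid input passes through unchanged.
unchanged = repair_cesu8("plain text".encode("utf-8"))

# An unpaired half is reported rather than silently dropped.
try:
    repair_cesu8(b"\xed\xa0\xbd")
    lone_rejected = False
except UnicodeDecodeError:
    lone_rejected = True
```

For a multi-gigabyte export this could be applied line by line to a file opened in binary mode. Whether repairing or dropping is the better choice depends on whether those characters matter in the target database.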

mysql

postgresql

utf-8

mysql-5.5

postgresql-9.4