How to catch a UnicodeDecodeError caused by an invalid continuation byte in MySQL data

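I am moving tens of millions of rows of text data from MySQL to a search engine, and I cannot successfully handle a Unicode error in one of the retrieved strings. I have tried explicitly encoding and decoding the retrieved strings so that Python throws the Unicode exception and I can see where the problem lies.

The exception is thrown after tens of millions of rows have been processed on my laptop (sigh...), but I cannot catch it, skip that row, and carry on, which is what I want to do. All text in the MySQL database should be UTF-8.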

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xed in position 143: invalid continuation byte

Here is the connection I make using MySQL Connector/Python:

cnx = mysql.connector.connect(user='root', password='<redacted>',
                          host='127.0.0.1',
                          database='bloggz',
                          charset='utf-8') 

Here are the database character settings:

mysql> SHOW VARIABLES WHERE Variable_name LIKE 'character\_set\_%' OR 
Variable_name LIKE 'collation%';

+--------------------------+-----------------+
| Variable_name            | Value           |
+--------------------------+-----------------+
| character_set_client     | utf8            |
| character_set_connection | utf8            |
| character_set_database   | utf8            |
| character_set_filesystem | binary          |
| character_set_results    | utf8            |
| character_set_server     | utf8            |
| character_set_system     | utf8            |
| collation_connection     | utf8_general_ci |
| collation_database       | utf8_general_ci |
| collation_server         | utf8_general_ci |
+--------------------------+-----------------+

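What is wrong with my exception handling below? Note that the variable "last_feeds_id" is not printed either, but that is probably just further evidence that the except clauses are not being triggered.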

last_feeds_id = 0
for feedsid, ts, url, bid, title, html in cursor:

  try:
    # to catch UnicodeErrors and see where the problem lies
    # from: https://mail.python.org/pipermail/python-list/2012-July/627441.html
    # also see 

    # feeds.URL is varchar(255) in mysql
    enc_url = url.encode(encoding = 'UTF-8',errors = 'strict')
    dec_url = enc_url.decode(encoding = 'UTF-8',errors = 'strict')

    # texts.title is varchar(600) in mysql
    enc_title = title.encode(encoding = 'UTF-8',errors = 'strict')
    dec_title = enc_title.decode(encoding = 'UTF-8',errors = 'strict')

    # texts.html is text in mysql
    enc_html = html.encode(encoding = 'UTF-8',errors = 'strict')
    dec_html = enc_html.decode(encoding = 'UTF-8',errors = 'strict')

    data = {"timestamp":ts,
            "url":dec_url,
           "bid":bid,
           "title":dec_title,
           "html":dec_html}
    es.index(index="blogposts",
            doc_type="blogpost",
            body=data)
  except UnicodeDecodeError as e:
    print("Last feeds id: {}".format(last_feeds_id))
    print(e)

  except UnicodeEncodeError as e:
    print("Last feeds id: {}".format(last_feeds_id))
    print(e)

  except UnicodeError as e:
    print("Last feeds id: {}".format(last_feeds_id))
    print(e)

It is complaining about hex ED. Are you expecting an acute i (í)? If so, then your text is not encoded as UTF-8 but as one of cp1250, dec8, latin1, latin2, or latin5.
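
To make the hex ED point concrete, here is a minimal sketch. The sample string is made up for illustration and is not taken from the asker's data. It shows that a lone 0xED byte followed by an ASCII character produces the same "invalid continuation byte" error under UTF-8, while the same bytes decode cleanly to í under latin1 or cp1250.

```python
# Hypothetical sample text, encoded as latin1 so that í becomes the single byte 0xED.
raw = "cafí con leche".encode("latin-1")

try:
    raw.decode("utf-8")
    utf8_error = None
except UnicodeDecodeError as exc:
    # In UTF-8, 0xED starts a 3-byte sequence, so the following space (0x20)
    # is rejected as an invalid continuation byte.
    utf8_error = exc
    print(exc)

# The same bytes are valid in the single-byte encodings mentioned above.
as_latin1 = raw.decode("latin-1")
as_cp1250 = raw.decode("cp1250")
print(as_latin1, as_cp1250)
```

If the offending row looks right when decoded this way, that suggests (but does not prove) that the row was stored in a single-byte encoding rather than UTF-8.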

Does your Python source code start with the following line?

# -*- coding: utf-8 -*-

more Python-utf8 tips

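Also, review "Best Practice".

You have charset='utf-8'; I am not sure, but perhaps it should be charset='utf8' (reference). UTF-8 is what the rest of the world calls the character set. MySQL calls its 3-byte subset utf8. Note the absence of a dash.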
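
As for why the except clauses in the question never fire: encoding a str to UTF-8 and decoding it straight back cannot raise UnicodeDecodeError, so the error is most likely raised outside the try block, while the connector decodes a row during `for ... in cursor`. The sketch below is one possible way around that, not the connector's own recipe. It assumes the text columns arrive as undecoded bytes (with MySQL Connector/Python that may be possible through a raw cursor, which you should verify against the documentation) and decodes them inside the try. The names `safe_rows`, `raw_rows`, and `skipped` are hypothetical.

```python
def safe_rows(raw_rows, skipped, encoding="utf-8"):
    """Yield (row_id, text) for rows that decode cleanly; record the rest in `skipped`."""
    for row_id, raw_text in raw_rows:
        try:
            # The decode now happens inside the try, so a bad row can be caught.
            text = bytes(raw_text).decode(encoding, errors="strict")
        except UnicodeDecodeError as exc:
            print("Skipping row {}: {}".format(row_id, exc))
            skipped.append(row_id)
            continue
        yield row_id, text


# Hypothetical stand-in for rows fetched as raw bytes from the database.
raw_rows = [
    (1, b"ok title"),
    (2, b"caf\xed con leche"),    # latin1 bytes, invalid as UTF-8
    (3, "café".encode("utf-8")),  # valid UTF-8
]

skipped = []
decoded = list(safe_rows(raw_rows, skipped))
print(decoded, skipped)
```

Recording the skipped ids keeps the long-running job going while leaving a list of rows to inspect afterwards.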