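How to catch UnicodeDecodeError due to invalid continuation byte in MySQL data
I am moving tens of millions of rows of text data from MySQL to a search engine, and I cannot successfully handle a Unicode error in one of the retrieved strings. I have tried explicitly encoding and decoding the retrieved strings to make Python throw a Unicode exception so that I can see where the problem lies.
The exception is thrown after the job has run through tens of millions of rows on my laptop (sigh...), but I am unable to catch it, skip that row and carry on, which is what I want. All text in the MySQL database should be utf-8.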
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xed in position 143: invalid continuation byte
Here is my connection, made with MySQL Connector/Python:
cnx = mysql.connector.connect(user='root', password='<redacted>',
                              host='127.0.0.1',
                              database='bloggz',
                              charset='utf-8')
Here are the database character settings:
mysql> SHOW VARIABLES WHERE Variable_name LIKE 'character\_set\_%' OR
       Variable_name LIKE 'collation%';
+--------------------------+-----------------+
| Variable_name            | Value           |
+--------------------------+-----------------+
| character_set_client     | utf8            |
| character_set_connection | utf8            |
| character_set_database   | utf8            |
| character_set_filesystem | binary          |
| character_set_results    | utf8            |
| character_set_server     | utf8            |
| character_set_system     | utf8            |
| collation_connection     | utf8_general_ci |
| collation_database       | utf8_general_ci |
| collation_server         | utf8_general_ci |
+--------------------------+-----------------+
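What is wrong with my exception handling below? Note that the variable "last_feeds_id" is not printed either, but that is probably just evidence that the except clauses are not working.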
last_feeds_id = 0
for feedsid, ts, url, bid, title, html in cursor:
    try:
        # to catch UnicodeErrors and see where the problem lies
        # from: https://mail.python.org/pipermail/python-list/2012-July/627441.html
        # also see
        # feeds.URL is varchar(255) in mysql
        enc_url = url.encode(encoding='UTF-8', errors='strict')
        dec_url = enc_url.decode(encoding='UTF-8', errors='strict')
        # texts.title is varchar(600) in mysql
        enc_title = title.encode(encoding='UTF-8', errors='strict')
        dec_title = enc_title.decode(encoding='UTF-8', errors='strict')
        # texts.html is text in mysql
        enc_html = html.encode(encoding='UTF-8', errors='strict')
        dec_html = enc_html.decode(encoding='UTF-8', errors='strict')
        data = {"timestamp": ts,
                "url": dec_url,
                "bid": bid,
                "title": dec_title,
                "html": dec_html}
        es.index(index="blogposts",
                 doc_type="blogpost",
                 body=data)
    except UnicodeDecodeError as e:
        print("Last feeds id: {}".format(last_feeds_id))
        print(e)
    except UnicodeEncodeError as e:
        print("Last feeds id: {}".format(last_feeds_id))
        print(e)
    except UnicodeError as e:
        print("Last feeds id: {}".format(last_feeds_id))
        print(e)
It is complaining about hex ED. Are you expecting acute-i: í? If so, then your text is not encoded as UTF-8 but as one of cp1250, dec8, latin1, latin2, latin5.
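To make the point about hex ED concrete, here is a minimal sketch of how a latin1-encoded í produces exactly this kind of message. The sample word "día" is an assumption chosen for illustration; it does not come from the asker's data.

```python
# Hypothetical sample text containing acute-i; not taken from the original data.
text = "día"

# In latin1, í is stored as the single byte 0xED.
latin1_bytes = text.encode("latin-1")

# In UTF-8, 0xED announces a 3-byte sequence, so the byte that follows must be
# a continuation byte. Here it is a plain "a", so strict decoding fails.
reason = None
bad_byte = None
try:
    latin1_bytes.decode("utf-8")
except UnicodeDecodeError as exc:
    reason = exc.reason
    bad_byte = latin1_bytes[exc.start]

print(latin1_bytes)
print(reason, bad_byte)

# Decoding with the encoding the bytes were actually written in recovers the text.
recovered = latin1_bytes.decode("latin-1")
print(recovered)
```

If the offending row decodes to sensible text under latin1 or one of the other listed encodings, that would suggest the data was stored in that encoding rather than in UTF-8.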
Does your Python source code start with
# -*- coding: utf-8 -*-
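Also, review "Best Practice".
You have charset='utf-8'; I'm not sure, but perhaps that should be charset='utf8'. For reference: UTF-8 is what the world calls the character set. MySQL calls its 3-byte subset utf8. Note the absence of a dash.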
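On the question of why the except clauses never fire, one possible explanation (not confirmed anywhere in this thread) is that the driver decodes each row while the cursor is being iterated, so the error is raised on the `for` line, outside the `try`. The sketch below illustrates that mechanism with a hypothetical `FakeCursor` class standing in for a real cursor; `FakeCursor`, `try_inside_body` and `fetch_skipping_bad_rows` are invented names, and whether a real connector cursor can resume after such a failure is an assumption that would need checking.

```python
class FakeCursor:
    """Hypothetical stand-in for a DB cursor that decodes rows while fetching."""

    def __init__(self, raw_rows):
        self._rows = iter(raw_rows)

    def __iter__(self):
        return self

    def __next__(self):
        raw = next(self._rows)      # StopIteration ends the loop as usual
        return raw.decode("utf-8")  # a bad row raises UnicodeDecodeError here


def try_inside_body(cursor):
    """Same shape as the loop in the question: the try only guards the body."""
    seen = []
    for row in cursor:              # decoding happens on this line
        try:
            seen.append(row)
        except UnicodeDecodeError:
            seen.append(None)       # never reached for fetch-time errors
    return seen


def fetch_skipping_bad_rows(cursor):
    """Guard the fetch itself by calling next() inside the try."""
    good, errors = [], []
    rows = iter(cursor)
    while True:
        try:
            row = next(rows)
        except StopIteration:
            break
        except UnicodeDecodeError as exc:
            errors.append(exc)      # record the error and move to the next row
            continue
        good.append(row)
    return good, errors


# Made-up sample rows; the middle one holds a latin1 acute-i (0xED).
RAW_ROWS = [b"ok", b"caf\xed ", b"fin"]

try:
    try_inside_body(FakeCursor(RAW_ROWS))
    escaped = False
except UnicodeDecodeError:
    escaped = True                  # the error bypassed the inner try

good, errors = fetch_skipping_bad_rows(FakeCursor(RAW_ROWS))
print(escaped, good, len(errors))
```

Looking at which line the traceback points to would show whether this is what happens in the original script.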