SQLAlchemy: dealing with CP-1252 data when Python is expecting it to be UTF-8

Question

I am working with an existing SQLite database and running into errors because the data is encoded in CP-1252 while Python expects it to be UTF-8.

>>> import sqlite3
>>> conn = sqlite3.connect('dnd.sqlite')
>>> curs = conn.cursor()
>>> result = curs.execute("SELECT * FROM dnd_characterclass WHERE id=802")
Traceback (most recent call last):
  File "<input>", line 1, in <module>
OperationalError: Could not decode to UTF-8 column 'short_description_html'
with text ' <p>Over a dozen deities have worshipers who are paladins, 
promoting law and good across Faer�n, but it is the Weave itself that

The offending character is 0xFB, which decodes to û in CP-1252. Other offending text includes “?nd and slay illithids.”, which uses the "smart quotes" 0x93 and 0x94.
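To make the byte values concrete, here is a small sketch of how those three bytes behave under each codec. The sample byte string is made up for illustration; only the byte values 0xFB, 0x93 and 0x94 come from the question.

```python
# Raw bytes as they might sit in the database (CP-1252 encoded, illustrative sample).
raw = b"Faer\xfbn \x93smart quotes\x94"

# Decoding with the matching codec recovers the intended characters.
decoded = raw.decode("cp1252")
print(decoded)

# Decoding the same bytes as UTF-8 fails: 0xFB is never a valid UTF-8 start byte.
try:
    raw.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False
print(utf8_ok)
```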

"SQLite, python, unicode, and non-utf data" details how to solve this problem when using sqlite3 on its own.
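For context, one common way to handle this with sqlite3 alone is the connection's text_factory hook, which receives the raw bytes of every TEXT value. This is a minimal sketch, assuming every TEXT value really is CP-1252; the in-memory database and the demo table are stand-ins, not the real dnd.sqlite.

```python
import sqlite3

SOURCE_ENCODING = "cp1252"  # assumption: every TEXT value uses this encoding


def decode_text(raw_bytes):
    """Decode the raw bytes of a TEXT value with the assumed source encoding."""
    return raw_bytes.decode(SOURCE_ENCODING)


conn = sqlite3.connect(":memory:")  # stand-in for sqlite3.connect('dnd.sqlite')
conn.text_factory = decode_text     # called with bytes for every TEXT value read

# Hypothetical table; CAST(blob AS TEXT) stores the bytes unchanged as TEXT,
# which reproduces non-UTF-8 data in a TEXT column.
conn.execute("CREATE TABLE demo (id INTEGER PRIMARY KEY, description TEXT)")
conn.execute("INSERT INTO demo VALUES (1, CAST(? AS TEXT))", (b"Faer\xfbn",))

row = conn.execute("SELECT description FROM demo WHERE id = 1").fetchone()
print(row[0])
```

This changes only how values are read; the bytes stored in the file stay CP-1252.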

However, I am using SQLAlchemy. How can I deal with CP-1252 encoded data in an SQLite database when I am using SQLAlchemy?

Edit:

This would also apply to any other funny encodings in an SQLite TEXT field, such as latin-1, cp437, and so on.

Answer 1

If you have a connection URI, you can add the following options to your DB connection URI:

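from urllib.parse import urlencode  # Python 3; on Python 2 this was urllib.urlencode

# URI template; the driver options are appended as a query string.
DB_CONNECTION = "mysql+pymysql://{username}:{password}@{host}/{db_name}?{options}"
DB_OPTIONS = {
    "charset": "cp-1252",
    "use_unicode": 1,
}
connection_uri = DB_CONNECTION.format(
    username=???,
    ...,
    options=urlencode(DB_OPTIONS)
)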

Assuming your SQLite driver can handle those options (pymysql can, but I don't know 100% about SQLite), your queries will return unicode strings.
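As a runnable illustration of the URI construction, here is a minimal sketch using Python 3's urllib.parse. The credentials, host and database name are made-up placeholders, and the charset value is copied from this answer as-is; whether a given driver accepts it is not verified here.

```python
from urllib.parse import urlencode

DB_CONNECTION = "mysql+pymysql://{username}:{password}@{host}/{db_name}?{options}"
DB_OPTIONS = {"charset": "cp-1252", "use_unicode": 1}

# All four values below are hypothetical placeholders, not real settings.
connection_uri = DB_CONNECTION.format(
    username="demo_user",
    password="demo_pass",
    host="localhost",
    db_name="dnd",
    options=urlencode(DB_OPTIONS),  # dict -> "key=value&key=value"
)
print(connection_uri)
```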

Answer 2

SQLAlchemy and SQLite are behaving correctly. The solution is to fix the non-UTF-8 data in the database.

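I wrote the script below. It:

loads the target SQLite database
lists all columns in all tables
if a column is a text, char, or clob type (including variants such as varchar and longtext), re-encodes its data from INPUT_ENCODING to UTF-8.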

import sqlite3

INPUT_ENCODING = 'cp1252'  # The encoding you want to convert from

def fix_encoding(value):
    # CAST(... AS BLOB) hands us raw bytes (or None for NULL); decode them to str.
    return None if value is None else bytes(value).decode(INPUT_ENCODING)

db = sqlite3.connect('dnd_fixed.sqlite')
db.create_function('FIXENCODING', 1, fix_encoding)
cur = db.cursor()
tables = cur.execute("SELECT name FROM sqlite_master WHERE type='table'").fetchall()
tables = [t[0] for t in tables]
for table in tables:
    # Note: pragma arguments can't be parameterized.
    columns = cur.execute('PRAGMA table_info(%s)' % table).fetchall()
    for column_id, column_name, column_type, nullable, default_value, primary_key in columns:
        column_type = column_type.lower()  # declared types may be upper-case, e.g. VARCHAR
        if ('char' in column_type) or ('text' in column_type) or ('clob' in column_type):
            # Table names and column names can't be parameterized either.
            db.execute('UPDATE "{0}" SET "{1}" = FIXENCODING(CAST("{1}" AS BLOB))'.format(table, column_name))
db.commit()  # persist the re-encoded rows
db.close()

After this script runs, all *text*, *char*, and *clob* fields are in UTF-8 and no more Unicode decoding errors occur. I can now Faerûn to my heart's content.
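One way to check that claim is to scan every text-like column for values that are not valid UTF-8. The sketch below is not part of the original script: the helper name find_non_utf8_columns and the in-memory demo table are hypothetical, and it assumes table and column names contain no double quotes.

```python
import sqlite3


def find_non_utf8_columns(db):
    """Return (table, column) pairs whose text-like columns hold invalid UTF-8."""
    bad = []
    tables = [r[0] for r in db.execute("SELECT name FROM sqlite_master WHERE type='table'")]
    for table in tables:
        # Pragma arguments can't be parameterized, so the name is interpolated.
        columns = db.execute('PRAGMA table_info(%s)' % table).fetchall()
        for _cid, name, col_type, *_rest in columns:
            if not any(kind in col_type.lower() for kind in ("char", "text", "clob")):
                continue
            query = 'SELECT CAST("{0}" AS BLOB) FROM "{1}" WHERE "{0}" IS NOT NULL'.format(name, table)
            for (raw,) in db.execute(query):
                try:
                    raw.decode("utf-8")  # raw is bytes because of the BLOB cast
                except UnicodeDecodeError:
                    bad.append((table, name))
                    break  # one bad value is enough to flag the column
    return bad


# Demo on a throwaway in-memory database (stand-in for dnd_fixed.sqlite).
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE demo (id INTEGER PRIMARY KEY, description TEXT)")
db.execute("INSERT INTO demo VALUES (1, CAST(? AS TEXT))", (b"Faer\xfbn",))
before = find_non_utf8_columns(db)

# Simulate the fix by storing a proper Unicode string, which SQLite writes as UTF-8.
db.execute("UPDATE demo SET description = ? WHERE id = 1", ("Faer\u00fbn",))
after = find_non_utf8_columns(db)
print(before, after)
```

An empty result means every text-like value decodes as UTF-8. It does not prove the characters are the intended ones, since text that was already UTF-8 before the conversion would come out double-decoded.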
