Why does Python string concatenation work with Russian text but string.format() does not

Question

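I am trying to parse (and escape) the rows of a CSV file that is stored in the Windows-1251 character encoding. Using this excellent answer to deal with the encoding, I ended up with this one line to test the output, and for some reason it works: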

print(row[0]+','+row[1])

Output:

Тяжелый Уборщик Обязанности,1 литр

While this line does not work:

print("{0},{1}".format(*row))

It outputs this error:

Name,Variant

Traceback (most recent call last):
  File "Russian.py", line 26, in <module>
    print("{0},{1}".format(*row))
UnicodeEncodeError: 'ascii' codec can't encode characters in position 2-3: ordinal not in range(128)

Here are the first two lines of the CSV file:

Name,Variant
Тяжелый Уборщик Обязанности,1 литр

In case it helps, here is the full source of Russian.py:

import csv
import cgi
from chardet.universaldetector import UniversalDetector
chardet_detector = UniversalDetector()

def charset_detect(f, chunk_size=4096):
    global chardet_detector
    chardet_detector.reset()
    while 1:
        chunk = f.read(chunk_size)
        if not chunk: break
        chardet_detector.feed(chunk)
        if chardet_detector.done: break
    chardet_detector.close()
    return chardet_detector.result

with open('Russian.csv') as csv_file:
    cd_result = charset_detect(csv_file)
    encoding = cd_result['encoding']
    csv_file.seek(0)
    csv_reader = csv.reader(csv_file)
    for bytes_row in csv_reader:
        row = [x.decode(encoding) for x in bytes_row]
        if len(row) >= 6:
            #print(row[0]+','+row[1])
            print("{0},{1}".format(*row))

Answer 1

The strings in your list are probably already unicode, so concatenating them causes no problem:

print(row[0]+','+row[1])
Тяжелый Уборщик Обязанности,1 литр

But here we are trying to insert unicode values into a plain str template! That is why you get the UnicodeEncodeError:

print("{0},{1}".format(*row))

So just change it to:

print(u"{0},{1}".format(*row))

Answer 2

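The + operator works fine between a unicode string and a str string. str.format, on the other hand, cannot handle non-ASCII unicode arguments: it tries to encode each one to str with the default ASCII codec.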

So you can simply replace the offending line with the following:

print(u"{0},{1}".format(*row))

That should do the trick.
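
As a rough illustration of the difference described above, the sketch below spells out the two conversions that Python 2 performs implicitly. The explicit decode and encode calls are stand-ins for the interpreter's hidden behaviour, and the variable names are made up for this example. Written this way, the snippet should behave the same on Python 2 and Python 3.

```python
# -*- coding: utf-8 -*-
# Illustrative values taken from the CSV row in the question.
name = u"Тяжелый Уборщик Обязанности"
variant = u"1 литр"

# Concatenation: Python 2 decodes the str operand with the ASCII codec.
# b"," is pure ASCII, so the decode succeeds and the result is unicode.
separator = b",".decode("ascii")
joined = name + separator + variant

# str.format: Python 2 encodes each unicode argument with the ASCII codec
# so it fits into the str template. Cyrillic text cannot be encoded that way.
try:
    name.encode("ascii")
    format_would_fail = False
except UnicodeEncodeError:
    format_would_fail = True

# With a unicode template no encoding step is needed at all.
formatted = u"{0},{1}".format(name, variant)
```

Here joined and formatted hold the same text, while format_would_fail records that the ASCII encoding step behind str.format is where the error comes from.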

Answer 3

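You are using str.format(), which implicitly converts unicode() values to str(). It has to do this in order to interpolate the values into the str template you supplied.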

Use unicode.format() instead:

print(u"{0},{1}".format(*row))

Note the u before the format literal. unicode.format() works in the other direction: it has to decode any str input to fit the resulting Unicode output.

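Concatenation, on the other hand, can implicitly decode the str operand to produce a final unicode() object. If your ',' value contained non-ASCII bytes, that implicit decoding would fail too.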
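
A minimal sketch of that failure case, with the implicit ASCII decode written out explicitly so the snippet runs on Python 2 and Python 3 alike. The separator values are hypothetical and chosen only for illustration.

```python
# -*- coding: utf-8 -*-
name = u"Тяжелый Уборщик Обязанности"

# A pure-ASCII byte string decodes cleanly, so concatenation succeeds.
ascii_separator = b","
joined = name + ascii_separator.decode("ascii")

# A byte string holding Windows-1251 encoded Cyrillic is not valid ASCII,
# so the same decode step raises UnicodeDecodeError.
non_ascii_separator = u" и ".encode("cp1251")
try:
    name + non_ascii_separator.decode("ascii")
    concat_failed = False
except UnicodeDecodeError:
    concat_failed = True
```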

Moral of the story: use Unicode string literals throughout your code when handling text.
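
One way to follow that advice is to decode the bytes once at the input boundary and handle only text from then on. The sketch below assumes Python 3, where plain string literals are already Unicode and the csv module reads text. The raw_bytes payload and the read_rows helper are hypothetical stand-ins for the contents of Russian.csv and for the reading code, not part of the original script.

```python
import csv
import io

# Stand-in for the raw bytes of a Windows-1251 encoded CSV file (assumption).
raw_bytes = (
    "Name,Variant\r\n"
    "Тяжелый Уборщик Обязанности,1 литр\r\n"
).encode("cp1251")


def read_rows(data, encoding):
    """Decode the whole payload once, then parse it as text."""
    text = data.decode(encoding)
    # newline="" leaves line endings untouched for the csv module to handle.
    return list(csv.reader(io.StringIO(text, newline="")))


rows = read_rows(raw_bytes, "cp1251")

# Every value is text from here on, so formatting needs no implicit conversion.
lines = ["{0},{1}".format(*row) for row in rows]
```

In a real script the encoding would come from whatever detection step you already use, such as the chardet result in the question.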

python

csv

character-encoding

windows-1251