Struggling with unicode in Python

Question

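I'm trying to automate extracting data from a large number of files, and it works for the most part. It just falls over as soon as it encounters non-ASCII characters: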

UnicodeDecodeError: 'ascii' codec can't decode byte 0xc5 in position 5: ordinal not in range(128)

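How do I set 'brand' to UTF-8? My code is being repurposed from something else (which used lxml), and that had no problems. I've seen a lot of discussion about encode/decode, but I don't understand how I'm supposed to implement it. The below is cut down to just the relevant code - I've removed the rest.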

i = 0

filenames = [y for x in os.walk("Distributor") for y in glob(os.path.join(x[0], '*.xml'))]

for i in range (len(filenames)):
    pathname = filenames[i]

    fin = open(pathname, 'r')
    with codecs.open(('Assets'+'.log'), mode='w', encoding='utf-8') as f:
        f.write(u'File Path|Brand\n')
        lines = fin.read()
        brand_start = lines.find("Brand Title")
        brand_end = lines.find("/>",brand_start)
        brand = lines [brand_start+47:brand_end-2]
        f.write(u'{}|{}\n'.format(pathname[4:35],brand))

flog.close()

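I'm sure there is a better way to write the whole thing, but at the moment my focus is just on trying to understand how to get the lines/read functions to work with UTF-8.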

Answer 1

You are mixing bytestrings with Unicode values; your fin file object produces bytestrings, and you are mixing them with Unicode here:

f.write(u'{}|{}\n'.format(pathname[4:35],brand))

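brand is a bytestring, interpolated into a Unicode format string. Python 2 then tries to decode it implicitly as ASCII, which is what raises the UnicodeDecodeError. Either decode brand there, or better yet, use io.open() (rather than codecs.open(), which is not as robust as the newer io module) to manage both your files: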

with io.open('Assets.log', 'w', encoding='utf-8') as f,\
        io.open(pathname, encoding='utf-8') as fin:
    f.write(u'File Path|Brand\n')
    lines = fin.read()
    brand_start = lines.find(u"Brand Title")
    brand_end = lines.find(u"/>", brand_start)
    brand = lines[brand_start + 47:brand_end - 2]
    f.write(u'{}|{}\n'.format(pathname[4:35], brand))
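
If you would rather keep the plain open() and decode brand yourself (the first option mentioned above), a minimal sketch could look like the following. The sample bytes, the path string and the in-memory log are made up for illustration; the only assumption about your data is that the files are UTF-8 encoded.

```python
import io

# Hypothetical sample: the UTF-8 bytes for u'\u0141\u00f3d\u017a'.
# Note the leading 0xc5 byte, the same byte the traceback complains about.
raw_brand = b'\xc5\x81\xc3\xb3d\xc5\xba'

# Python 2 decodes byte strings as ASCII when they are mixed with Unicode.
# Doing that decode explicitly reproduces the error.
try:
    raw_brand.decode('ascii')
    ascii_error = None
except UnicodeDecodeError as exc:
    ascii_error = exc

# Decode once, with the right codec, so everything downstream is Unicode text.
brand = raw_brand.decode('utf-8')
line = u'{}|{}\n'.format(u'some/path.xml', brand)

# io.StringIO stands in for a log opened with io.open(..., encoding='utf-8').
log = io.StringIO()
log.write(line)
```

Using io.open() for the input file as well simply moves that decode step into the file object, so no byte strings reach the format call.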

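You also appear to be parsing the XML file by hand; perhaps you want to use the ElementTree API to extract those values instead. In that case, you'd open the file without io.open(), so producing byte strings, so that the XML parser can correctly decode the information to Unicode values for you.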
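
As a rough sketch of that ElementTree route: the document below, including the Assets and Brand element names and the Title attribute, is hypothetical, since the real file layout is not shown in the question. The point is only that the parser receives bytes and hands back Unicode values.

```python
import xml.etree.ElementTree as ET

# Hypothetical XML, encoded to bytes the way it would come off disk
# when the file is opened in binary mode.
xml_bytes = (
    u'<?xml version="1.0" encoding="UTF-8"?>'
    u'<Assets><Brand Title="\u0141\u00f3d\u017a"/></Assets>'
).encode('utf-8')

# The parser reads the encoding declaration and decodes the bytes itself.
root = ET.fromstring(xml_bytes)

# find() returns None when the element is absent, so guard before get().
brand_el = root.find('Brand')
brand = brand_el.get('Title') if brand_el is not None else u'Missing'
```

For a file on disk, ET.parse(pathname) does the same job and also accepts a file object opened in binary mode.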

Answer 2

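This is my final code, using the guidance above. It's not pretty, but it solves the problem. I'll look at getting it all working with lxml at a later date (as this is something I've encountered before when working with different, larger xml files):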

import io
import os
from glob import glob

from lxml import etree

nsmap = {'xmlns': 'thisnamespace'}

filenames = [y for x in os.walk("Distributor") for y in glob(os.path.join(x[0], '*.xml'))]

with io.open('Assets.log', 'w', encoding='utf-8') as f:
    f.write(u'File Path|Series|Brand\n')

    for pathname in filenames:
        # lxml opens the file itself and decodes the XML from the raw bytes.
        tree = etree.parse(pathname, etree.XMLParser())
        root = tree.getroot()

        # Read the text once per file; a second fin.read() would return u''.
        with io.open(pathname, encoding='utf-8') as fin:
            lines = fin.read()

        for info in root.xpath('//somepath'):
            series_x = info.find('./somemorepath')
            series = series_x.get('Asset_Name') if series_x is not None else u'Missing'
            brand_start = lines.find(u"sometext")
            brand_end = lines.find(u"/>", brand_start)
            brand = lines[brand_start:brand_end - 2]
            brand = brand[brand.rfind(u"/") + 1:]
            f.write(u'{}|{}|{}\n'.format(pathname[5:42], series, brand))

Now somebody's going to come along and do it all in one line!
