How does ruamel.yaml determine the encoding of escaped byte sequences in a string?

Question

I'm having trouble figuring out where to modify or configure ruamel.yaml's loader to get it to parse some old YAML with the correct encoding. The essence of the problem is that an escaped byte sequence in the document seems to be interpreted as latin1, and after some source diving I have no earthly clue where it is doing that. Here is a code sample that demonstrates the behavior (this was run in Python 3.6 specifically):

from ruamel.yaml import YAML
yaml = YAML()
yaml.load('a:\n  b: "\\xE2\\x80\\x99"\n')  # Note that this is a str (that is, unicode) with escapes for the byte escapes in the YAML document
# ordereddict([('a', ordereddict([('b', 'â\x80\x99')]))])

Here are the same bytes decoded manually, just to show what it should parse to:

>>> b"\xE2\x80\x99".decode('utf8')
'’'

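Note that I don't have any control over the source document, so modifying it to produce the correct output with ruamel.yaml is out of the question.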

Answer 1

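ruamel.yaml doesn't interpret individual strings; it interprets the stream it is handed, i.e. the argument to .load(). If that argument is a byte stream or a file-like object, its encoding is determined from the BOM, defaulting to UTF-8. But again: that happens at the stream level, not at the level of individual scalar content after escapes have been interpreted. Since you hand .load() Unicode (as this is Python 3), that "stream" needs no further decoding. (Although irrelevant to this question: the decoding is done in reader.py:Reader, in the methods stream and determine_encoding.)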
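To make the stream-level point concrete, here is a rough sketch of BOM-based detection using only the standard library. The function name guess_stream_encoding is hypothetical and this is not ruamel.yaml's actual code; it only imitates the idea of checking for a UTF-16 BOM, defaulting to UTF-8 otherwise, and leaving an already-decoded str alone.

```python
import codecs

def guess_stream_encoding(stream):
    """Hypothetical helper: imitate BOM-based detection with a UTF-8 default."""
    if isinstance(stream, str):
        # Already Unicode: there is nothing left to decode.
        return None
    if stream.startswith(codecs.BOM_UTF16_LE):
        return 'utf-16-le'
    if stream.startswith(codecs.BOM_UTF16_BE):
        return 'utf-16-be'
    # No BOM found: assume UTF-8.
    return 'utf-8'

doc = 'a:\n  b: "\\xE2\\x80\\x99"\n'
print(guess_stream_encoding(doc))
print(guess_stream_encoding(doc.encode('utf-8')))
print(guess_stream_encoding(codecs.BOM_UTF16_LE + doc.encode('utf-16-le')))
```

Whichever branch is taken, the decision concerns the raw stream only; the \xE2 escapes inside the double-quoted scalar are still plain ASCII text at this stage.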

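A hex escape (of the form \xAB) just puts that specific hex value into the type the loader uses to construct the scalar, i.e. the value for key 'b', and that is a normal Python 3 str, i.e. Unicode in one of its internal representations. You get â in your output because \xE2 ends up as the code point U+00E2, which displays as â; it is never treated as a byte that still needs decoding.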

So you won't "find" the place where ruamel.yaml decodes that byte sequence, because it is already assumed to be Unicode.
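The effect can be reproduced without any YAML library. The following is a small illustration, assuming each \xNN escape yields the code point with that number: the three escapes become three characters, which is exactly what you get when the UTF-8 bytes of ’ are misread as latin-1.

```python
# Each \xNN escape names a code point in the range 0-255, not a raw byte.
escaped = ''.join(chr(n) for n in (0xE2, 0x80, 0x99))

# Three separate characters, the first of which is U+00E2 (a with circumflex).
print(len(escaped), hex(ord(escaped[0])))

# The same string results from reading the UTF-8 bytes of U+2019 as latin-1.
mojibake = '\u2019'.encode('utf-8').decode('latin-1')
print(escaped == mojibake)

# Reversing that step recovers the intended right single quotation mark.
print(escaped.encode('latin-1').decode('utf-8') == '\u2019')
```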

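So the thing to do is to double decode your double-quoted scalars (you only have to deal with those, as plain, single-quoted, and literal/folded scalars cannot have hex escapes). There are various points at which you can try to do that, but I think constructor.py:RoundTripConstructor.construct_scalar and scalarstring.py:DoubleQuotedScalarString are the best candidates. The former might take some digging to find, but the latter is actually the type you get if you inspect the string after loading with the option to preserve quotes enabled: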

import ruamel.yaml

yaml = ruamel.yaml.YAML()
yaml.preserve_quotes = True
data = yaml.load('a:\n  b: "\\xE2\\x80\\x99"\n')
print(type(data['a']['b']))

This prints:

<class 'ruamel.yaml.scalarstring.DoubleQuotedScalarString'>

Knowing that, you can inspect that rather simple wrapper class:

class DoubleQuotedScalarString(ScalarString):
    __slots__ = ()

    style = '"'

    def __new__(cls, value, anchor=None):
        # type: (Text, Any) -> Any
        return ScalarString.__new__(cls, value, anchor=anchor)

and "update" the only method there (__new__) to do your double decoding (you might have to add checks so that you don't double decode every double-quoted scalar):

import ruamel.yaml

def my_new(cls, value, anchor=None):
    # type information only needed if using mypy
    # value is of type 'str': encode it to bytes "without conversion" (latin-1 maps
    # code points 0-255 to the same byte values), then decode those bytes as UTF-8
    value = value.encode('latin_1').decode('utf-8')
    return ruamel.yaml.scalarstring.ScalarString.__new__(cls, value, anchor=anchor)

ruamel.yaml.scalarstring.DoubleQuotedScalarString.__new__ = my_new

yaml = ruamel.yaml.YAML()
yaml.preserve_quotes = True
data = yaml.load('a:\n  b: "\\xE2\\x80\\x99"\n')
print(data)

which gives:

ordereddict([('a', ordereddict([('b', '’')]))])
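As for the additional checks mentioned above, one possible guard is sketched here. The name maybe_redecode is hypothetical and not part of ruamel.yaml; the idea is to re-decode only when the latin-1 to UTF-8 round trip succeeds and to leave every other string untouched.

```python
def maybe_redecode(value):
    """Hypothetical guard: re-decode only if the bytes form valid UTF-8."""
    try:
        return value.encode('latin-1').decode('utf-8')
    except (UnicodeEncodeError, UnicodeDecodeError):
        # Either the text has code points above 255 (so it was not made of
        # byte escapes) or the bytes are not valid UTF-8: keep it as it is.
        return value

print(maybe_redecode('\xe2\x80\x99') == '\u2019')  # escaped UTF-8 bytes are repaired
print(maybe_redecode('caf\xe9') == 'caf\xe9')      # genuine latin-1 text is kept
print(maybe_redecode('\u2019') == '\u2019')        # already correct text is kept
```

Calling such a helper from my_new instead of the unconditional encode/decode would avoid exceptions on scalars that are already fine. It remains a heuristic: text that happens to look like misread UTF-8 would still be rewritten.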
