将 textwrap.dedent() 与 Python 中的字节一起使用 3

Using textwrap.dedent() with bytes in Python 3

当我在 Python 中使用三引号多行字符串时,我倾向于使用 textwrap.dedent 来保持代码的可读性,缩进良好:

some_string = textwrap.dedent("""
    First line
    Second line
    ...
    """).strip()

但是,在 Python 3.x 中,textwrap.dedent 似乎不适用于字节字符串。我在为返回长多行字节字符串的方法编写单元测试时遇到了这个问题,例如:

# The function to be tested

def some_function():
    return b'Lorem ipsum dolor sit amet\n  consectetuer adipiscing elit'

# Unit test

import unittest
import textwrap

class SomeTest(unittest.TestCase):
    def test_some_function(self):
        self.assertEqual(some_function(), textwrap.dedent(b"""
            Lorem ipsum dolor sit amet
              consectetuer adipiscing elit
            """).strip())

if __name__ == '__main__':
    unittest.main()

在 Python 2.7.10 中上面的代码工作正常,但在 Python 3.4.3 中它失败了:

E
======================================================================
ERROR: test_some_function (__main__.SomeTest)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "test.py", line 16, in test_some_function
    """).strip())
  File "/usr/lib64/python3.4/textwrap.py", line 416, in dedent
    text = _whitespace_only_re.sub('', text)
TypeError: can't use a string pattern on a bytes-like object

----------------------------------------------------------------------
Ran 1 test in 0.001s

FAILED (errors=1)

所以:是否有 textwrap.dedent 的替代方案,适用于字节字符串?

答案 1:三引号多行字符串(和缩进)是一种便利(有时),而不是必需的。您可以改为为每行编写一个以 b'\n' 结尾的单独字节文字,然后让解析器加入它们。示例:

>>> b = (
    b'Lorem ipsum dolor sit amet\n' # first line
    b'consectetuer adipiscing elit\n' # 2nd line
    )
>>> b
b'Lorem ipsum dolor sit amet\nconsectetuer adipiscing elit\n'

我故意在结果字节中不需要的代码中添加了空格和注释,以防它们不包含在内。我有时会用文本字符串执行与上述相同的操作。

答案 2:将 textwrap.dedent 转换为处理字节(参见单独的答案)

答案3:省略b前缀,在.strip()前后添加.encode()

print(textwrap.dedent("""
            Lorem ipsum dolor sit amet
              consectetuer adipiscing elit
            """).encode())
# prints (same as Answer 2).
b'\nLorem ipsum dolor sit amet\n  consectetuer adipiscing elit\n'

答案2:textwrap主要是关于Textwrapclass和函数。 dedent 列在

# -- Loosely related functionality --------------------

据我所知, 使其成为文本 (unicode str) 的具体内容是 re 文字。我用 b 为所有 6 个前缀,瞧! (我没有编辑其他任何东西,但应该调整函数文档字符串。)

import re

_whitespace_only_re = re.compile(b'^[ \t]+$', re.MULTILINE)
_leading_whitespace_re = re.compile(b'(^[ \t]*)(?:[^ \t\n])', re.MULTILINE)

def dedent_bytes(text):
    """Remove any common leading whitespace from every line in `text`.

    This can be used to make triple-quoted strings line up with the left
    edge of the display, while still presenting them in the source code
    in indented form.

    Note that tabs and spaces are both treated as whitespace, but they
    are not equal: the lines "  hello" and "\thello" are
    considered to have no common leading whitespace.  (This behaviour is
    new in Python 2.5; older versions of this module incorrectly
    expanded tabs before searching for common leading whitespace.)
    """
    # Look for the longest leading string of spaces and tabs common to
    # all lines.
    margin = None
    text = _whitespace_only_re.sub(b'', text)
    indents = _leading_whitespace_re.findall(text)
    for indent in indents:
        if margin is None:
            margin = indent

        # Current line more deeply indented than previous winner:
        # no change (previous winner is still on top).
        elif indent.startswith(margin):
            pass

        # Current line consistent with and no deeper than previous winner:
        # it's the new winner.
        elif margin.startswith(indent):
            margin = indent

        # Find the largest common whitespace between current line
        # and previous winner.
        else:
            for i, (x, y) in enumerate(zip(margin, indent)):
                if x != y:
                    margin = margin[:i]
                    break
            else:
                margin = margin[:len(indent)]

    # sanity check (testing/debugging only)
    if 0 and margin:
        for line in text.split(b"\n"):
            assert not line or line.startswith(margin), \
                   "line = %r, margin = %r" % (line, margin)

    if margin:
        text = re.sub(rb'(?m)^' + margin, b'', text)
    return text

print(dedent_bytes(b"""
            Lorem ipsum dolor sit amet
              consectetuer adipiscing elit
            """)
      )

# prints
b'\nLorem ipsum dolor sit amet\n  consectetuer adipiscing elit\n'

似乎 dedent 不支持字节串,很遗憾。但是,如果您想要交叉兼容的代码,我建议您利用 six 库:

import sys, unittest
from textwrap import dedent

import six


def some_function():
    return b'Lorem ipsum dolor sit amet\n  consectetuer adipiscing elit'


class SomeTest(unittest.TestCase):
    def test_some_function(self):
        actual = some_function()

        expected = six.b(dedent("""
            Lorem ipsum dolor sit amet
              consectetuer adipiscing elit
            """)).strip()

        self.assertEqual(actual, expected)

if __name__ == '__main__':
    unittest.main()

这与您在问题中的要点建议相似

I could convert to unicode, use textwrap.dedent, and convert back to bytes. But this is only viable if the byte string conforms to some Unicode encoding.

但是你在这里误解了一些关于编码的东西——如果你能像一开始那样在你的测试中写字符串文字,并且文件被 python 成功解析(即正确的编码声明在模块上),则此处没有 "convert to unicode" 步骤。该文件以指定的编码(或 sys.defaultencoding,如果您未指定)进行解析,然后当字符串是一个 python 变量时,它已经被解码。