How to split a UTF-8 string by an escape sequence provided as a command-line argument in Python 3?

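I am trying to split a UTF-8 string by a delimiter provided as a command-line argument in Python 3. The TAB character "\t" should be a valid option. Unfortunately, I have not found any way to get the escape sequence interpreted. I wrote a small test script called "test.py":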
    # coding: utf8
    import sys

    print(sys.argv[1])

    l1 = u"12345\tktktktk".split(sys.argv[1])
    print(l1)

    l2 = u"633\tbgt".split(sys.argv[1])
    print(l2)

I tried running the script as follows (in a guake shell on a Kubuntu Linux host):

  1. python3 test.py \t
  2. python3 test.py \t
  3. python3 test.py '\t'
  4. python3 test.py "\t"

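None of these attempts worked. I also tried a larger file containing "real" (unfortunately confidential) data, where for some strange reason the lines were split correctly in many (but by no means all) cases when using the first call.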

What is the correct way to make Python 3 interpret a command-line argument as an escape sequence rather than as a literal string?
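A likely explanation for the failed calls, offered as a sketch rather than a definitive diagnosis: Bash processes quoting before Python starts, so an unquoted \t reaches the script as the single letter t, while '\t' and "\t" reach it as two characters, a backslash followed by t. None of these is a real TAB. The snippet below simulates the resulting values of sys.argv[1] as Python literals; the names sample and received are made up for illustration.

```python
sample = "12345\tktktktk"

# Hypothetical mapping from each shell invocation to the value Bash would
# hand to Python as sys.argv[1] (assuming standard Bash quoting rules).
received = {
    r"python3 test.py \t": "t",        # unquoted backslash is removed by Bash
    r"python3 test.py '\t'": "\\t",    # backslash + t, two characters
    r'python3 test.py "\t"': "\\t",    # same result inside double quotes
    r"python3 test.py $'\t'": "\t",    # ANSI-C quoting yields a real TAB
}

for command, delimiter in received.items():
    # repr() makes the otherwise invisible TAB visible in the output.
    print(f"{command:<24} argv[1]={delimiter!r:<6} -> {sample.split(delimiter)}")
```

Only the last entry splits the sample at the TAB; the first one splits at every letter t instead.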

At least in Bash on Linux, you need to press CTRL+V and then TAB to enter a literal tab character:

Example:

python utfsplit.py '<CTRL+V><TAB>'

Your code otherwise works:

$ python3.4 utfsplit.py '       '

['12345', 'ktktktk']
['633', 'bgt']

Note: the tab character cannot be displayed here :)
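Because a raw TAB is invisible in terminal output, printing the argument with repr() can make it easier to see what the script actually received. This is a small debugging sketch; the helper name show_delimiter is invented for illustration and is not part of the original script.

```python
def show_delimiter(arg):
    """Return a readable description of a delimiter string."""
    # List the code point of every character so invisible ones stand out.
    code_points = " ".join(f"U+{ord(ch):04X}" for ch in arg)
    return f"{arg!r} ({len(arg)} character(s): {code_points})"


print(show_delimiter("\t"))    # a real TAB character
print(show_delimiter("\\t"))   # a backslash followed by the letter t
```

In test.py this could be called as show_delimiter(sys.argv[1]) in place of the plain print.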

You can use $'...' quoting:

python3 test.py $'\t'

ANSI-C Quoting (Bash Reference Manual):

Words of the form $'string' are treated specially. The word expands to string, with backslash-escaped characters replaced as specified by the ANSI C standard. Backslash escape sequences, if present, are decoded as follows:

\a        alert (bell)
\b        backspace
\e, \E    an escape character (not ANSI C)
\f        form feed
\n        newline
\r        carriage return
\t        horizontal tab   <-
............

Output:

$ python3 test.py $'\t'
    
['12345', 'ktktktk']
['633', 'bgt']

wiki.bash-hackers

This is especially useful when you want to give special characters as arguments to some programs, like giving a newline to sed.

The resulting text is treated as if it was single-quoted. No further expansions happen.

The $'...' syntax comes from ksh93, but is portable to most modern shells including pdksh. A specification for it was accepted for SUS issue 7. There are still some stragglers, such as most ash variants including dash, (except busybox built with "bash compatibility" features).

Or use Python:

import sys

# Decode escape sequences such as "\t" in the argument into real characters.
arg = bytes(sys.argv[1], "utf-8").decode("unicode_escape")

print(arg)

l1 = u"12345\tktktktk".split(arg)
print(l1)

l2 = u"633\tbgt".split(arg)
print(l2)

Output:

$ python3 test.py '\t'
    
['12345', 'ktktktk']
['633', 'bgt']
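One caveat with the one-liner above: unicode_escape reads the bytes as Latin-1, so a non-ASCII delimiter encoded as UTF-8 would come out garbled. The following is a possible variant, not the answer's own method, that encodes with latin-1 and backslashreplace first so that non-ASCII characters survive. The function name decode_escapes and the argparse wiring are assumptions made for this sketch.

```python
import argparse


def decode_escapes(text):
    """Turn escape sequences such as \\t into the characters they name."""
    # latin-1 + backslashreplace keeps every character either as a single
    # byte or as a \uXXXX escape, which unicode_escape then restores.
    return text.encode("latin-1", "backslashreplace").decode("unicode_escape")


parser = argparse.ArgumentParser(description="Split sample strings by a delimiter.")
# argparse applies decode_escapes to the raw argument before storing it.
parser.add_argument("delimiter", type=decode_escapes)

# Simulated command line; a real script would call parser.parse_args() instead.
args = parser.parse_args([r"\t"])
print("12345\tktktktk".split(args.delimiter))
print(repr(decode_escapes("é\\t")))
```

With this in place the script can still be called as python3 test.py '\t', and a delimiter such as 'é' is passed through unchanged.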