How to split a UTF-8 string by an escape sequence provided as a command-line argument in Python 3?
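I am trying to split a UTF-8 string in Python 3 by a delimiter supplied as a command-line argument. The TAB character "\t" should be a valid option. Unfortunately, I have not found a way to get the escape sequence interpreted. I wrote a small test script called "test.py":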
# coding: utf8
import sys

print(sys.argv[1])

l1 = u"12345\tktktktk".split(sys.argv[1])
print(l1)

l2 = u"633\tbgt".split(sys.argv[1])
print(l2)
I tried running the script as follows (in a guake shell on a Kubuntu Linux host):
- python3 test.py \t
- python3 test.py \t
- python3 test.py '\t'
- python3 test.py "\t"
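None of these worked. I also tried a larger file containing "real" (unfortunately confidential) data where, for some strange reason, the lines were split correctly in many (but by far not all) cases when using the first invocation.
What is the correct way to make Python 3 interpret a command-line argument as an escape sequence rather than as a literal string?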
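At least in Bash on Linux, you need to press CTRL+V and then TAB to insert a literal tab character.
Example (press the keys between the quotes rather than typing the words):
python utfsplit.py '<CTRL+V><TAB>'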
Your code otherwise works:
$ python3.4 utfsplit.py ' '
['12345', 'ktktktk']
['633', 'bgt']
Note: the tab character cannot be displayed here :)
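As a rough sketch of why the invocations in the question fail, the snippet below applies `str.split` to the strings that Bash would be expected to hand to Python for each quoting style. The variable and function names are made up for illustration, and the mapping from command line to argument assumes standard Bash quoting rules: an unquoted \t loses its backslash, while '\t' and "\t" pass the backslash and the letter t through unchanged.

```python
# Hypothetical illustration: the values below are what Bash is assumed to
# pass as sys.argv[1] for each way of writing the delimiter.
sample = "12345\tktktktk"

unquoted = "t"     # python3 test.py \t          (Bash drops the backslash)
quoted = "\\t"     # python3 test.py '\t' or "\t" (backslash plus letter t)
real_tab = "\t"    # a literal tab typed with CTRL+V TAB


def split_with(delimiter):
    """Split the sample string the same way test.py does."""
    return sample.split(delimiter)


print(split_with(unquoted))  # splits on the letter "t", not on the tab
print(split_with(quoted))    # delimiter never occurs, so nothing is split
print(split_with(real_tab))  # splits on the tab as intended
```

Printing `repr(sys.argv[1])` instead of `sys.argv[1]` in test.py is a quick way to check which of these cases a given invocation actually produces.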
You can use the $'...' quoting form:
python3 test.py $'\t'
Words of the form $'string' are treated specially. The word expands to string, with backslash-escaped characters replaced as specified by the ANSI C standard. Backslash escape sequences, if present, are decoded as follows:
\a        alert (bell)
\b        backspace
\e, \E    an escape character (not ANSI C)
\f        form feed
\n        newline
\r        carriage return
\t        horizontal tab <-
...
Output:
$ python3 test.py $'\t'
['12345', 'ktktktk']
['633', 'bgt']
This is especially useful when you want to give special characters as arguments to some programs, like giving a newline to sed.
The resulting text is treated as if it was single-quoted. No further expansions happen.
The $'...' syntax comes from ksh93, but is portable to most modern shells including pdksh. A specification for it was accepted for SUS issue 7. There are still some stragglers, such as most ash variants including dash, (except busybox built with "bash compatibility" features).
Or do it in Python:
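import sys

# Turn backslash escapes such as \t in the argument into the characters they denote.
# Caveat: unicode_escape reads the remaining bytes as Latin-1, so non-ASCII
# characters in the argument come out garbled.
arg = bytes(sys.argv[1], "utf-8").decode("unicode_escape")
print(arg)
l1 = u"12345\tktktktk".split(arg)
print(l1)
l2 = u"633\tbgt".split(arg)
print(l2)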
Output:
$ python3 test.py '\t'
['12345', 'ktktktk']
['633', 'bgt']
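One caveat with the `unicode_escape` round-trip: the codec reads bytes outside escape sequences as Latin-1, so a delimiter that contains non-ASCII characters comes out garbled. A minimal sketch of one possible workaround, using a hypothetical helper named `decode_escapes` that is not part of the original answer, is to encode with Latin-1 and the `backslashreplace` error handler first, so that every character outside Latin-1 becomes an escape the codec can restore.

```python
def decode_escapes(text):
    """Interpret backslash escapes in text while keeping non-ASCII characters.

    Hypothetical helper for illustration only.
    """
    # Characters beyond Latin-1 become \uXXXX or \UXXXXXXXX escapes; everything
    # else maps to single bytes, which unicode_escape reads back as Latin-1.
    return text.encode("latin-1", "backslashreplace").decode("unicode_escape")


# The two characters backslash and t, as received from '\t' on the command line.
delimiter = decode_escapes("\\t")
print("12345\tktktktk".split(delimiter))

# A non-ASCII delimiter passes through unchanged.
print("a→b→c".split(decode_escapes("→")))
```

In test.py this would amount to replacing `sys.argv[1]` with `decode_escapes(sys.argv[1])`. A malformed escape, such as a single trailing backslash, still raises `UnicodeDecodeError`, so a real script may want to catch that and report a usage error.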