Backslashes and escaping chars in Python vs Perl regexes
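The goal is to handle a tokenization task in NLP by porting a Perl script to a Python script.
The main issue is the erroneous backslashes that appear when we run the Python port of the tokenizer.
In Perl, we might need to escape the single quote and the ampersand like this: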
my($text) = @_; # Reading a text from stdin
$text =~ s=n't = n't =g; # Puts a space before the "n't" substring to tokenize English contractions like "don't" -> "do n't".
$text =~ s/\'/\&apos;/g; # Escape the single quote so that it suits XML.
Porting the regexes literally into Python:
>>> import re
>>> from six import text_type
>>> sent = text_type("this ain't funny")
>>> escape_singquote = r"\'", r"\&apos;" # escape the left quote for XML
>>> contraction = r"n't", r" n't" # pad a space on the left when "n't" pattern is seen
>>> text = sent
>>> for regexp, substitution in [contraction, escape_singquote]:
...     text = re.sub(regexp, substitution, text)
...     print(text)
...
this ai n't funny
this ai n\&apos;t funny
Escaping the ampersand somehow adds a literal backslash to the output =(
To resolve that, I could do:
>>> escape_singquote = r"\'", r"&apos;" # escape the left quote for XML
>>> text = sent
>>> for regexp, substitution in [contraction, escape_singquote]:
...     text = re.sub(regexp, substitution, text)
...     print(text)
...
this ai n't funny
this ai n&apos;t funny
But seemingly, without escaping the single quote in Python, we get the desired result too:
>>> import re
>>> from six import text_type
>>> sent = text_type("this ain't funny")
>>> contraction = r"n't", r" n't" # pad a space on the left when "n't" pattern is seen
>>> escape_singquote = r"'", r"&apos;" # escape the left quote for XML
>>> text = sent
>>> for regexp, substitution in [contraction, escape_singquote]:
...     text = re.sub(regexp, substitution, text)
...     print(text)
...
this ai n't funny
this ai n&apos;t funny
Now that's puzzling...
Given the context above, the question is: which characters do we need to escape in Python, and which in Perl? Aren't regexes in Perl and Python equivalent?
In both Perl and Python, you have to escape the following regex metacharacters if you want to match them outside of a character class:
{}[]()^$.|*+?\
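To make the list above concrete, here is a minimal Python 3 sketch. It assumes the standard `re` module; the names `METACHARS`, `unescaped_dot_hits`, and `escaped_dot_hits` are made up for illustration. It checks that each metacharacter matches itself once it is prefixed with a backslash, and shows what happens when a dot is left unescaped.

```python
import re

# The metacharacters listed above; the trailing backslash is appended
# separately because a raw string literal cannot end in a single backslash.
METACHARS = r"{}[]()^$.|*+?" + "\\"

# Prefixing each metacharacter with a backslash makes it match itself literally.
for ch in METACHARS:
    assert re.fullmatch("\\" + ch, ch)

# re.escape() adds the backslashes for you.
assert re.fullmatch(re.escape(METACHARS), METACHARS)

# An unescaped dot matches any character; an escaped dot only matches a dot.
unescaped_dot_hits = re.findall(r"a.c", "abc a.c")
escaped_dot_hits = re.findall(r"a\.c", "abc a.c")
print(unescaped_dot_hits)  # ['abc', 'a.c']
print(escaped_dot_hits)    # ['a.c']
```

If the text to match comes from user input, `re.escape()` is usually less error-prone than escaping by hand.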
Inside a character class, you have to escape metacharacters according to these rules:
     Perl                          Python
-------------------------------------------------------------
-    unless at beginning or end    unless at beginning or end
]    always                        unless at beginning
\    always                        always
^    only if at beginning          only if at beginning
$    always                        never
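The Python column of the table can be spot-checked with a short sketch like the one below (Python 3, standard `re` module only). The variable names are hypothetical, and the Perl column is not exercised here.

```python
import re

# "-" is literal at the beginning or end of a class, but needs a backslash in the middle.
dash_first = re.findall(r"[-ab]", "a-b")
dash_last = re.findall(r"[ab-]", "a-b")
dash_middle = re.findall(r"[a\-c]", "abc-")

# "]" is literal when it is the first character of the class.
bracket_first = re.findall(r"[]x]", "x]y")

# "^" only negates the class when it comes first; elsewhere it is literal.
caret_not_first = re.findall(r"[x^]", "x^y")

# "$" never needs escaping inside a Python character class.
dollar = re.findall(r"[$]", "cost: $5")

# A backslash always needs escaping.
backslash = re.findall(r"[\\]", "a\\b")

print(dash_first, dash_last, dash_middle)
print(bracket_first, caret_not_first, dollar, backslash)
```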
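Note that neither the single quote ' nor the ampersand & has to be escaped, whether inside or outside a character class.
However, both Perl and Python ignore the backslash if you use it to escape a punctuation character that is not a metacharacter (e.g. \' is equivalent to ' inside a regex).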
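A quick way to convince yourself of this on the Python side is the sketch below (Python 3; the sample string and variable names are made up). It only covers the pattern argument. The replacement argument of `re.sub` follows different rules.

```python
import re

text = "this ain't R&D"

# In a pattern, a backslash before a non-metacharacter punctuation mark is ignored,
# so the escaped and unescaped forms find the same matches.
plain_quote = re.findall(r"'", text)
escaped_quote = re.findall(r"\'", text)
plain_amp = re.findall(r"&", text)
escaped_amp = re.findall(r"\&", text)

print(plain_quote == escaped_quote)  # True
print(plain_amp == escaped_amp)      # True
```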
You seem to be getting tripped up by Python's raw strings:
When an 'r' or 'R' prefix is present, a character following a backslash is included in the string without change, and all backslashes are left in the string.
r"\'" is the string \' (literal backslash, literal single quote), while r'\&apos;' is the string \&apos; (literal backslash, literal ampersand, etc.).
So this:
re.sub(r"\'", r'\&apos;', text)
replaces all single quotes with the literal text \&apos;.
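The effect can be reproduced in isolation. This is a rough Python 3 sketch that assumes the intended replacement is the XML entity `&apos;`; the variable names are hypothetical. In the replacement string, `\&` is not a recognized escape, so `re.sub` leaves the backslash in the output.

```python
import re

pattern = r"\'"            # two characters: backslash, single quote
replacement = r"\&apos;"   # seven characters, starting with a literal backslash

# In the pattern the backslash is harmless: \' simply matches a single quote.
# In the replacement, \& is an unknown escape and is kept as-is.
with_backslash = re.sub(pattern, replacement, "ain't")

# Dropping the backslash from the replacement gives the intended XML entity.
without_backslash = re.sub(pattern, "&apos;", "ain't")

print(len(pattern), len(replacement))  # 2 7
print(with_backslash)                  # ain\&apos;t
print(without_backslash)               # ain&apos;t
```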
Putting it all together, your Perl substitution is better written as:
$text =~ s/'/&apos;/g;
and your Python substitution is better written as:
re.sub(r"'", r'&apos;', text)
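For completeness, here is one way the two substitutions from the question could be combined in Python 3. It is only a sketch: `SUBSTITUTIONS` and `tokenize_and_escape` are hypothetical names, it assumes the XML entity `&apos;` is the desired output, and it follows the Python port in the question rather than the exact whitespace handling of the Perl original.

```python
import re

# (pattern, replacement) pairs, applied in order.
SUBSTITUTIONS = [
    (r"n't", r" n't"),   # pad a space before "n't" to split contractions
    (r"'", "&apos;"),    # replace the single quote with its XML entity
]

def tokenize_and_escape(text):
    """Apply each substitution in turn and return the rewritten text."""
    for pattern, replacement in SUBSTITUTIONS:
        text = re.sub(pattern, replacement, text)
    return text

result = tokenize_and_escape("this ain't funny")
print(result)  # this ai n&apos;t funny
```

Neither the pattern nor the replacement needs a backslash here, which is why the unescaped version in the question already worked.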