Backslashes and escaping chars in Python vs Perl regexes

Question

The aim is to handle a tokenization task in NLP by porting a tokenizer from a Perl script to a Python script.

The main issue is the erroneous backslashes that show up when we run the Python port of the tokenizer.

In Perl, we might need to escape the single quote and the ampersand like this:

my($text) = @_; # Take the text passed in as the sub's first argument

$text =~ s=n't = n't =g; # Puts a space before the "n't" substring to tokenize english contractions like "don't" -> "do n't".

$text =~ s/\'/\&apos;/g;  # Escape the single quote so that it suits XML.

Porting the regexes literally into Python:

>>> import re
>>> from six import text_type
>>> sent = text_type("this ain't funny")
>>> escape_singquote = r"\'", r"\&apos;" # escape the left quote for XML
>>> contraction = r"n't", r" n't" # pad a space on the left when "n't" pattern is seen
>>> text = sent
>>> for regexp, substitution in [contraction, escape_singquote]:
...     text = re.sub(regexp, substitution, text)
...     print(text)
... 
this ai n't funny
this ai n\&apos;t funny

Escaping the ampersand somehow adds a literal backslash to the output =(

To resolve that, I could do:

>>> escape_singquote = r"\'", r"&apos;" # escape the left quote for XML
>>> text = sent
>>> for regexp, substitution in [contraction, escape_singquote]:
...     text = re.sub(regexp, substitution, text)
...     print(text)
... 
this ai n't funny
this ai n&apos;t funny

But it seems that we also get the desired result without escaping the single quote in Python:

>>> import re
>>> from six import text_type
>>> sent = text_type("this ain't funny")
>>> escape_singquote = r"\'", r"\&apos;" # escape the left quote for XML
>>> contraction = r"n't", r" n't" # pad a space on the left when "n't" pattern is seen
>>> escape_singquote = r"'", r"&apos;" # escape the left quote for XML
>>> text = sent
>>> for regexp, substitution in [contraction, escape_singquote]:
...     text = re.sub(regexp, substitution, text)
...     print(text)
... 
this ai n't funny
this ai n&apos;t funny

That's puzzling...

Given the context above, the question is: which characters do we need to escape in Python, and which in Perl? Aren't regexes in Perl and Python equivalent?

Answer 1

In both Perl and Python, you have to escape the following regex metacharacters if you want to match them outside of a character class¹:

{}[]()^$.|*+?\
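As a rough illustration of the list above, here is a minimal sketch assuming Python 3 and only the standard `re` module. The names `METACHARS`, `escaped`, `dot_unescaped` and `dot_escaped` are made up for this example. It uses `re.escape()` to backslash-escape the metacharacters and contrasts an escaped dot with an unescaped one.

```python
import re

# The metacharacters from the list above. This is a normal (non-raw) string,
# so "\\" at the end is a single backslash character.
METACHARS = "{}[]()^$.|*+?\\"

# re.escape() puts a backslash in front of every character that could be special.
escaped = re.escape(METACHARS)

# The escaped pattern matches the literal text exactly.
assert re.fullmatch(escaped, METACHARS) is not None

# Unescaped, "." matches any character; escaped, it matches only a literal dot.
dot_unescaped = re.findall(r"a.c", "abc a.c")
dot_escaped = re.findall(r"a\.c", "abc a.c")
print(dot_unescaped)  # ['abc', 'a.c']
print(dot_escaped)    # ['a.c']
```

`re.escape()` is handy when the text to match comes from user input; for hand-written patterns, escaping only the characters in the list is usually easier to read.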

Inside a character class, you have to escape metacharacters according to these rules²:

     Perl                          Python
-------------------------------------------------------------
-    unless at beginning or end    unless at beginning or end
]    always                        unless at beginning
\    always                        always
^    only if at beginning          only if at beginning
$    always                        never
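The Python column of the table can be checked with a few one-liners. This is only a sketch, assuming Python 3; the variable names are invented for the example, and it says nothing about the Perl column.

```python
import re

# "-" is literal at the end (or beginning) of a class, a range operator elsewhere.
hyphen_literal = re.findall(r"[b-]", "a-b")   # ['-', 'b']
hyphen_range = re.findall(r"[a-c]", "a-c")    # ['a', 'c']

# "]" is literal when it is the first character of the class.
bracket_first = re.findall(r"[]x]", "x]y")    # ['x', ']']

# "\" always has to be escaped.
backslash = re.findall(r"[\\]", "a\\b")       # ['\\']

# "^" negates the class only at the beginning; elsewhere, or escaped, it is literal.
caret_negated = re.findall(r"[^a]", "^ab")    # ['^', 'b']
caret_literal = re.findall(r"[a^]", "^ab")    # ['^', 'a']
caret_escaped = re.findall(r"[\^a]", "^ab")   # ['^', 'a']

# "$" never needs escaping inside a class.
dollar = re.findall(r"[$]", "US$5")           # ['$']
```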

Note that neither the single quote ' nor the ampersand & has to be escaped, either inside or outside of a character class.

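However, both Perl and Python will ignore the backslash if you use it to escape a punctuation character that is not a metacharacter (e.g. \' is equivalent to ' inside a regex).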

You seem to be getting tripped up by Python's raw strings:

When an 'r' or 'R' prefix is present, a character following a backslash is included in the string without change, and all backslashes are left in the string.

r"\'" is the string \' (literal backslash, literal single quote), while r'\&apos;' is the string \&apos; (literal backslash, literal ampersand, etc.).
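To make the raw-string behavior concrete, the sketch below inspects the characters these literals actually contain. It assumes Python 3, and the variable names are invented for the example.

```python
# In a raw string the backslash stays in the string as its own character.
raw_quote = r"\'"
print(len(raw_quote))    # 2
print(list(raw_quote))   # ['\\', "'"]

# In a normal string, \' is an escape sequence for a single quote character.
cooked_quote = "\'"
print(len(cooked_quote))  # 1

# The raw replacement string from the question starts with a real backslash.
raw_amp = r'\&apos;'
print(len(raw_amp))      # 7
print(raw_amp[0] == "\\")  # True
```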

So this:

re.sub(r"\'", r'\&apos;', text)

replaces all single quotes with the literal text \&apos;.
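The effect of the backslash in the replacement string can be isolated by keeping the pattern fixed and changing only the replacement. This is a minimal sketch assuming Python 3, where the `re.sub` documentation says unknown escapes such as `\&` in the replacement are left alone; the variable names are hypothetical.

```python
import re

sent = "this ain't funny"

# Same pattern both times; only the replacement differs.
with_backslash = re.sub(r"'", r"\&apos;", sent)
without_backslash = re.sub(r"'", r"&apos;", sent)

print(with_backslash)     # this ain\&apos;t funny
print(without_backslash)  # this ain&apos;t funny

# For a fixed string like this, str.replace avoids regex escaping entirely.
plain_replace = sent.replace("'", "&apos;")
print(plain_replace)      # this ain&apos;t funny
```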

Putting it all together, your Perl substitution is better written as:

$text =~ s/'/&apos;/g;

And your Python substitution is better written as:

re.sub(r"'", r'&apos;', text)

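¹ Python 2, Python 3, and current versions of Perl treat unescaped curly braces as literal curly braces if they are not part of a quantifier. However, this will be a syntax error in future versions of Perl, and recent versions of Perl give a warning.

² See perlretut, perlre, and the Python docs for the re module.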
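The note above about unescaped curly braces can be checked on the Python side with the sketch below. It assumes Python 3 and covers only Python's behavior, not Perl's; the variable names are made up for the example.

```python
import re

# "{2}" is a valid quantifier, so the pattern means "two a's in a row".
quantifier = re.findall(r"a{2}", "aa a{2}")
print(quantifier)     # ['aa']

# "{x}" is not a valid quantifier, so the braces are matched literally.
literal_brace = re.findall(r"a{x}", "aa a{x}")
print(literal_brace)  # ['a{x}']

# A bare "{}" is also treated as two literal characters.
empty_braces = re.findall(r"{}", "{} {")
print(empty_braces)   # ['{}']
```

Escaping the braces anyway (`\{` and `\}`) keeps a pattern unambiguous and makes it easier to move between Perl and Python.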
