Regex statement to replace spaces with underscores between words starting with a capital letter

Question

The input is as follows:

Roger Federer is a tennis player. Rafael Nadal Parera is also a tennis player. Another legend player is Novak Djokovic.

I expect output like this:

Roger_Federer is a tennis player. Rafael_Nadal_Parera is also a tennis player. Another legend player is Novak_Djokovic.

My attempt, using a positive lookbehind (with Python's re package), was:

re.sub(r"(?<=\w)\s([A-Z])", r"_\1", above_string)

But here, because \w matches any word character and not only capitalized words, I get this output:

Roger_Federer is a tennis player. Rafael_Nadal_Parera is also a tennis player. Another legend player is_Novak_Djokovic.

Of course, I cannot get it to work with r"(?<=[A-Z]\w*)\s([A-Z])", because that raises

error: look-behind requires fixed-width pattern

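I have to apply this regex to a large (and widely varied) set of articles, so I cannot afford any loops or brute-force str.replace calls. I would like to know whether anyone can suggest an efficient solution.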

Answer 1

If you do not need to handle all Unicode uppercase letters, you can use

import re
above_string = "Roger Federer is a tennis player. Rafael Nadal Parera is also a tennis player. Another legend player is Novak Djokovic."
print(re.sub(r"\b([A-Z]\w*)\s+(?=[A-Z])", r"\1_", above_string))
# => Roger_Federer is a tennis player. Rafael_Nadal_Parera is also a tennis player. Another legend player is Novak_Djokovic.

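See the Python demo and the regex demo. Details:

\b - a word boundary
([A-Z]\w*) - Group 1 (\1): an uppercase letter followed by zero or more word characters
\s+ - one or more whitespace characters
(?=[A-Z]) - a positive lookahead that matches a position immediately followed by an uppercase letter.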
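
For reuse across many articles, the same pattern can be precompiled once and wrapped in a small function. The following is only a sketch: the names NAME_GAP and join_capitalized are hypothetical, and the pattern is simply the one described above written in verbose mode.

```python
import re

# Hypothetical name; same pattern as above, spelled out with re.VERBOSE.
NAME_GAP = re.compile(
    r"""
    \b([A-Z]\w*)   # group 1: a word starting with an ASCII uppercase letter
    \s+            # the whitespace run to be replaced
    (?=[A-Z])      # only if the next word also starts with an uppercase letter
    """,
    re.VERBOSE,
)

def join_capitalized(text: str) -> str:
    """Replace whitespace between consecutive capitalized words with "_"."""
    # \1 puts the captured word back; the lookahead consumes nothing,
    # so the following word can itself start the next match.
    return NAME_GAP.sub(r"\1_", text)

sample = "Another legend player is Novak Djokovic. Rafael Nadal Parera agrees."
print(join_capitalized(sample))
# => Another legend player is Novak_Djokovic. Rafael_Nadal_Parera agrees.
```

Note that this rule is purely lexical: a capitalized sentence-initial word directly followed by a name (for example "Yesterday Roger") would be joined as well, and \s+ also matches line breaks.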

If you need to support all Unicode letters, I suggest running pip install regex and using

import regex
above_string = "Roger Federer is a tennis player. Rafael Nadal Parera is also a tennis player. Another legend player is Novak Djokovic."
print(regex.sub(r"\b(\p{Lu}\w*)\s+(?=\p{Lu})", r"\1_", above_string))
# => Roger_Federer is a tennis player. Rafael_Nadal_Parera is also a tennis player. Another legend player is Novak_Djokovic.

See this Python demo. Here, \p{Lu} matches any Unicode uppercase letter.
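
If installing a third-party package is not an option, a rough standard-library alternative is to match every pair of adjacent words and let a replacement function decide with str.isupper(). This is a different technique from the one in the answer, the helper names are hypothetical, and calling a Python function for each match is likely slower than a pure pattern, so treat it as a sketch only.

```python
import re

# Hypothetical names; an alternative sketch, not the approach from the answer.
# Group 1: a word, group 2: the whitespace, group 3: first character of the next word
# (captured inside the lookahead, so it is not consumed).
WORD_GAP = re.compile(r"\b(\w+)(\s+)(?=(\w))")

def _join_if_both_upper(m: re.Match) -> str:
    first_word, next_initial = m.group(1), m.group(3)
    # str.isupper() is Unicode-aware, so "É" and "Ö" count as uppercase.
    if first_word[0].isupper() and next_initial.isupper():
        return first_word + "_"
    return m.group(0)  # leave every other word pair untouched

def join_capitalized_unicode(text: str) -> str:
    return WORD_GAP.sub(_join_if_both_upper, text)

print(join_capitalized_unicode("Élodie Ávila met Novak Djokovic in Österreich."))
# => Élodie_Ávila met Novak_Djokovic in Österreich.
```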
