Regex statement to replace spaces with underscores between words starting with a capital letter

The input is as follows:

Roger Federer is a tennis player. Rafael Nadal Parera is also a tennis player. Another legend player is Novak Djokovic.

I expect output like this:

Roger_Federer is a tennis player. Rafael_Nadal_Parera is also a tennis player. Another legend player is Novak_Djokovic.

The solution I tried uses a positive lookbehind (with the Python re package):

re.sub(r"(?<=\w)\s([A-Z])", r"_\1", above_string)

But here, because of the \w, the space after any word character is replaced, so I get this output:

Roger_Federer is a tennis player. Rafael_Nadal_Parera is also a tennis player. Another legend player is_Novak_Djokovic.

Of course, I cannot make it work with r"(?<=[A-Z]\w*)\s([A-Z])", because

error: look-behind requires fixed-width pattern

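I have to apply this regex to a large (and highly varied) set of articles, so I cannot afford any loops or brute-force str.replace calls. I would like to know whether anyone can suggest an efficient solution.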

If you do not need to handle all Unicode uppercase letters, you can use

import re
above_string = "Roger Federer is a tennis player. Rafael Nadal Parera is also a tennis player. Another legend player is Novak Djokovic."
print( re.sub(r"\b([A-Z]\w*)\s+(?=[A-Z])", r"\1_", above_string) )
# => Roger_Federer is a tennis player. Rafael_Nadal_Parera is also a tennis player. Another legend player is Novak_Djokovic.

Pattern details:

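  • \b - a word boundary
  • ([A-Z]\w*) - Group 1 (\1): an uppercase ASCII letter followed by zero or more word characters
  • \s+ - one or more whitespace characters
  • (?=[A-Z]) - a positive lookahead that matches a position immediately followed by an uppercase ASCII letter.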
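
Since the question mentions running this over many articles, here is a minimal sketch of how the same pattern could be compiled once and reused. The names CAPITALIZED_GAP and join_capitalized_words are hypothetical helpers chosen for illustration; only the pattern and the \1_ replacement come from the answer above.

```python
import re

# Compiled once so the same pattern object is reused for every article.
CAPITALIZED_GAP = re.compile(r"\b([A-Z]\w*)\s+(?=[A-Z])")

def join_capitalized_words(text: str) -> str:
    """Replace whitespace between consecutive capitalized words with "_"."""
    # \1 puts the captured word back; the lookahead consumes nothing,
    # so the next capitalized word can start a new match.
    return CAPITALIZED_GAP.sub(r"\1_", text)

articles = [
    "Roger Federer is a tennis player.",
    "Another legend player is Novak Djokovic.",
]
for article in articles:
    print(join_capitalized_words(article))
# => Roger_Federer is a tennis player.
# => Another legend player is Novak_Djokovic.
```

Because the first word is captured rather than matched inside a lookbehind, it can have any length, which sidesteps the fixed-width lookbehind restriction of re.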

If you need to support all Unicode letters, I suggest running pip install regex and using

import regex
above_string = "Roger Federer is a tennis player. Rafael Nadal Parera is also a tennis player. Another legend player is Novak Djokovic."
print( regex.sub(r"\b(\p{Lu}\w*)\s+(?=\p{Lu})", r"\1_", above_string) )
# => Roger_Federer is a tennis player. Rafael_Nadal_Parera is also a tennis player. Another legend player is Novak_Djokovic.

Here, \p{Lu} matches any Unicode uppercase letter.
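
If installing a third-party package is not an option, a rough standard-library approximation is possible. This is only a sketch and not part of the answer above: it swaps \p{Lu} for a str.isupper() check inside a replacement callback, which may disagree with \p{Lu} on a few unusual characters, and it will likely be slower because the callback runs for every gap between words. WORD_GAP, _join_if_capitalized and join_capitalized_words_unicode are hypothetical names.

```python
import re

# Match any word, the whitespace after it, and peek at the next word's first character.
WORD_GAP = re.compile(r"\b(\w+)\s+(?=(\w))")

def _join_if_capitalized(match: re.Match) -> str:
    word, next_char = match.group(1), match.group(2)
    # str.isupper() stands in for \p{Lu} here (an approximation).
    if word[0].isupper() and next_char.isupper():
        return word + "_"
    # Otherwise leave the word and its trailing whitespace untouched.
    return match.group(0)

def join_capitalized_words_unicode(text: str) -> str:
    return WORD_GAP.sub(_join_if_capitalized, text)

print(join_capitalized_words_unicode("Émile Zola and Łukasz Kubot are mentioned."))
# => Émile_Zola and Łukasz_Kubot are mentioned.
```

The ASCII-only [A-Z] version would leave both names in this sentence unchanged, since neither É nor Ł falls in that range.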