Regex statement to replace spaces with underscores between words starting with a capital letter
The input is as follows:
Roger Federer is a tennis player. Rafael Nadal Parera is also a tennis player. Another legend player is Novak Djokovic.
I expect output like this:
Roger_Federer is a tennis player. Rafael_Nadal_Parera is also a tennis player. Another legend player is Novak_Djokovic.
My attempted solution, which uses a positive lookbehind (with Python's re package), is:
re.sub(r"(?<=\w)\s([A-Z])", r"_\1", above_string)
But here, because of the \w, I get this output:
Roger_Federer is a tennis player. Rafael_Nadal_Parera is also a tennis player. Another legend player is_Novak_Djokovic.
Of course, I cannot make this work with r"(?<=[A-Z]\w*)\s([A-Z])", because Python raises:
error: look-behind requires fixed-width pattern
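I have to apply this regex to a large (and highly varied) set of articles, so I cannot afford any loops or str.replace brute force. I would like to know if anyone can suggest an efficient solution.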
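The two failures described in the question can be reproduced with a short script. This is only a sketch for illustration: the variable names (text, naive, lookbehind_error) are made up here, and the replacement string r"_\1" is assumed from the output the question reports.

```python
import re

text = "Another legend player is Novak Djokovic."

# (?<=\w) only checks that *some* word character precedes the space,
# so the space after the lowercase word "is" is replaced as well.
naive = re.sub(r"(?<=\w)\s([A-Z])", r"_\1", text)
print(naive)
# Another legend player is_Novak_Djokovic.

# A lookbehind containing \w* has no fixed width, so the stdlib re module
# refuses to compile it.
lookbehind_error = None
try:
    re.compile(r"(?<=[A-Z]\w*)\s([A-Z])")
except re.error as exc:
    lookbehind_error = exc
    print("re.error:", exc)
```

Both problems go away once the capitalized word is consumed by a capturing group instead of being checked in a lookbehind.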
If you do not care about all Unicode uppercase letters, you can use
import re
above_string = "Roger Federer is a tennis player. Rafael Nadal Parera is also a tennis player. Another legend player is Novak Djokovic."
print( re.sub(r"\b([A-Z]\w*)\s+(?=[A-Z])", r"\1_", above_string) )
# => Roger_Federer is a tennis player. Rafael_Nadal_Parera is also a tennis player. Another legend player is Novak_Djokovic.
Details:
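\b - a word boundary
([A-Z]\w*) - Group 1 (\1): an uppercase letter followed by zero or more word characters
\s+ - one or more whitespace characters
(?=[A-Z]) - a positive lookahead that matches a location immediately followed by an uppercase letter.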
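Since the pattern has to run over many articles, one option is to compile it once and reuse it. The following is a minimal sketch under that assumption; the names CAPITALIZED_GAP, join_capitalized and articles are hypothetical and not part of the original answer.

```python
import re

# Compile once so the same pattern object is reused for every article.
CAPITALIZED_GAP = re.compile(r"\b([A-Z]\w*)\s+(?=[A-Z])")


def join_capitalized(text: str) -> str:
    """Replace whitespace between consecutive capitalized words with '_'."""
    # \1 puts the captured word back; only the whitespace is swapped for '_'.
    return CAPITALIZED_GAP.sub(r"\1_", text)


articles = [
    "Roger Federer is a tennis player.",
    "Another legend player is Novak Djokovic.",
]
for article in articles:
    print(join_capitalized(article))
# Roger_Federer is a tennis player.
# Another legend player is Novak_Djokovic.
```

Note that the pattern joins any two adjacent capitalized words, so a capitalized sentence opener directly followed by a name ("Yesterday Roger Federer won") is joined as well, and \s+ also matches line breaks.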
If you need to support all Unicode letters, I recommend running pip install regex and using
import regex
above_string = "Roger Federer is a tennis player. Rafael Nadal Parera is also a tennis player. Another legend player is Novak Djokovic."
print( regex.sub(r"\b(\p{Lu}\w*)\s+(?=\p{Lu})", r"\1_", above_string) )
# => Roger_Federer is a tennis player. Rafael_Nadal_Parera is also a tennis player. Another legend player is Novak_Djokovic.
Here, \p{Lu} matches any Unicode uppercase letter.
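If installing a third-party package is not an option, a roughly equivalent check can be sketched with the standard library alone. This is a different technique from the answer above, not a drop-in for it: it matches every word-plus-whitespace gap and decides in a callback, using unicodedata.category, whether both neighbours start with an "Lu" character. The helper names (_GAP, _is_lu, join_capitalized_unicode) are hypothetical, and the callback will likely be slower than a pure-regex replacement.

```python
import re
import unicodedata

# Candidate gap: a whole word, whitespace, and a lookahead that captures
# the first character of the next word.
_GAP = re.compile(r"\b(\w+)\s+(?=(\w))")


def _is_lu(ch: str) -> bool:
    """True if ch is in Unicode general category Lu (uppercase letter)."""
    return unicodedata.category(ch) == "Lu"


def join_capitalized_unicode(text: str) -> str:
    def repl(match: re.Match) -> str:
        word, next_char = match.group(1), match.group(2)
        if _is_lu(word[0]) and _is_lu(next_char):
            return word + "_"   # both words are capitalized: join them
        return match.group(0)   # otherwise leave the text untouched

    return _GAP.sub(repl, text)


print(join_capitalized_unicode("Émile Zola was a writer. Ólafur Arnalds is a composer."))
# Émile_Zola was a writer. Ólafur_Arnalds is a composer.
```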