Regex - Extract substrings starting with a capitalized letter into a list, with French special symbols

Question

I have a French string like this:

text = "Français Langues bantoues Presse écrite Gabon Particularité linguistique"

I want to extract the substrings that start with a capital letter into a list, like this:

list = ["Français", "Langues bantoues", "Presse écrite", "Gabon", "Particularité linguistique"]

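I did try something like the following, but it does not pick up the lowercase words that follow each capitalized word, and it stops at the French accented characters.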

import re
pattern = "([A-Z][a-z]+)"

text = "Français Langues bantoues Presse écrite Gabon Particularité linguistique"

list = re.findall(pattern, text)
list

Output: ['Fran', 'Langues', 'Presse', 'Gabon', 'Particularit']

Unfortunately, I could not find a solution on the forums.

Answer 1

Since this involves handling specific Unicode characters, I would recommend the PyPI regex module (install it with pip install regex). You can then use

import regex
text = "Français Langues bantoues Presse écrite Gabon Particularité linguistique"
matches = regex.split(r'(?!\A)\b(?=\p{Lu})', text)
print( list(map(lambda x: x.strip(), matches)) )
# => ['Français', 'Langues bantoues', 'Presse écrite', 'Gabon', 'Particularité linguistique']

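See the online Python demo and the regex demo. Details:

(?!\A) - a position other than the start of the string
\b - a word boundary
(?=\p{Lu}) - a positive lookahead that requires the next character to be a Unicode uppercase letter.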

Note that map(lambda x: x.strip(), matches) is used to strip the extra whitespace from the resulting chunks.
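
To see why the stripping step is needed, here is a minimal sketch of the same split that prints the raw pieces first. It uses the standard re module only so that it runs without third-party packages, and it assumes a narrow, hand-written class (UPPER, a hypothetical name) covering ASCII and Latin-1 uppercase letters in place of \p{Lu}. That is enough for this sample text but not for Unicode in general.

```python
import re

text = "Français Langues bantoues Presse écrite Gabon Particularité linguistique"

# Assumption: a hand-written stand-in for \p{Lu}, limited to ASCII and
# Latin-1 uppercase letters (the standard re module has no \p{Lu}).
UPPER = "[A-ZÀ-ÖØ-Þ]"

# Split at every word boundary that is followed by an uppercase letter,
# except at the very start of the string.
# Splitting on an empty match requires Python 3.7 or later.
raw_pieces = re.split(rf"(?!\A)\b(?={UPPER})", text)
print(raw_pieces)
# ['Français ', 'Langues bantoues ', 'Presse écrite ', 'Gabon ', 'Particularité linguistique']

# Each piece except the last keeps its trailing space, hence the strip.
clean_pieces = [piece.strip() for piece in raw_pieces]
print(clean_pieces)
# ['Français', 'Langues bantoues', 'Presse écrite', 'Gabon', 'Particularité linguistique']
```

The split happens at a zero-width position, so the space that precedes each capitalized word stays attached to the previous piece.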

You can also do this with re:

import re, sys
text = "Français Langues bantoues Presse écrite Gabon Particularité linguistique"
# Build a character class from every code point that str.isupper() reports as uppercase
pLu = '[{}]'.format("".join([chr(i) for i in range(sys.maxunicode + 1) if chr(i).isupper()]))
matches = re.split(fr'(?!\A)\b(?={pLu})', text)
print( list(map(lambda x: x.strip(), matches)) )
# => ['Français', 'Langues bantoues', 'Presse écrite', 'Gabon', 'Particularité linguistique']

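See this Python demo, but keep in mind that the set of supported Unicode uppercase letters varies between Python versions; using the PyPI regex module makes the behavior more consistent.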
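
If a regex is not strictly required, the same grouping can be sketched in plain Python. This is a different technique from the split shown above, not part of the original answer: it assumes the phrases are separated by whitespace and that every phrase begins with a word whose first character is uppercase. The function name group_by_capital is hypothetical.

```python
def group_by_capital(text):
    """Group whitespace-separated words into phrases, starting a new
    phrase at each word whose first character is uppercase."""
    phrases = []
    for word in text.split():
        # str.isupper() is Unicode-aware, so accented capitals such as 'É' count.
        if word[0].isupper() or not phrases:
            phrases.append(word)
        else:
            # Lowercase word: attach it to the phrase currently being built.
            phrases[-1] += " " + word
    return phrases


text = "Français Langues bantoues Presse écrite Gabon Particularité linguistique"
print(group_by_capital(text))
# ['Français', 'Langues bantoues', 'Presse écrite', 'Gabon', 'Particularité linguistique']
```

Unlike the regex split, this normalizes runs of whitespace to a single space, and like the re variant it depends on the Unicode data shipped with the Python version in use.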
