Split a string at uppercase letters, but only if a lowercase letter follows in Python
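I am using pdfminer.six in Python to extract long text data. Unfortunately, the miner does not always work very well, especially with paragraphs and text wrapping. For example, I got the following output: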
"2018Annual ReportInvesting for Growth and Market LeadershipOur CEO will provide you with all further details below."
--> "2018 Annual Report Investing for Growth and Market Leadership Our CEO will provide you with all further details below."
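Now I would like to insert a space whenever a lowercase letter (or a digit) is followed by an uppercase letter and then a lowercase letter. So in the end "2018Annual" becomes "2018 Annual" and "ReportInvesting" becomes "Report Investing", but "...CEO..." remains "...CEO...".
I only found solutions for "Split a string at uppercase letters", but could not rewrite them. Unfortunately, I am totally new to Python.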
We can try using re.sub here for a regex approach:
import re
inp = "2018Annual ReportInvesting for Growth and Market LeadershipOur CEO will provide you with all further details below."
inp = re.sub(r'(?<![A-Z\W])(?=[A-Z])', ' ', inp)
print(inp)
This prints:
2018 Annual Report Investing for Growth and Market Leadership Our CEO will provide you with all further details below.
The regex used here says to insert a space at any point where:
(?<![A-Z\W])  what precedes is a word character EXCEPT for capital letters
(?=[A-Z])     and what follows is a capital letter
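One caveat worth checking: because the lookbehind is negative, it also succeeds at the very start of the string, so text that begins with a capital letter gets a leading space. Here is a minimal sketch comparing the pattern above with a stricter variant. The names LOOSE, STRICT and add_spaces are made up for illustration, and the stricter pattern is my own assumption about what the question asks for, not part of the answer above.

```python
import re

# Pattern from the answer above: no capital letter or non-word character
# before the position, and a capital letter after it.
LOOSE = re.compile(r'(?<![A-Z\W])(?=[A-Z])')

# Assumed stricter variant: a lowercase letter or digit must precede the
# position, and an uppercase letter followed by a lowercase one must follow.
STRICT = re.compile(r'(?<=[a-z0-9])(?=[A-Z][a-z])')

def add_spaces(text, pattern=STRICT):
    """Insert a space at every zero-width position matched by `pattern`."""
    return pattern.sub(' ', text)

print(repr(add_spaces("Hello World", LOOSE)))    # ' Hello World'
print(repr(add_spaces("Hello World", STRICT)))   # 'Hello World'
print(add_spaces("2018Annual ReportInvesting"))  # 2018 Annual Report Investing
```

The stricter variant also leaves something like "theCEO" untouched, since the capital C is not followed by a lowercase letter. Whether that is desirable depends on the text being extracted.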
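Try splitting with a regex:
import re
# Put a space in front of every capitalized word (uppercase letter followed by lowercase letters)
temp = re.sub(r"([A-Z][a-z]+)", r" \1", string).split()
# split() followed by join() collapses any doubled spaces this creates
string = ' '.join(temp)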
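I believe the code below gives the required result.
import re
# Insert a space between a lowercase letter and the uppercase letter that follows it
temp = re.sub(r"([a-z])([A-Z])", r"\1 \2", text)
# Insert a space between a digit and the letter that follows it
temp = re.sub(r"(\d)([A-Za-z])", r"\1 \2", temp)
I still find complex regular expressions a bit challenging, hence the need to split the process into two expressions.
Perhaps someone better at regular expressions can improve on this and show how to achieve it in a more elegant way.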
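For what it's worth, the two substitutions can probably be folded into a single pass by turning the capture groups into lookarounds and joining the two cases with an alternation. This is only a sketch; GLUED and split_glued are hypothetical names, and I have only reasoned it through on inputs like the one in the question.

```python
import re

# Zero-width positions that are either
#   - after a lowercase letter and before an uppercase letter, or
#   - after a digit and before any ASCII letter
GLUED = re.compile(r'(?<=[a-z])(?=[A-Z])|(?<=\d)(?=[A-Za-z])')

def split_glued(text):
    """Insert a single space at each glued-together word boundary."""
    return GLUED.sub(' ', text)

print(split_glued("2018Annual ReportInvesting for Growth"))
# 2018 Annual Report Investing for Growth
```

Like the two-step version, this also splits strings such as "3rd" into "3 rd", so it may need tightening if the text contains ordinals or similar tokens.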