Split a string at uppercase letters, but only if a lowercase letter follows in Python

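I am using pdfminer.six in Python to extract long text data. Unfortunately, pdfminer does not always work well, especially with paragraphs and text wrapping. For example, I get the following output: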

"2018Annual ReportInvesting for Growth and Market LeadershipOur CEO will provide you with all further details below."

--> "2018 Annual Report Investing for Growth and Market Leadership Our CEO will provide you with all further details below."

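Now I would like to insert a space wherever a lowercase letter (or a digit) is followed by an uppercase letter and then a lowercase letter. In the end, "2018Annual" should become "2018 Annual" and "ReportInvesting" should become "Report Investing", but "...CEO..." should remain "...CEO...".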

I only found solutions such as "Split a string at uppercase letters", but I could not adapt them. Unfortunately, I am completely new to Python.

We can try a regex approach here using re.sub:

import re

inp = "2018Annual ReportInvesting for Growth and Market LeadershipOur CEO will provide you with all further details below."
inp = re.sub(r'(?<![A-Z\W])(?=[A-Z])', ' ', inp)
print(inp)

This prints:

2018 Annual Report Investing for Growth and Market Leadership Our CEO will provide you with all further details below.

The regex used here inserts a space at every position where:

(?<![A-Z\W])  what precedes is neither a capital letter nor a
              non-word character (e.g. a lowercase letter or digit)
(?=[A-Z])     and what follows is a capital letter
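
To see how the two lookarounds behave on individual cases, here is a minimal sketch that wraps the same pattern in a helper. The name `insert_spaces` is hypothetical, and the `strip()` call is an addition: because nothing precedes the first character, a string that starts with a capital letter would otherwise get a leading space.

```python
import re

# Same pattern as above: a position that is not preceded by a capital letter
# or a non-word character, and that is followed by a capital letter.
BOUNDARY = re.compile(r'(?<![A-Z\W])(?=[A-Z])')

def insert_spaces(text):
    # insert_spaces is a hypothetical helper name, not part of any library.
    # strip() removes the leading space produced when text starts with a capital
    # (it also removes any other leading or trailing whitespace).
    return BOUNDARY.sub(' ', text).strip()

print(insert_spaces("2018Annual"))       # 2018 Annual
print(insert_spaces("ReportInvesting"))  # Report Investing
print(insert_spaces("Our CEO"))          # Our CEO
```

Without `strip()`, the second call would return " Report Investing" with a leading space.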

Try splitting with a regular expression:

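import re
# Put a space before every capitalized word (\1), then split on whitespace.
temp = re.sub(r"([A-Z][a-z]+)", r" \1", string).split()

# Re-join with single spaces, which also collapses any doubled spaces.
string = ' '.join(temp)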
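
A runnable sketch of this approach on the sample from the question, assuming the replacement is meant to re-insert each matched word after a space (`r" \1"`). The helper name `split_capitalized` is hypothetical.

```python
import re

def split_capitalized(text):
    # split_capitalized is a hypothetical helper name used for illustration.
    # Prefix every "capital + one or more lowercase letters" run with a space.
    spaced = re.sub(r"([A-Z][a-z]+)", r" \1", text)
    # split() + join() collapses the doubled spaces this creates.
    return " ".join(spaced.split())

sample = ("2018Annual ReportInvesting for Growth and Market "
          "LeadershipOur CEO will provide you with all further details below.")
print(split_capitalized(sample))
print(split_capitalized("Two CEOs"))  # the acronym is split: Two CE Os
```

The first call prints the desired sentence. The second shows a limitation: an acronym followed directly by a lowercase letter, as in "CEOs", is split because "Os" matches the pattern.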

I believe the code below gives the desired result.

import re
temp = re.sub(r"([a-z])([A-Z])", r"\1 \2", text)  # lowercase followed by uppercase
temp = re.sub(r"(\d)([A-Za-z])", r"\1 \2", temp)  # digit followed by a letter

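I still find complex regular expressions a bit challenging, hence the need to split the process into two expressions. Perhaps someone better at regular expressions can improve on this and show how to achieve it in a more elegant way.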
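
One possible way to fold the two substitutions into a single pattern is to use lookarounds joined by an alternation, so that the match is zero-width and nothing needs to be re-inserted. This is only a sketch under the assumption that the two rules are "lowercase then uppercase" and "digit then letter"; the names `GAP` and `add_missing_spaces` are hypothetical.

```python
import re

# Either: a lowercase letter behind and an uppercase letter ahead,
# or: a digit behind and any ASCII letter ahead.
GAP = re.compile(r"(?<=[a-z])(?=[A-Z])|(?<=\d)(?=[A-Za-z])")

def add_missing_spaces(text):
    # add_missing_spaces is a hypothetical helper name.
    # The pattern matches empty positions, so sub() only inserts a space.
    return GAP.sub(" ", text)

sample = ("2018Annual ReportInvesting for Growth and Market "
          "LeadershipOur CEO will provide you with all further details below.")
print(add_missing_spaces(sample))
```

On the sample from the question this prints the desired sentence, with "CEO" left intact because no lowercase letter or digit directly precedes any of its capitals.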