Split a string at uppercase letters, but only if a lowercase letter follows in Python

Question

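I am using pdfminer.six in Python to extract long text data. Unfortunately, the miner does not always work well, especially with paragraphs and line wrapping. For example, I get the following output: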

"2018Annual ReportInvesting for Growth and Market LeadershipOur CEO will provide you with all further details below."

--> "2018 Annual Report Investing for Growth and Market Leadership Our CEO will provide you with all further details below."

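Now I want to insert a space wherever a lowercase letter (or a digit) is followed by an uppercase letter and then a lowercase letter. So "2018Annual" becomes "2018 Annual" and "ReportInvesting" becomes "Report Investing", but "...CEO..." stays "...CEO...".

I only found solutions such as "Split a string at uppercase letters", but could not adapt them. Unfortunately, I am completely new to Python.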

Answer 1

We can try a regex approach using re.sub here:

import re

inp = "2018Annual ReportInvesting for Growth and Market LeadershipOur CEO will provide you with all further details below."
inp = re.sub(r'(?<![A-Z\W])(?=[A-Z])', ' ', inp)
print(inp)

This prints:

2018 Annual Report Investing for Growth and Market Leadership Our CEO will provide you with all further details below.

The regex used here says to insert a space at any point where:

(?<![A-Z\W])  what precedes is a word character EXCEPT
              for capital letters
(?=[A-Z])     and what follows is a capital letter
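The question asks for a slightly stricter rule than the pattern above: a space only where a lowercase letter or digit is followed by an uppercase letter that is itself followed by a lowercase letter. A minimal sketch of that rule follows. The pattern and the helper name `insert_spaces_strict` are illustrative assumptions, not part of the answer. One difference worth noting: because a negative lookbehind also succeeds at the very start of a string, the answer's pattern puts a leading space in front of a string that begins with a capital letter, while the positive lookbehind used here does not.

```python
import re

# Illustrative variant (assumption): require a lowercase letter or digit before
# the boundary, and a capital followed by a lowercase letter after it.
STRICT_BOUNDARY = re.compile(r"(?<=[a-z0-9])(?=[A-Z][a-z])")


def insert_spaces_strict(text: str) -> str:
    """Insert a space between a lowercase letter or digit and a capitalised word."""
    return STRICT_BOUNDARY.sub(" ", text)


sample = (
    "2018Annual ReportInvesting for Growth and Market "
    "LeadershipOur CEO will provide you with all further details below."
)
print(insert_spaces_strict(sample))

# The answer's pattern, for comparison on a string that starts with a capital.
answer_pattern = re.compile(r"(?<![A-Z\W])(?=[A-Z])")
print(repr(answer_pattern.sub(" ", "Annual")))
print(repr(insert_spaces_strict("Annual")))
```

Both patterns leave "CEO" untouched in the sample, since the capital there is preceded by a space.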

Answer 2

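Try splitting with a regex:

import re

string = "2018Annual ReportInvesting for Growth and Market LeadershipOur CEO will provide you with all further details below."
# Put a space before every capitalised word, then split on whitespace
temp = re.sub(r"([A-Z][a-z]+)", r" \1", string).split()

# Re-join with single spaces to drop the doubled and leading spaces
string = ' '.join(temp)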
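To see why the split and join step matters, here is a small sketch that assumes the replacement string is meant to be `r" \1"` (a space followed by the captured word). The function name `split_capitalised_words` is hypothetical. The substitution alone leaves a leading space and doubled spaces, which the split and join then normalise. The last line shows a limitation of this pattern: it also splits inside an acronym that ends in lowercase letters.

```python
import re


def split_capitalised_words(text: str) -> str:
    """Prefix each capitalised word with a space, then collapse the whitespace."""
    spaced = re.sub(r"([A-Z][a-z]+)", r" \1", text)
    return " ".join(spaced.split())


# The raw substitution result, before whitespace is normalised.
raw = re.sub(r"([A-Z][a-z]+)", r" \1", "Annual ReportInvesting")
print(repr(raw))

print(split_capitalised_words("Annual ReportInvesting"))

# Limitation: "Os" matches [A-Z][a-z]+, so the acronym is split.
print(split_capitalised_words("CEOs"))
```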

Answer 3

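I believe the code below gives the desired result.

import re
# text is the extracted string; each replacement keeps both captured characters around a space
temp = re.sub(r"([a-z])([A-Z])", r"\1 \2", text)
temp = re.sub(r"(\d)([A-Za-z])", r"\1 \2", temp)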

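I still find complex regexes a bit challenging, hence the need to split the process into two expressions. Perhaps someone more skilled with regexes can improve on this and show how to do it more elegantly.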
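One possible way to fold the two substitutions into a single expression is to use lookarounds, so that nothing is consumed and no backreferences are needed. This is a sketch, assuming the two passes are meant to put a space between a lowercase letter and an uppercase letter, and between a digit and any ASCII letter. The names `BOUNDARY` and `add_spaces` are hypothetical.

```python
import re

# Two zero-width alternatives, one per original substitution:
#   lowercase letter, then uppercase letter
#   digit, then any ASCII letter
BOUNDARY = re.compile(r"(?<=[a-z])(?=[A-Z])|(?<=\d)(?=[A-Za-z])")


def add_spaces(text: str) -> str:
    """Insert a space at every position matched by BOUNDARY."""
    return BOUNDARY.sub(" ", text)


print(add_spaces("2018Annual ReportInvesting for Growth"))
print(add_spaces("Our CEO will provide"))

# Like the two-pass version, this also splits a digit from a lowercase letter.
print(add_spaces("3rd quarter"))
```

Because the matches are zero-width, the replacement is just a single space, which avoids the need to re-insert captured characters.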
