How to separate sentences in a dataframe based on the last occurrence of a lowercase letter followed by a capital one

Question

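I have a dataframe containing sentences. The first sentence (a title) is followed by the body text. The two are merged without a space between them.

I want to split the text into two parts (sentence 1 and sentence 2) at the last occurrence of a lowercase letter immediately followed by a capital letter, with no space in between. (Out of curiosity, I would also be interested in a solution based on the first occurrence.)

The result should be stored in the original dataframe.

I tried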

re.findall('(?<!\s)[A-ZÄÖÜ](?:[a-zäöüß\s]|(?<=\s)[A-ZÄÖÜ])*')

but could not get it to work.

import pandas
from pandas import DataFrame

Sentences = {'Sentence': ['RnB music all nightI love going out','Example sentence with no meaningThe space is missing.','Third exampleAlso numbers 1.23 and signs -. should appear in column 2.', 'BestMusic tonightAt 12:00.']}

df = DataFrame(Sentences,columns= ['Sentence'])

print(df)

Because the split should happen at the last occurrence, RnB and BestMusic in the examples must not trigger a split. Expected result:

df['Sentence1'] = ['RnB music all night','Example sentence with no meaning','Third example', 'BestMusic tonight']

df['Sentence2'] = ['I love going out','The space is missing.', 'Also numbers 1.23 and signs -. should appear in column 2.' ,'At 12:00.']

Answer 1

Here is one way:

Yourdf=df.Sentence.str.split(r'(.*[a-z])(?=[A-Z])',n=-1,expand=True)[[1,2]]
Yourdf
Out[610]: 
                                  1                                                  2
0               RnB music all night                                   I love going out
1  Example sentence with no meaning                              The space is missing.
2                     Third example  Also numbers 1.23 and signs -. should appear i...
3                 BestMusic tonight                                          At 12:00.
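
To see why this pattern lands on the last boundary, and what changes for the first occurrence the question also asks about, here is a minimal sketch using plain `re` on a single string. The lazy `.*?` variant is my own addition, not part of the answer above; the variable names are just for illustration.

```python
import re

text = 'BestMusic tonightAt 12:00.'

# Greedy `.*` grabs as much as possible and backtracks from the end, so the
# captured group ends at the LAST lowercase letter directly followed by an
# uppercase letter. The capture group keeps the matched text in the result.
last = re.split(r'(.*[a-z])(?=[A-Z])', text, maxsplit=1)
print(last)

# Lazy `.*?` (an assumption for the "first occurrence" case) stops as early
# as possible, so the split happens at the FIRST such boundary instead.
first = re.split(r'(.*?[a-z])(?=[A-Z])', text, maxsplit=1)
print(first)
```

In both cases the first list element is an empty string, because the match starts at the beginning of the text. That is why the answer selects columns 1 and 2 and drops column 0.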

Answer 2

This only works if all capital letters fall within A-Z:

pattern = r'(?P<Sentence1>.*)(?P<Sentence2>[A-Z].*)$'
df['Sentence'].str.extract(pattern)

which gives:

    Sentence1                           Sentence2
0   RnB music all night                 I love going out
1   Example sentence with no meaning    The space is missing.
2   Third example                       Also numbers 1.23 and signs -. should appear i...
3   BestMusic tonight                   At 12:00.
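
Note that the pattern above splits at the last capital letter regardless of what precedes it, so a capital letter later in the body text (a proper noun, say) would move the split. Below is a hedged sketch of a variant that requires a lowercase letter right before the capital and writes the result back to the original dataframe, as the question asks. The character classes with `äöüß` and `ÄÖÜ` are an assumption taken from the asker's own attempt, and using `join` to attach the columns is my choice, not something either answer shows.

```python
import pandas as pd

df = pd.DataFrame({'Sentence': [
    'RnB music all nightI love going out',
    'Example sentence with no meaningThe space is missing.',
    'Third exampleAlso numbers 1.23 and signs -. should appear in column 2.',
    'BestMusic tonightAt 12:00.',
]})

# Greedy `.*` pushes the boundary to the LAST lowercase letter that is
# directly followed by an uppercase letter.
# Assumption: the umlauts and ß come from the asker's attempt; drop them
# if the data is plain ASCII.
pattern = r'^(?P<Sentence1>.*[a-zäöüß])(?P<Sentence2>[A-ZÄÖÜ].*)$'

# The named groups become column names; join aligns them on the index.
# Rows that do not match the pattern get NaN in both new columns.
df = df.join(df['Sentence'].str.extract(pattern))

print(df[['Sentence1', 'Sentence2']])
```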

python

text

dataframe

sentiment-analysis

pandas