How to separate sentences in a dataframe based on the last occurrence of a lowercase letter followed by an uppercase one
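I have a dataframe containing sentences. The first sentence (a title) is followed by the body text. The two are merged without a space.
I would like to split the text into two parts (sentence 1 and sentence 2) at the last occurrence of a lowercase letter followed directly by an uppercase letter, with no space in between. (Out of curiosity, I would also be interested in a solution based on the first occurrence.)
The solution should be stored in the original dataframe.
I tried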
re.findall('(?<!\s)[A-ZÄÖÜ](?:[a-zäöüß\s]|(?<=\s)[A-ZÄÖÜ])*')
but could not get it to work.
import pandas
from pandas import DataFrame
Sentences = {'Sentence': ['RnB music all nightI love going out','Example sentence with no meaningThe space is missing.','Third exampleAlso numbers 1.23 and signs -. should appear in column 2.', 'BestMusic tonightAt 12:00.']}
df = DataFrame(Sentences,columns= ['Sentence'])
print(df)
Because the split should happen at the last occurrence, RnB and BestMusic in the example should not trigger a split.
# Desired result (bracket notation is needed to create new columns)
df['Sentence1'] = ['RnB music all night','Example sentence with no meaning','Third example', 'BestMusic tonight']
df['Sentence2'] = ['I love going out','The space is missing.', 'Also numbers 1.23 and signs -. should appear in column 2.' ,'At 12:00.']
Here is one way:
Yourdf=df.Sentence.str.split(r'(.*[a-z])(?=[A-Z])',n=-1,expand=True)[[1,2]]
Yourdf
Out[610]:
1 2
0 RnB music all night I love going out
1 Example sentence with no meaning The space is missing.
2 Third example Also numbers 1.23 and signs -. should appear i...
3 BestMusic tonight At 12:00.
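The question also asks for the result to be stored in the original dataframe. A minimal sketch of that, assuming the same greedy idea but using str.extract with named groups instead of str.split (the intermediate name parts is only illustrative):

```python
import pandas as pd

df = pd.DataFrame({'Sentence': [
    'RnB music all nightI love going out',
    'Example sentence with no meaningThe space is missing.',
    'Third exampleAlso numbers 1.23 and signs -. should appear in column 2.',
    'BestMusic tonightAt 12:00.',
]})

# The greedy '.*' pushes the first group as far right as possible, so it ends
# at the LAST lowercase letter that is directly followed by an uppercase one.
pattern = r'^(?P<Sentence1>.*[a-z])(?P<Sentence2>[A-Z].*)$'

parts = df['Sentence'].str.extract(pattern)

# Write both halves back into the original dataframe as new columns.
df['Sentence1'] = parts['Sentence1']
df['Sentence2'] = parts['Sentence2']

print(df[['Sentence1', 'Sentence2']])
```

Rows without any lowercase-to-uppercase boundary would get NaN in both new columns, since str.extract returns NaN where the pattern does not match.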
This only works if A-Z covers every capital letter that occurs in the text:
pattern = r'(?P<Sentence1>.*)(?P<Sentence2>[A-Z].*)$'
df['Sentence'].str.extract(pattern)
which gives:
Sentence1 Sentence2
0 RnB music all night I love going out
1 Example sentence with no meaning The space is missing.
2 Third example Also numbers 1.23 and signs -. should appear i...
3 BestMusic tonight At 12:00.
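For the first-occurrence variant mentioned in the question, and for text containing the German letters from the attempted regex, one possible sketch is to locate every lowercase-to-uppercase boundary explicitly and pick the first or last one. The helper split_at_boundary, the column names First1 and First2, and the character sets are assumptions for illustration, not part of the answers above:

```python
import re
import pandas as pd

# Assumed character sets, taken from the regex attempted in the question.
LOWER = 'a-zäöüß'
UPPER = 'A-ZÄÖÜ'

# Zero-width match: just after a lowercase letter and just before an uppercase one.
boundary = re.compile(f'(?<=[{LOWER}])(?=[{UPPER}])')

def split_at_boundary(text, which='last'):
    """Split text at the first or last lowercase/uppercase boundary."""
    positions = [m.start() for m in boundary.finditer(text)]
    if not positions:
        return text, ''  # no boundary found: keep everything in the first part
    cut = positions[-1] if which == 'last' else positions[0]
    return text[:cut], text[cut:]

df = pd.DataFrame({'Sentence': [
    'RnB music all nightI love going out',
    'BestMusic tonightAt 12:00.',
]})

# Split at the FIRST occurrence and store both halves in the dataframe.
first = [split_at_boundary(s, 'first') for s in df['Sentence']]
df['First1'] = [p[0] for p in first]
df['First2'] = [p[1] for p in first]

print(df[['First1', 'First2']])
```

With the first occurrence, RnB and BestMusic do trigger the split, which is why the question prefers the last occurrence for this data.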