pandas: split on dot unless there is a number or a character before dot

Question

I have a data frame as follows:

import pandas as pd
df = pd.DataFrame({'text':['she is a. good 15. year old girl. she goes to school on time.', 'she is not an A. level student. This needs to be discussed.']})

To split on the dot (.) and explode, I did the following:

df = df.assign(text=df['text'].str.split('.')).explode('text')

But I don't want to split after every dot. I want to split on a dot unless the dot is surrounded by numbers (e.g. 22., 3.4) or by single characters (e.g. a., a.b., b.d).

desired_output:

   text
'she is a. good 15. year old girl'
'she goes to school on time'
'she is not an A. level student'
'This needs to be discussed.'

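So I also tried the following pattern, hoping it would ignore single characters and numbers, but it removes the last letter from the last word of each sentence.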

df.assign(text=df['text'].str.split(r'(?:(?<!\.|\s)[a-z]\.|(?<!\.|\s)[A-Z]\.)+')).explode('text')
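A likely explanation for the missing letter, shown as a minimal sketch with only the standard `re` module: a split discards everything the pattern matches, and this pattern matches the letter in front of the dot together with the dot itself. The `sample` string is an assumption chosen for illustration.

```python
import re

# The pattern from the question: a letter followed by a dot, where the letter
# is NOT preceded by a dot or whitespace, i.e. the last letter of a longer word.
pattern = r'(?:(?<!\.|\s)[a-z]\.|(?<!\.|\s)[A-Z]\.)+'

sample = 'year old girl. she goes'

# What the pattern actually matches: the final letter plus the dot.
matched = [m.group(0) for m in re.finditer(pattern, sample)]

# A split removes the whole match, so the final letter disappears too.
parts = re.split(pattern, sample)

print(matched)  # ['l.']
print(parts)    # ['year old gir', ' she goes']
```

Only characters inside a lookbehind or lookahead are left in place by a split, so the letter check would have to move into a lookaround for the word to stay intact.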

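I edited the pattern so that it now matches every kind of dot that comes after a number or a single letter: r'(?:(?<=.|\s)[[a-zA-Z]]\.|(?<=.|\s)\d+)+'. So I think I only need to figure out how to split on dots except where this last pattern matches.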

Answer 1

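#!/usr/bin/python3

import re

# Named `source` rather than `input` so the built-in function is not shadowed.
source = 'she is a. good 15. year old girl. she goes to school on time. she is not an A. level student. This needs to be discussed.'

# Split on every dot first, then glue pieces back together where needed.
sentences = re.split(r'\.', source)

output = []
text = ''
for v in sentences:
    text = text + v

    # The piece ends in a single letter or a number, so the dot that followed
    # it was not a sentence end: restore the dot and keep accumulating.
    if re.search(r'(^|\s)([a-z]|[0-9]+)$', v, re.IGNORECASE):
        text = text + "."
    else:
        text = text.strip()
        if text != '':
            output.append(text)
        text = ''

print(output)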

Output:

['she is a. good 15. year old girl', 'she goes to school on time', 'she is not an A. level student', 'This needs to be discussed']
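The same rule can probably also be written as a single split pattern by moving the "single letter" and "digit" checks into negative lookbehinds, so that only the dot and any whitespace after it are consumed. This is a sketch rather than part of the answer above; the names `DOT` and `split_sentences` are hypothetical.

```python
import re

# Split on a dot (plus any following whitespace) unless the dot is preceded by
#   - a single-letter word such as "a." or "A."  -> (?<!\b[A-Za-z])
#   - a digit, as in "15." or "3.4"              -> (?<!\d)
# Both lookbehinds are exactly one character wide, as Python's re requires.
DOT = re.compile(r'(?<!\b[A-Za-z])(?<!\d)\.\s*')

def split_sentences(text):
    """Split text on sentence-ending dots and drop empty pieces."""
    return [part for part in DOT.split(text) if part]

rows = [
    'she is a. good 15. year old girl. she goes to school on time.',
    'she is not an A. level student. This needs to be discussed.',
]
result = [split_sentences(row) for row in rows]
print(result)
```

With the data frame from the question, something like `df.assign(text=df['text'].map(split_sentences)).explode('text')` should then give one sentence per row. As with the loop above, a sentence that really does end in a number or a single-letter word is not split.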
