pandas: split on a dot unless a number or a single character precedes the dot
I have a dataframe as follows:
import pandas as pd
df = pd.DataFrame({'text':['she is a. good 15. year old girl. she goes to school on time.', 'she is not an A. level student. This needs to be discussed.']})
To split on the dot (.) and explode, I did the following:
df = df.assign(text=df['text'].str.split('.')).explode('text')
But I don't want to split after every dot. I want to split on a dot unless the dot is surrounded by digits (e.g. 22., 3.4) or by single characters (e.g. a., a.b., b.d).
desired_output:
text
'she is a. good 15. year old girl'
'she goes to school on time'
'she is not an A. level student'
'This needs to be discussed.'
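So I also tried the following pattern, hoping it would ignore single characters and numbers, but it removes the last letter from the final word of each sentence.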
df.assign(text=df['text'].str.split(r'(?:(?<!\.|\s)[a-z]\.|(?<!\.|\s)[A-Z]\.)+')).explode('text')
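A likely reason for the missing letter is that the pattern matches the letter together with the dot, and a split discards everything the pattern matches. The sketch below uses plain `re` on a shortened sample string (the variable names are illustrative) to contrast the attempted pattern with a lookbehind that only inspects the letter. The second pattern is not a solution to the question; it only shows that a lookbehind leaves the letter in place.

```python
import re

sample = 'she goes to school on time. This needs to be discussed.'

# Attempted pattern: the letter before the dot is part of the match,
# so re.split drops it along with the dot.
consuming = r'(?:(?<!\.|\s)[a-z]\.|(?<!\.|\s)[A-Z]\.)+'
consumed = re.split(consuming, sample)
print(consumed)

# Lookbehind variant: the letter is only checked, never matched,
# so only the dot itself is removed.
non_consuming = r'(?<=[a-z])\.'
kept = re.split(non_consuming, sample)
print(kept)
```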
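I edited the pattern, so now it matches every dot that follows a number or a single letter: r'(?:(?<=.|\s)[[a-zA-Z]]\.|(?<=.|\s)\d+)+'
So I think I only need to figure out how to split on a dot except where that last pattern matches.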
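#!/usr/bin/python3
# -*- coding: utf-8 -*-
import re

input_text = 'she is a. good 15. year old girl. she goes to school on time. she is not an A. level student. This needs to be discussed.'

# Split on every dot first, then glue back the pieces that should not have been split.
sentences = re.split(r'\.', input_text)
output = []
text = ''
for v in sentences:
    text = text + v
    # The piece ends with a standalone single letter or a number:
    # restore the dot and keep accumulating.
    if re.search(r'(^|\s)([a-z]{1}|[0-9]+)$', v, re.IGNORECASE):
        text = text + "."
    else:
        # Otherwise the dot ended a sentence: store it and start a new one.
        text = text.strip()
        if text != '':
            output.append(text)
        text = ''
print(output)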
Output:
['she is a. good 15. year old girl', 'she goes to school on time', 'she is not an A. level student', 'This needs to be discussed']
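The same rule can probably also be written as a single split pattern using fixed-width negative lookbehinds, which avoids the split-and-rejoin loop. This is a minimal sketch, not a general sentence splitter: it assumes a dot should be kept whenever it directly follows a standalone single letter or any digit, and the names `SENTENCE_DOT` and `split_sentences` are made up for illustration.

```python
import re

# Split on a dot (and any spaces after it) unless the dot directly follows
# a standalone single letter (e.g. "a.", "A.") or a digit (e.g. "15.", "3.4").
SENTENCE_DOT = re.compile(r'(?<!\b[A-Za-z])(?<!\d)\.\s*')

def split_sentences(text):
    """Return the non-empty pieces of text split on sentence-ending dots."""
    return [part for part in SENTENCE_DOT.split(text) if part]

rows = [
    'she is a. good 15. year old girl. she goes to school on time.',
    'she is not an A. level student. This needs to be discussed.',
]
result = [split_sentences(row) for row in rows]
print(result)
```

With the dataframe from the question, the function could be applied per row and then exploded, for example `df.assign(text=df['text'].map(split_sentences)).explode('text')`.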