How to pull data from a text file into sentences defined as the rows of data between blank rows?

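The data is in a text file, and I want to group it into sentences. A sentence is defined as a run of consecutive rows that each contain at least one character. The runs of data rows are separated by blank rows, so I want the blank rows to mark where a sentence starts and ends. Is there a way to do this with a list comprehension?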

Example from the text file. The data looks like this:

This is the
first sentence.

This is a really long sentence
and it just keeps going across many
rows there will not necessarily be 
punctuation
or consistency in word length
the only difference in ending sentence
is the next row will be blank

here would be the third sentence
as 
you see
the blanks between rows of data 
help define what a sentence is

this would be sentence 4
i want to pull data
from text file
as such (in sentences) 
where sentences are defined with
blank records in between

this would be sentence 5 since blank row above it
and continues but ends because blank row(s) below it

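You can read the whole file into a single string with file_as_string = file_object.read(). Splitting on a blank line is the same as splitting on two consecutive newline characters, so sentences = file_as_string.split("\n\n") does the job, provided the blank lines are truly empty (no spaces or tabs). Finally, you probably want to get rid of the newlines that remain inside each sentence. A list comprehension handles that by replacing each newline with a space, so the words on adjacent lines do not run together: sentences = [s.replace('\n', ' ') for s in sentences]

Altogether:

# file_object is assumed to be an already opened text file
file_as_string = file_object.read()
# a blank line shows up as two consecutive newline characters
sentences = file_as_string.split("\n\n")
# join the lines of each sentence with a space so words do not run together
sentences = [s.replace('\n', ' ') for s in sentences]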
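As a variation on the above (not part of the original answer), the sketch below groups rows with itertools.groupby inside a list comprehension, so a "blank" row that contains only spaces or tabs still ends a sentence. The function name group_sentences and the sample text are made up for illustration.

```python
from itertools import groupby


def group_sentences(lines):
    """Join each run of non-blank lines into one sentence string."""
    # groupby yields consecutive runs of lines that share the same key;
    # the key is True for lines with visible text and False for blank ones.
    return [
        " ".join(line.strip() for line in run)
        for has_text, run in groupby(lines, key=lambda line: bool(line.strip()))
        if has_text
    ]


# Hypothetical sample: the second "blank" row holds a space and a tab.
sample = "This is the\nfirst sentence.\n\nas \nyou see\n \t\nlast one\n"
print(group_sentences(sample.splitlines()))
```

Because iterating over an open text file yields its lines, the same function should also accept the file object directly, for example group_sentences(file_object).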

A regular-expression split works very well for this.

To split on blank lines, including lines that contain only spaces or tabs, use this pattern (in multiline mode):

^[ \t]*$

In Python, you can do:

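import re

# fn is assumed to hold the path to the text file
with open(fn) as f_in:
    parts = re.split(r'\r?\n^[ \t]*$', f_in.read(), flags=re.M)
# each piece after the first keeps the blank line's own newline in front;
# strip it and drop empty pieces
sentences = [p.strip() for p in parts if p.strip()]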

If you also want to replace the individual \n characters inside each sentence with a space:

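with open(fn) as f_in:
    # collapse each run of line breaks (and any spaces or tabs before it) into one space
    sentences = [re.sub(r'[ \t]*(?:\r?\n)+', ' ', s.strip())
                 for s in re.split(r'\r?\n^[ \t]*$', f_in.read(), flags=re.M)
                 if s.strip()]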
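To try the regex approach without a file, here is a small self-contained sketch that runs the same split pattern on an in-memory string. The sample text is made up. As an added step, it strips each piece and skips empty ones, because the split leaves the blank line's own newline at the start of the following piece.

```python
import re

# Made-up sample; the second "blank" line holds a space and a tab.
text = "This is the\nfirst sentence.\n\nas \nyou see\n \t\nlast one\n"

# Split at a line break that is followed by a line which is empty
# or contains only spaces and tabs (re.M makes ^ and $ work per line).
parts = re.split(r'\r?\n^[ \t]*$', text, flags=re.M)

# Strip each piece, skip empty ones, then turn the remaining line breaks
# (plus any spaces or tabs right before them) into single spaces.
sentences = [
    re.sub(r'[ \t]*(?:\r?\n)+', ' ', p.strip())
    for p in parts
    if p.strip()
]
print(sentences)
```

Replacing text with the result of f_in.read() should give the same behavior on a real file.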