How can I format this file as in the standard?
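I have a huge file (a corpus) that contains words and their POS tags, but it also contains some irrelevant information that I want to remove. The irrelevant information consists only of a certain number of characters, and a single space separates the word, the irrelevant information, and the POS tag. Specifically, each word of a sentence is on its own line, and sentences are separated by two newlines (that is, a blank line). The file has the following format: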
My RRT PRP
Name DFEE NN
is PAAT VBZ
Selub KP NNP
. JUM .

Sentence_2
I want to save the information in this file as an array of sentences, where each sentence is an array of words, like this:
[[('My', 'PRP'), ('name', 'NN'), ('is', 'VBZ'), ('Selub.', 'NNP'), ('.', '.')], ...]
As a Python beginner, I would appreciate any help.
I split your sentence into two parts so that we can see the split in the output:
My RRT PRP
Name DFEE NN

is PAAT VBZ
Selub KP NNP
. JUM .
We can use a generator that yields one list per sentence to split the file up:
def splitter(lines):
    sentence = []
    for line in lines:
        if not line.strip():  # empty line
            if not sentence:  # blanks before sentences
                continue
            else:  # about to start new sentence
                yield sentence
                sentence = []
        else:
            word, _, tag = line.split()  # Split the line into its three fields
            sentence.append((word, tag))  # Add to current sentence
    if sentence:  # Yield the last sentence, unless the file ended on a blank line
        yield sentence

with open('infile.txt') as f:
    list_of_sentences = list(splitter(f))  # consume the generator into a list

print(list_of_sentences)
# [[('My', 'PRP'), ('Name', 'NN')], [('is', 'VBZ'), ('Selub', 'NNP'), ('.', '.')]]
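One caveat: word, _, tag = line.split() raises a ValueError on any line that does not have exactly three space-separated fields. If that could happen in your corpus, a variant along these lines might be more forgiving. It is only a sketch: the name read_tagged_sentences and the inline sample are made up for illustration, and it assumes the word is always the first field and the tag always the last. It uses itertools.groupby to batch consecutive non-blank lines into sentences.

```python
from itertools import groupby


def read_tagged_sentences(lines):
    """Group consecutive non-blank lines into sentences of (word, tag) pairs."""
    sentences = []
    # groupby batches consecutive lines by whether they are blank or not
    for is_blank, group in groupby(lines, key=lambda line: not line.strip()):
        if is_blank:
            continue  # blank runs only separate sentences
        sentence = []
        for line in group:
            fields = line.split()
            # Assumption: first field is the word, last field is the tag;
            # everything in between is dropped.
            sentence.append((fields[0], fields[-1]))
        sentences.append(sentence)
    return sentences


# Hypothetical inline sample standing in for the real file
sample = """My RRT PRP
Name DFEE NN

is PAAT VBZ
Selub KP NNP
. JUM .
"""

result = read_tagged_sentences(sample.splitlines())
print(result)
# [[('My', 'PRP'), ('Name', 'NN')], [('is', 'VBZ'), ('Selub', 'NNP'), ('.', '.')]]
```

For the real corpus you can pass the open file object instead of sample.splitlines(), since iterating over a file also yields one line at a time. Unlike the generator, this version builds the whole list in memory, which is what list(splitter(f)) ends up doing anyway.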