How can I format this file as in the standard?

Question

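I have a huge file (a corpus) that contains words and their POS tags, but also some irrelevant information that I want to remove. The irrelevant information is just a short run of characters. A single space separates the word, the irrelevant information, and the POS tag. Each word of a sentence sits on its own line, and sentences are separated by two newlines (that is, a blank line). The file has the following format: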

My RRT PRP
Name DFEE NN
is  PAAT VBZ
Selub KP NNP
. JUM .   

Sentence_2

I want to store the information from this file as an array of sentences, where each sentence is an array of words, like this:

[[('My', 'PRP'), ('Name', 'NN'), ('is', 'VBZ'), ('Selub', 'NNP'), ('.', '.')], ...]

I am a beginner in Python, so any help would be appreciated.

Answer 1

I split your sentence into two parts so that the sentence split is visible in the output:

My RRT PRP
Name DFEE NN

is  PAAT VBZ
Selub KP NNP
. JUM .

We can use a generator that yields one list per sentence to split the file:

def splitter(lines):
    sentence = []
    for line in lines:
        if not line.strip():  # empty line
            if not sentence:  # blanks before sentences
                continue
            else:  # about to start new sentence
                yield sentence
                sentence = []
        else:
            word, _, tag = line.split()  # Split the line
            sentence.append((word, tag))  # Add to current sentence
    if sentence:  # Yield the last sentence, unless the input ended on a blank line
        yield sentence

with open('infile.txt') as f:
    list_of_sentences = list(splitter(f))  # consume the generator into a list
    print(list_of_sentences)
    # [[('My', 'PRP'), ('Name', 'NN')], [('is', 'VBZ'), ('Selub', 'NNP'), ('.', '.')]]
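
The three-way unpacking above assumes every non-blank line has exactly three whitespace-separated fields, and it raises a ValueError otherwise. As a minimal sketch for a noisier corpus, the variant below keeps the first field as the word and the last field as the tag. The name split_sentences and the choice to skip single-field lines are assumptions made for illustration, not part of the original answer. It reads from an in-memory io.StringIO so it can be tried without a file.

```python
import io


def split_sentences(lines):
    """Yield one list of (word, tag) pairs per blank-line-separated sentence."""
    sentence = []
    for line in lines:
        fields = line.split()  # splits on any run of whitespace
        if not fields:  # blank line: sentence boundary
            if sentence:
                yield sentence
                sentence = []
        elif len(fields) >= 2:
            # Keep the first field (word) and the last field (tag); drop the rest.
            sentence.append((fields[0], fields[-1]))
        # Assumption: lines with a single field carry no tag and are skipped.
    if sentence:  # last sentence, if the input did not end on a blank line
        yield sentence


# Same sample as above, including the double space and trailing spaces.
sample = io.StringIO(
    "My RRT PRP\n"
    "Name DFEE NN\n"
    "\n"
    "is  PAAT VBZ\n"
    "Selub KP NNP\n"
    ". JUM .   \n"
    "\n"
)
result = list(split_sentences(sample))
print(result)
```

Because the function only iterates over lines, an open file object can be passed in place of the StringIO, and the corpus is still read one line at a time.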

python

text-processing

part-of-speech