Extract SVO triples from preprocessed text

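I need to extract subject-verb-object (SVO) triples from Dutch text. The text was analysed by a Dutch NLP tool called Frog, which tokenized, parsed, tagged, lemmatized,... it. Frog produces FoLiA XML, or tab-delimited column-formatted output, one line per token. Because of some problems with the XML file, I chose to work with the column format. This example represents one sentence. Now I need to extract the SVO triples for each sentence, and for that I need the last column, which holds the dependency relations. So I need to get the ROOT element and the su and obj1 elements that belong to ROOT. Unfortunately, the example sentence has no obj1. Let's pretend it does. My idea was to first create a nested list, with one list per sentence.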

    import csv
    with open('romanfragment_frogged.tsv','r') as f:
         reader = csv.reader(f,delimiter='\t')
         tokens = []
         sentences = []
         list_of_sents = []
         for line in reader:
             tokens.append(line)
             #print(tokens)
             for token in tokens:
                 if token == '1':
                    previous_sentence = list_of_sents
                    sentences.append(previous_sentence)
         list_of_sents = []
         list_of_sents.append(tokens)
         print(list_of_sents)

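When I print 'tokens', I get a single list containing all the tokens. So that part is correct, but I am still trying to create a nested list with one list (of tokens) per sentence. Can someone help me with this?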

(P.S. A second problem is that I am not sure how to continue once I have the nested list.)

Maybe something like this could work:

    import csv

    def iter_sentences(fn):
        with open(fn, 'r') as f:
            reader = csv.reader(f, delimiter='\t')
            sentence = []
            for row in reader:
                if not row:
                    # Ignore blank lines.
                    continue
                if row[0] == '1' and sentence:
                    # A new sentence started.
                    yield sentence
                    sentence = []
                sentence.append(row)
            # Last sentence.
            if sentence:
                yield sentence

    def iter_triples(fn):
        for sentence in iter_sentences(fn):
            # Get all subjects and objects.
            subjects = [tok for tok in sentence if tok[-1] == 'su']
            objects = [tok for tok in sentence if tok[-1] == 'obj1']
            # Now try to map them: find pairs with a head in the same position.
            for obj in objects:
                for subj in subjects:
                    # tok[-2] is the position of the head.
                    if subj[-2] == obj[-2]:
                        # Matching subj-obj pair found.
                        # Now get the verb (the head of both subj and obj).
                        # Its position is given in the second-to-last column.
                        position = int(subj[-2])
                        # Subtract 1, as the positions start counting at 1.
                        verb = sentence[position-1]
                        yield subj, verb, obj

    for subj, verb, obj in iter_triples('romanfragment_frogged.tsv'):
        # Only print the surface forms.
        print(subj[1], verb[1], obj[1])

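A quick explanation: iter_sentences iterates over the sentences. Each sentence is a nested list: it is a list of tokens, and each token is itself a list (containing the line number, surface form, lemma, POS, dependency relation, etc.). The iter_triples function iterates over the triples ‹subject, verb, object›. Each element of these triples represents a token (i.e. again a list).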
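To see the nested structure without needing the real file, here is a small self-contained sketch of the same idea that reads from an in-memory string. The sample rows are made up and use a simplified column layout (index, token, lemma, POS, head index, relation); real Frog output has more columns, and the only assumption carried over is that the last two columns hold the head position and the dependency relation. The helper names `split_sentences` and `triples` are hypothetical.

```python
import csv
import io

# Made-up, simplified rows: index, token, lemma, POS, head index, relation.
# Real Frog output has more columns; only the last two are relied on here.
SAMPLE = (
    "1\tJan\tJan\tSPEC\t2\tsu\n"
    "2\tleest\tlezen\tWW\t0\tROOT\n"
    "3\tboeken\tboek\tN\t2\tobj1\n"
    "\n"
    "1\tZe\tze\tVNW\t2\tsu\n"
    "2\tslaapt\tslapen\tWW\t0\tROOT\n"
)

def split_sentences(rows):
    """Group token rows into sentences; a token index of '1' starts a new one."""
    sentence = []
    for row in rows:
        if not row:
            continue  # skip blank separator lines
        if row[0] == '1' and sentence:
            yield sentence
            sentence = []
        sentence.append(row)
    if sentence:
        yield sentence

def triples(sentence):
    """Yield (subject, verb, object) rows where subject and object share a head."""
    subjects = [tok for tok in sentence if tok[-1] == 'su']
    objects = [tok for tok in sentence if tok[-1] == 'obj1']
    for obj in objects:
        for subj in subjects:
            if subj[-2] == obj[-2]:
                # Head positions count from 1, list indices from 0.
                yield subj, sentence[int(subj[-2]) - 1], obj

reader = csv.reader(io.StringIO(SAMPLE), delimiter='\t')
sentences = list(split_sentences(reader))
found = [(s[1], v[1], o[1]) for sent in sentences for s, v, o in triples(sent)]
print(len(sentences), found)
```

The second sample sentence has a subject but no obj1, so it contributes no triple; only sentences with both a su and an obj1 attached to the same head produce output.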

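The last three lines of code are just an example of how to use the iter_triples function. I don't know how much information you need from each triple...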
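If you only want the su and obj1 that hang directly off the ROOT token, and you would rather work with named fields than list indices, a variation along these lines might do. This is only a sketch: the `Token` fields and the `root_triples` helper are hypothetical names, the sample rows are made up, and it assumes the first three columns are position, surface form and lemma and the last two are head position and relation.

```python
from collections import namedtuple

# Hypothetical field names; adapt them to the columns you actually need.
Token = namedtuple('Token', 'index form lemma head rel')

def root_triples(sentence):
    """Yield (subj, verb, obj) Tokens whose shared head is the ROOT token."""
    tokens = [Token(r[0], r[1], r[2], r[-2], r[-1]) for r in sentence]
    for root in (t for t in tokens if t.rel == 'ROOT'):
        subjects = [t for t in tokens if t.rel == 'su' and t.head == root.index]
        objects = [t for t in tokens if t.rel == 'obj1' and t.head == root.index]
        for subj in subjects:
            for obj in objects:
                yield subj, root, obj

# Made-up rows with a reduced column set, for illustration only.
sentence = [
    ['1', 'Jan', 'Jan', '2', 'su'],
    ['2', 'leest', 'lezen', '0', 'ROOT'],
    ['3', 'boeken', 'boek', '2', 'obj1'],
]
result = [(s.lemma, v.lemma, o.lemma) for s, v, o in root_triples(sentence)]
print(result)
```

Compared with matching on equal head positions alone, this skips subject-object pairs that belong to an embedded clause rather than to the main verb, which may or may not be what you want.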