Extract SVO triples from preprocessed text
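I need to extract subject-verb-object triples from Dutch text. The text has been analysed by a Dutch NLP tool called Frog, which tokenized, parsed, tagged, lemmatized, ... it. Frog produces FoLiA XML, or tab-delimited column-formatted output, one line per token. Because of some problems with the XML file, I chose to work with the column format. This example represents one sentence. Now I need to extract an SVO triple for each sentence, so I need the last column, which holds the dependency relation. That means I need to get the ROOT element and the su and obj1 elements that belong to the ROOT. Unfortunately, the example sentence has no obj1. Let's pretend it does. My idea was to first create a nested list, with one list per sentence.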
    import csv

    with open('romanfragment_frogged.tsv','r') as f:
        reader = csv.reader(f,delimiter='\t')
        tokens = []
        sentences = []
        list_of_sents = []
        for line in reader:
            tokens.append(line)
        #print(tokens)
        for token in tokens:
            if token == '1':
                previous_sentence = list_of_sents
                sentences.append(previous_sentence)
                list_of_sents = []
            list_of_sents.append(tokens)
        print(list_of_sents)
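When I print 'tokens', I get a list containing all tokens. So that part is correct, but I am still trying to create a nested list, with one list (of tokens) per sentence.
Can someone help me with this problem?
(P.S. The second problem is that I am not sure how to continue once I have the nested list.)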
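A note on why the loop above never starts a new sentence: csv.reader yields each line as a list of strings, so token is a whole row and token == '1' is always false; the token number sits in the first column. In addition, list_of_sents.append(tokens) appends the complete token list rather than the current row. A small check on made-up rows (the three-column layout is a simplification for illustration, not Frog's real output):

```python
# Hypothetical rows, shaped like what csv.reader yields (one list per token).
rows = [
    ['1', 'Jan', 'su'],
    ['2', 'slaapt', 'ROOT'],
    ['1', 'Marie', 'su'],
    ['2', 'leest', 'ROOT'],
]

# A whole row is a list, so it never equals the string '1'.
whole_row_matches = [row == '1' for row in rows]

# Comparing the first column does detect the start of each sentence.
first_column_matches = [row[0] == '1' for row in rows]

print(whole_row_matches)     # [False, False, False, False]
print(first_column_matches)  # [True, False, True, False]
```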
Maybe something like this could work:
    import csv

    def iter_sentences(fn):
        with open(fn, 'r') as f:
            reader = csv.reader(f, delimiter='\t')
            sentence = []
            for row in reader:
                if not row:
                    # Ignore blank lines.
                    continue
                if row[0] == '1' and sentence:
                    # A new sentence started.
                    yield sentence
                    sentence = []
                sentence.append(row)
            # Last sentence.
            if sentence:
                yield sentence

    def iter_triples(fn):
        for sentence in iter_sentences(fn):
            # Get all subjects and objects.
            subjects = [tok for tok in sentence if tok[-1] == 'su']
            objects = [tok for tok in sentence if tok[-1] == 'obj1']
            # Now try to map them: find pairs with a head in the same position.
            for obj in objects:
                for subj in subjects:
                    # row[-2] is the position of the head.
                    if subj[-2] == obj[-2]:
                        # Matching subj-obj pair found.
                        # Now get the verb (the head of both subj and obj).
                        # Its position is given in the second-to-last column.
                        position = int(subj[-2])
                        # Subtract 1, as the positions start counting at 1.
                        verb = sentence[position-1]
                        yield subj, verb, obj

    for subj, verb, obj in iter_triples('romanfragment_frogged.tsv'):
        # Only print the surface forms.
        print(subj[1], verb[1], obj[1])
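Quick explanation:
iter_sentences iterates over the sentences. Each sentence is a nested list: it is a list of tokens, and each token is itself a list (containing the row number, surface form, lemma, POS, dependency relation, etc.).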
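To make that nested structure concrete, here is a minimal sketch that builds the list of sentences in memory instead of yielding them. The sample text and its six-column layout (index, form, lemma, POS, head index, relation) are made up for illustration and are simpler than real Frog output; group_sentences is a hypothetical helper name.

```python
import csv
import io

# Made-up, simplified columns: index, form, lemma, POS, head index, relation.
SAMPLE = (
    "1\tJan\tJan\tN\t2\tsu\n"
    "2\tleest\tlezen\tWW\t0\tROOT\n"
    "3\tboeken\tboek\tN\t2\tobj1\n"
    "\n"
    "1\tMarie\tMarie\tN\t2\tsu\n"
    "2\tslaapt\tslapen\tWW\t0\tROOT\n"
)

def group_sentences(rows):
    """Group token rows into sentences; index '1' starts a new sentence."""
    sentences = []
    for row in rows:
        if not row:
            continue  # csv.reader yields [] for blank lines; skip them
        if row[0] == '1' or not sentences:
            sentences.append([])
        sentences[-1].append(row)
    return sentences

sentences = group_sentences(csv.reader(io.StringIO(SAMPLE), delimiter='\t'))
print(len(sentences))      # 2
print(sentences[0][1][1])  # leest
print(sentences[1][0])     # ['1', 'Marie', 'Marie', 'N', '2', 'su']
```

Returning a list is convenient for inspection; for a large file, a generator keeps only one sentence in memory at a time.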
The iter_triples function iterates over triples ‹subject, verb, object›. Each element of these triples represents a token (i.e. a list).
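The matching step can be sketched on a single in-memory sentence. The rows below are invented and use a simplified column layout (index, form, lemma, POS, head index, relation); find_triples is a hypothetical name. This variant looks the verb up by its index in a dictionary instead of by list position, which is an assumption-free alternative if rows were ever skipped.

```python
# Made-up sentence; columns: index, form, lemma, POS, head index, relation.
sentence = [
    ['1', 'Jan', 'Jan', 'N', '2', 'su'],
    ['2', 'leest', 'lezen', 'WW', '0', 'ROOT'],
    ['3', 'boeken', 'boek', 'N', '2', 'obj1'],
]

def find_triples(sentence):
    """Pair each su with each obj1 that shares its head; the head is the verb."""
    by_index = {tok[0]: tok for tok in sentence}  # look tokens up by their index
    subjects = [tok for tok in sentence if tok[-1] == 'su']
    objects = [tok for tok in sentence if tok[-1] == 'obj1']
    triples = []
    for subj in subjects:
        for obj in objects:
            # tok[-2] is the index of the head token.
            if subj[-2] == obj[-2] and subj[-2] in by_index:
                triples.append((subj, by_index[subj[-2]], obj))
    return triples

for subj, verb, obj in find_triples(sentence):
    print(subj[1], verb[1], obj[1])  # Jan leest boeken
```

To keep only triples whose verb is the ROOT, as the question asks, one could additionally require that the looked-up head token has 'ROOT' in its last column.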
The last three lines of code are just an example of how to use the iter_triples function.
I don't know how much information you need for each triple...
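If only one field per token is needed, the triples can be reduced afterwards. A minimal sketch, assuming a made-up layout in which column 2 holds the lemma; Triple and lemma_triple are hypothetical names, and the column index would have to be checked against the actual file.

```python
from collections import namedtuple

Triple = namedtuple('Triple', ['subject', 'verb', 'object'])

def lemma_triple(subj, verb, obj, column=2):
    """Reduce three token rows to one column each (2 = lemma in this made-up layout)."""
    return Triple(subj[column], verb[column], obj[column])

# Invented token rows: index, form, lemma, POS, head index, relation.
subj = ['1', 'Jan', 'Jan', 'N', '2', 'su']
verb = ['2', 'leest', 'lezen', 'WW', '0', 'ROOT']
obj = ['3', 'boeken', 'boek', 'N', '2', 'obj1']

print(lemma_triple(subj, verb, obj))
# Triple(subject='Jan', verb='lezen', object='boek')
```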