Replace entity tags with IOB format

Question

I am trying to convert non-IOB named-entity tags to IOB tags in a conllu file.

Two example lines from the file are:

2 Ute Ute PROPN NE Case=Nom|Gender=Fem|Number=Sing 1 appos _ NE=PER_23|Morph=nsf

3 Wedemeier Wedemeier PROPN NE Case=Nom|Gender=Fem|Number=Sing 2 flat _ SpaceAfter=No|NE=PER_23|Morph=nsf

I want:

2 Ute Ute PROPN NE Case=Nom|Gender=Fem|Number=Sing 1 appos _ NE=B-PER|Morph=nsf

3 Wedemeier Wedemeier PROPN NE Case=Nom|Gender=Fem|Number=Sing 2 flat _ SpaceAfter=No|NE=I-PER|Morph=nsf

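I now want to parse the file and change every occurrence of "NE=NamedEntityTag_Number" to IOB. The entity type itself does not matter: every "NE=field_type_number" (in the example, "NE=PER_23") should become NE=B-PER or NE=I-PER. PER can be any field in list_of_fields, a list I created of the named-entity tags that occur in the file.

Since conllu files are saved as plain text, I am parsing the file as text. Not every line contains a named-entity tag, so I first check whether the line has one. If it does, I check whether the next line has the same tag (including the same number), and then the line after that, and so on. This is the important part: when the next line contains the same annotation with the same numeric id, it belongs to the same entity, so the first line must be B-PER and the following lines must be I-PER.

I am trying to use fileinput so that only the NE part of each line is changed.

Hope someone can help, thanks!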

import fileinput
import re

list_of_fields = ["PER", "ORG", "LOC", "GPE", "OTH"]

with fileinput.FileInput(file, inplace=True, backup=".bak") as file:
    for line in file:
        ne = [annotation for annotation in list_of_fields if (annotation in line)]
        if re.compile(r"^NE="+ne+"\_\d+$") in line:
            if re.compile(r"^NE="+ne+"\_\d+$") in next(line) == re.compile(r"^NE="+ne+"\_\d+$") in line:
                re.sub(r"^NE="+ne+"\_\d+$", r"NE=B-"+ne, line)
                re.sub(r"^NE="+ne+"\_\d+$", r"NE=I-"+ne, next(line))
            else:
                re.sub(r"^NE=" + ne + "\_\d+$", r"NE=B-" + ne, line)

Answer 1

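You have to remember the last field and the last value so that you can compare across lines. If either one differs from the current line's, replace the tag with B-<field>; otherwise replace it with I-<field>: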

import fileinput
import re

list_of_fields = ["PER", "ORG", "LOC", "GPE", "OTH"]
joined_fields = f'({"|".join(list_of_fields)})'
# captures the field name and the numeric id of the first NE tag on a line
field_pattern = re.compile(rf'NE={joined_fields}_(\d+)')
last_field = last_value = None

file = 'input.conllu'  # placeholder: path to your conllu file

# Read the input normally. inplace=True is not used here because it
# redirects stdout into the input file, which would leave that file empty
# when the result is written to a separate output file instead.
with fileinput.FileInput(file) as in_file, \
     open('output.txt', 'wt') as out_file:

    for line in in_file:
        match = field_pattern.search(line)
        if not match:
            # keep input
            out_file.write(line)
            continue
        field, value = match.groups()  # assuming only one field per line
        if field != last_field or value != last_value:
            replacement = f'NE=B-{field}'
        else:
            replacement = f'NE=I-{field}'
        last_field = field
        last_value = value
        # any further fields chained with '-' after the first one are dropped
        new_line = re.sub(rf'NE={field}_{value}(-{joined_fields}_\d+)*', replacement, line)
        out_file.write(new_line)
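
If you want to try the same idea on a few lines in memory before touching a file, it can be wrapped in a generator. This is only a sketch: the names `to_iob`, `FIELDS` and `NE_TAG` are made up for illustration, and it makes one assumption that the loop above does not, namely that a line without any NE tag ends the current entity (the loop above keeps the last tag across untagged lines).

```python
import re

FIELDS = ["PER", "ORG", "LOC", "GPE", "OTH"]
# first group: field name, second group: numeric id
NE_TAG = re.compile(rf'NE=({"|".join(FIELDS)})_(\d+)')


def to_iob(lines):
    """Yield each line with its first NE=<FIELD>_<id> tag rewritten as B-/I-."""
    previous = None  # (field, id) found on the preceding line, if any
    for line in lines:
        match = NE_TAG.search(line)
        if match is None:
            # assumption: an untagged line closes the current entity
            previous = None
            yield line
            continue
        current = match.groups()
        prefix = "I" if current == previous else "B"
        previous = current
        # count=1: only the first tag on the line is rewritten
        yield NE_TAG.sub(f"NE={prefix}-{current[0]}", line, count=1)


sample = [
    "2 Ute Ute PROPN NE Case=Nom|Gender=Fem|Number=Sing 1 appos _ NE=PER_23|Morph=nsf\n",
    "3 Wedemeier Wedemeier PROPN NE Case=Nom|Gender=Fem|Number=Sing 2 flat _ SpaceAfter=No|NE=PER_23|Morph=nsf\n",
]
for converted in to_iob(sample):
    print(converted, end="")
```

For the two example lines from the question this prints the first with NE=B-PER and the second with NE=I-PER. Since an open file is an iterable of lines, `out_file.writelines(to_iob(in_file))` should be enough to convert a whole file.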

Edit: allow multiple fields on a line, using only the first one.
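
To illustrate what "using only the first one" means for the substitution, here is a small sketch. The sample line and the helper name `keep_first_field` are made up, and the shape of a multi-field tag (fields chained with `-`, e.g. `NE=LOC_7-ORG_12`) is an assumption read off the `(-..._\d+)*` part of the pattern, not something confirmed by the question's data.

```python
import re

JOINED = "(PER|ORG|LOC|GPE|OTH)"


def keep_first_field(line, field, value, prefix):
    """Replace NE=<field>_<value> plus any chained '-FIELD_id' parts with one IOB tag."""
    pattern = rf"NE={field}_{value}(-{JOINED}_\d+)*"
    return re.sub(pattern, f"NE={prefix}-{field}", line)


# hypothetical line carrying two chained fields
line = "5 Bonn Bonn PROPN NE _ 3 nmod _ NE=LOC_7-ORG_12|Morph=nsn"
print(keep_first_field(line, "LOC", "7", "B"))
```

The whole `LOC_7-ORG_12` run is matched and replaced by a single tag for the first field, so the chained `ORG_12` part does not survive in the output.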

regex

named-entity-recognition

text-parsing

python-3.x