Mapping of two text documents with Python
I have annotated some text data, and now I am trying to map it back onto the original text file to get more information.
I have all the annotation information in a JSON file, and I have successfully parsed all the relevant information from it. I stored the information as shown below.
- Column = entity class
- Column = start position of the text
- Column = length of the text (in characters)
- Column = value of the entity label
- Column = the actual text that was annotated
My goal now is to include the unannotated text as well. Not every sentence or character of the text document is annotated, but I want to include them so that all the information can be fed to a DL algorithm. So every unannotated sentence should be included and should show "None" as its entity class and entity label.
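For illustration, this is roughly the row layout I am aiming for (the labels, offsets, and text here are made up, just to show annotated rows next to the "None" rows for unannotated text):

rows = [
    ("Drug", 0, 7, 1, "Aspirin"),         # annotated span (hypothetical values)
    (None, 8, 12, None, "is taken for"),  # unannotated text in between
    ("Symptom", 21, 8, 2, "headache"),    # annotated span (hypothetical values)
]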
Any tips or help are appreciated!
Thanks!
The information in your annotation file is not quite accurate: since you strip the whitespace, the length of the text has to be adjusted accordingly.
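As a quick, made-up illustration of the mismatch: if an annotated span ends with whitespace, the stored length and the length of the stripped text diverge, so begin + length and begin + len(text) no longer point at the same end position.

text = "headache "          # hypothetical annotated span with a trailing space
print(len(text))            # 9 -> the length recorded in the annotation file
print(len(text.strip()))    # 8 -> the length of the text actually kept below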
def map_with_text(data_file, ann_file, out_file):
    annots = []
    # Read annotation information
    with open(ann_file, 'r') as file_in:
        for line in file_in:
            components = line.split("\t")
            label = components[0]
            begin = int(components[1])
            length = int(components[2])
            f_4 = int(components[3])
            f_5 = int(components[4])
            text = components[5].strip()
            annots.append((label, begin, length, f_4, f_5, text))
    annots = sorted(annots, key=lambda c: c[1])

    # Read text data
    with open(data_file, 'r') as original:
        original_text = original.read()
    length_original = len(original_text)

    # Get the positions of the text that is already annotated. Since the text
    # was stripped, we cannot use the stored length. You can switch to the
    # commented-out version if you think your length information is accurate.
    # pos_tup = [(begin, begin + length)
    #            for _, begin, length, _, _, text in annots]
    pos_tup = [(begin, begin + len(text))
               for _, begin, length, _, _, text in annots]

    # Flatten the annotated ranges into a single list of position markers
    pos_marker = [0] + [e for l in pos_tup for e in l] + [length_original]

    # Ranges of positions of text which have not been annotated
    not_ann_pos = [(x, y)
                   for x, y in zip(pos_marker[::2], pos_marker[1::2])]

    # Texts which have not been annotated
    not_ann_txt = [original_text[start:stop]
                   for start, stop in not_ann_pos]

    # Build rows for the unannotated spans, skipping empty ones
    all_components = [(None, start, len(txt.strip()), None, None, txt.strip())
                      for start, txt in zip(pos_marker[::2], not_ann_txt)
                      if len(txt.strip()) != 0]

    # Add the annotated rows
    all_components += annots

    # Sort by the start index
    all_components = sorted(all_components, key=lambda c: c[1])

    # Write to the output file
    with open(out_file, 'w') as f:
        for a in all_components:
            f.write(str(a[0]) + "\t" + str(a[1]) + "\t" + str(a[2]) +
                    "\t" + str(a[3]) + "\t" + str(a[4]) + "\t" + str(a[5]) + "\n")

map_with_text('0.txt', '0.ann', 'out0.tsv')
# You can call the function in a loop, once per document.
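For several document/annotation pairs, a possible loop would look like this (assuming they follow the 0.txt / 0.ann naming pattern used above; the count of three files is made up):

for i in range(3):
    map_with_text(str(i) + '.txt', str(i) + '.ann', 'out' + str(i) + '.tsv')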