Mapping of two text documents with Python
I have annotated some text data, and now I am trying to map it back onto the original text file to get more information.
I have all the annotation information in a JSON file, and I have successfully parsed all the relevant information from it. I stored the information as shown below.
- Column = entity class
- Column = start position of the text
- Column = length of the text (in characters)
- Column = value of the entity label
- Column = the actual text that was annotated
My goal now is to include the unannotated text as well. Not every sentence or character of the text document is annotated, but I want to include them so that all the information can be fed to a DL algorithm. So every unannotated sentence should be included and should show "None" as its entity class and entity label.
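For illustration, this is roughly the row layout I am aiming for (the labels, offsets, and text here are made up, just to show annotated rows next to the "None" rows for unannotated text):

rows = [
    ("Drug", 0, 7, 1, "Aspirin"),         # annotated span (hypothetical values)
    (None, 8, 12, None, "is taken for"),  # unannotated text in between
    ("Symptom", 21, 8, 2, "headache"),    # annotated span (hypothetical values)
]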
Any tips or help are appreciated!
Thanks!
The information in your annotation file is not quite accurate: since you strip the whitespace, the length of the text has to be adjusted accordingly.
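As a quick, made-up illustration of the mismatch: if an annotated span ends with whitespace, the stored length and the length of the stripped text diverge, so begin + length and begin + len(text) no longer point at the same end position.

text = "headache "          # hypothetical annotated span with a trailing space
print(len(text))            # 9 -> the length recorded in the annotation file
print(len(text.strip()))    # 8 -> the length of the text actually kept below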
def map_with_text(data_file, ann_file, out_file):
    annots = []
    # Read annotation information
    with open(ann_file, 'r') as file_in:
        for line in file_in:
            components = line.split("\t")
            label = components[0]
            begin = int(components[1])
            length = int(components[2])
            f_4 = int(components[3])
            f_5 = int(components[4])
            text = components[5].strip()
            annots.append((label, begin, length, f_4, f_5, text))
    annots = sorted(annots, key=lambda c: c[1])

    # Read text data
    with open(data_file, 'r') as original:
        original_text = original.read()
    length_original = len(original_text)

    # Get the positions of the text that is already annotated. Since the text
    # was stripped, we cannot use the stored length. You can switch to the
    # commented-out version if you think your length information is accurate.
    # pos_tup = [(begin, begin + length)
    #            for _, begin, length, _, _, text in annots]
    pos_tup = [(begin, begin + len(text))
               for _, begin, length, _, _, text in annots]

    # Flatten the annotated ranges into a single list of position markers
    pos_marker = [0] + [e for l in pos_tup for e in l] + [length_original]

    # Ranges of positions of text which have not been annotated
    not_ann_pos = [(x, y)
                   for x, y in zip(pos_marker[::2], pos_marker[1::2])]

    # Texts which have not been annotated
    not_ann_txt = [original_text[start:stop]
                   for start, stop in not_ann_pos]

    # Build rows for the unannotated spans, skipping empty ones
    all_components = [(None, start, len(txt.strip()), None, None, txt.strip())
                      for start, txt in zip(pos_marker[::2], not_ann_txt)
                      if len(txt.strip()) != 0]

    # Add the annotated rows
    all_components += annots

    # Sort by the start index
    all_components = sorted(all_components, key=lambda c: c[1])

    # Write to the output file
    with open(out_file, 'w') as f:
        for a in all_components:
            f.write(str(a[0]) + "\t" + str(a[1]) + "\t" + str(a[2]) +
                    "\t" + str(a[3]) + "\t" + str(a[4]) + "\t" + str(a[5]) + "\n")

map_with_text('0.txt', '0.ann', 'out0.tsv')
# You can call the function in a loop, once per document.
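For several document/annotation pairs, a possible loop would look like this (assuming they follow the 0.txt / 0.ann naming pattern used above; the count of three files is made up):

for i in range(3):
    map_with_text(str(i) + '.txt', str(i) + '.ann', 'out' + str(i) + '.tsv')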