How to convert a txt.knowtator.xml file to .ann?
I have an annotated dataset in txt.knowtator.xml format:
<?xml version="1.0" encoding="UTF-8"?>
<annotations textSource="file.txt">
<annotation>
<mention id="EHOST_Instance_93" />
<annotator id="01">Unknown</annotator>
<span start="127" end="237" />
<spannedText>Omeprazole</spannedText>
<creationDate>Wed Mar 11 09:52:01 GMT 2010</creationDate>
</annotation>
<classMention id="EHOST_Instance_93">
<mentionClass id="Treatment">Omeprazole</mentionClass>
</classMention>
<annotation>
<mention id="EHOST_Instance_94" />
<annotator id="01">Unkown</annotator>
<span start="600" end="612" />
<spannedText>Tegretol</spannedText>
<creationDate>Wed Mar 11 09:55:11 GMT 2010</creationDate>
</annotation>
<classMention id="EHOST_Instance_94">
<mentionClass id="Treatment">Tegretol</mentionClass>
</classMention>
</annotations>
I need to convert it to standoff BRAT format (.ann), for example:
T1 Treatment 127 137 Omeprazole
T2 Treatment 600 612 Tegretol
Are there any tools available for converting/parsing it?
See below.
import xml.etree.ElementTree as ET
xml = '''<?xml version="1.0" encoding="UTF-8"?>
<annotations textSource="file.txt">
<annotation>
<mention id="EHOST_Instance_93" />
<annotator id="01">Unknown</annotator>
<span start="127" end="237" />
<spannedText>Omeprazole</spannedText>
<creationDate>Wed Mar 11 09:52:01 GMT 2010</creationDate>
</annotation>
<classMention id="EHOST_Instance_93">
<mentionClass id="Treatment">Omeprazole</mentionClass>
</classMention>
</annotations>'''
root = ET.fromstring(xml)
# Read the character offsets and covered text of the first annotation.
span = root.find(".//span")
text = root.find(".//spannedText").text
print(f'T1 Treatment {span.attrib["start"]} {span.attrib["end"]} {text}')
Output
T1 Treatment 127 237 Omeprazole
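The snippet above hardcodes the ID and label and reads only the first annotation. A minimal sketch of a more general conversion follows, assuming every `<annotation>` has a single `<span>` and that its `<mention id>` matches the `id` of a `<classMention>` whose `<mentionClass id>` holds the label. The function name `knowtator_to_brat` and the fallback label `Unknown` are made up for illustration. Offsets are copied verbatim from the XML, and the fields are separated the way brat standoff expects (tab after the ID, tab before the text).

```python
import xml.etree.ElementTree as ET


def knowtator_to_brat(xml_text):
    """Return brat text-bound annotation lines for a Knowtator/eHOST XML string.

    Hypothetical helper: assumes one <span> per <annotation> and well-formed XML.
    """
    root = ET.fromstring(xml_text)

    # Map each mention id to its class label (<mentionClass id="...">).
    labels = {}
    for class_mention in root.iter("classMention"):
        mention_class = class_mention.find("mentionClass")
        if mention_class is not None:
            labels[class_mention.get("id")] = mention_class.get("id")

    lines = []
    for annotation in root.iter("annotation"):
        mention = annotation.find("mention")
        span = annotation.find("span")
        if mention is None or span is None:
            continue  # skip annotations without an id or offsets
        label = labels.get(mention.get("id"), "Unknown")  # assumed fallback label
        covered = annotation.findtext("spannedText", default="")
        # brat format: ID <TAB> TYPE START END <TAB> TEXT
        lines.append(
            f"T{len(lines) + 1}\t{label} {span.get('start')} {span.get('end')}\t{covered}"
        )
    return lines


sample = """<?xml version="1.0" encoding="UTF-8"?>
<annotations textSource="file.txt">
<annotation>
<mention id="EHOST_Instance_93" />
<span start="127" end="237" />
<spannedText>Omeprazole</spannedText>
</annotation>
<classMention id="EHOST_Instance_93">
<mentionClass id="Treatment">Omeprazole</mentionClass>
</classMention>
<annotation>
<mention id="EHOST_Instance_94" />
<span start="600" end="612" />
<spannedText>Tegretol</spannedText>
</annotation>
<classMention id="EHOST_Instance_94">
<mentionClass id="Treatment">Tegretol</mentionClass>
</classMention>
</annotations>"""

# encode() because ElementTree rejects a str that carries an encoding declaration
# only in some cases; passing bytes works for both.
for line in knowtator_to_brat(sample.encode("utf-8")):
    print(line)
```

To produce a file, join the returned lines with newlines and write them to `file.ann` next to `file.txt`. Annotations with several spans, or spanned text containing line breaks, would need extra handling that this sketch does not attempt.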