Create raw text from XML tags
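I have some XML that has been run through an NLP processor. I have to modify the output in a Python script, so XSLT is not an option for me. I am trying to extract all of the raw text between <TXT> and </TXT> from my XML as a single string, but I am stuck on how to pull it out of ElementTree.
My code so far is: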
import xml.etree.ElementTree as ET
xml_doc = """<?xml version="1.0" encoding="UTF-8"?>
<NORMDOC>
<DOC>
<DOCID>112233</DOCID>
<TXT>
<S sid="112233-SENT-001"><ENAMEX type="PERSON" id="PER-112233-001">George Washington</ENAMEX> and <ENAMEX type="PERSON" id="PER-112233-002">Thomas Jefferson</ENAMEX> were both founding fathers.</S>
<S sid="112233-SENT-002"><ENAMEX type="PERSON" id="PER-112233-002">Thomas Jefferson</ENAMEX> has a social security number of <IDEX type="SSN" id="SSN-112233-075">222-22-2222</IDEX>.</S>
</TXT>
</DOC>
</NORMDOC>
"""
tree = ET.parse(xml_doc) # xml_doc is actually a file, but for reproducibility it's the above xml
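From there I want to extract everything inside TXT as a string with the tags stripped. It has to be a string because other processes consume it. I want it to look like output_txt below.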
output_txt = "George Washington and Thomas Jefferson were both founding fathers. Thomas Jefferson has a social security number of 222-22-2222."
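I imagine this should be fairly simple and straightforward, but I just can't figure it out. I tried using this solution, but I got AttributeError: 'ElementTree' object has no attribute 'itertext', and it would strip all the tags in the XML rather than just those between <TXT> and </TXT>.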
Normally I would use plain XPath for this:
normalize-space(//TXT)
However, XPath support in ElementTree is limited, so you can only do this in lxml.
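For reference, the XPath route could look roughly like the sketch below. It assumes the third-party lxml package is installed and uses a shortened copy of the sample document. The input is passed as bytes because lxml rejects a str that carries an encoding declaration.

```python
from lxml import etree  # third-party package, assumed to be installed

# Shortened sample; bytes are used because the document declares an encoding.
xml_doc = b"""<?xml version="1.0" encoding="UTF-8"?>
<NORMDOC>
<DOC>
<DOCID>112233</DOCID>
<TXT>
<S sid="112233-SENT-001"><ENAMEX type="PERSON">George Washington</ENAMEX> and <ENAMEX type="PERSON">Thomas Jefferson</ENAMEX> were both founding fathers.</S>
<S sid="112233-SENT-002"><ENAMEX type="PERSON">Thomas Jefferson</ENAMEX> has a social security number of <IDEX type="SSN">222-22-2222</IDEX>.</S>
</TXT>
</DOC>
</NORMDOC>
"""

root = etree.fromstring(xml_doc)

# normalize-space() takes the string value of the first TXT element,
# trims it, and collapses internal runs of whitespace to single spaces.
normalized_text = root.xpath("normalize-space(//TXT)")
print(normalized_text)
```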
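To do this in ElementTree, I would follow the answer you linked to in your question: use method="text" to force tostring to output plain text. You will also want to normalize the whitespace.
Example...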
import xml.etree.ElementTree as ET
xml_doc = """<?xml version="1.0" encoding="UTF-8"?>
<NORMDOC>
<DOC>
<DOCID>112233</DOCID>
<TXT>
<S sid="112233-SENT-001"><ENAMEX type="PERSON" id="PER-112233-001">George Washington</ENAMEX> and <ENAMEX type="PERSON" id="PER-112233-002">Thomas Jefferson</ENAMEX> were both founding fathers.</S>
<S sid="112233-SENT-002"><ENAMEX type="PERSON" id="PER-112233-002">Thomas Jefferson</ENAMEX> has a social security number of <IDEX type="SSN" id="SSN-112233-075">222-22-2222</IDEX>.</S>
</TXT>
</DOC>
</NORMDOC>
"""
tree = ET.fromstring(xml_doc)
txt = tree.find(".//TXT")
raw_text = ET.tostring(txt, encoding='utf8', method='text').decode()
normalized_text = " ".join(raw_text.split())
print(normalized_text)
Printed output...
George Washington and Thomas Jefferson were both founding fathers. Thomas Jefferson has a social security number of 222-22-2222.
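As for the AttributeError in the question: ET.parse returns an ElementTree wrapper, while itertext is a method of Element, so it has to be called on an element such as the one returned by find. The sketch below is one possible variant under that assumption. It uses a shortened copy of the sample and an in-memory io.BytesIO object as a stand-in for the real file.

```python
import io
import xml.etree.ElementTree as ET

# Shortened sample; io.BytesIO stands in for the real file on disk.
xml_doc = """<?xml version="1.0" encoding="UTF-8"?>
<NORMDOC>
<DOC>
<DOCID>112233</DOCID>
<TXT>
<S sid="112233-SENT-001"><ENAMEX type="PERSON">George Washington</ENAMEX> and <ENAMEX type="PERSON">Thomas Jefferson</ENAMEX> were both founding fathers.</S>
<S sid="112233-SENT-002"><ENAMEX type="PERSON">Thomas Jefferson</ENAMEX> has a social security number of <IDEX type="SSN">222-22-2222</IDEX>.</S>
</TXT>
</DOC>
</NORMDOC>
"""

tree = ET.parse(io.BytesIO(xml_doc.encode("utf-8")))  # ElementTree, not Element
root = tree.getroot()                                  # the root Element

# Restrict the search to <TXT> so that DOCID and other siblings are excluded.
txt = root.find(".//TXT")

# itertext() yields every text fragment under <TXT> in document order;
# split/join collapses newlines and repeated spaces.
normalized_text = " ".join("".join(txt.itertext()).split())
print(normalized_text)
```

Calling itertext on the TXT element rather than on the root is what keeps the DOCID value out of the result.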