How can we split an XML with lxml?

Question

I am looking for a good way to split the following XML:

<?xml version='1.0' encoding='US-ASCII'?><cml xmlns="http://www.chemaxon.com" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.chemaxon.com/marvin/schema/mrvSchema_20_20_0.xsd" version="ChemAxon file format v20.20.0, generated by v21.14.0">
<MDocument><MChemicalStruct><molecule molID="m1"><atomArray atomID="a1 a2 a3" elementType="C C O"/><bondArray><bond id="b1" atomRefs2="a1 a2" order="1"/><bond id="b2" atomRefs2="a2 a3" order="1"/></bondArray></molecule></MChemicalStruct></MDocument>
<MDocument><MChemicalStruct><molecule molID="m2"><atomArray atomID="a1 a2 a3 a4" elementType="C C C C"/><bondArray><bond id="b1" atomRefs2="a1 a2" order="1"/><bond id="b2" atomRefs2="a2 a3" order="1"/><bond id="b3" atomRefs2="a3 a4" order="1"/></bondArray></molecule></MChemicalStruct></MDocument>
</cml>

into pieces (two in this case):

<?xml version='1.0' encoding='US-ASCII'?><cml xmlns="http://www.chemaxon.com" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.chemaxon.com/marvin/schema/mrvSchema_20_20_0.xsd" version="ChemAxon file format v20.20.0, generated by v21.14.0">
<MDocument><MChemicalStruct><molecule molID="m1"><atomArray atomID="a1 a2 a3" elementType="C C O"/><bondArray><bond id="b1" atomRefs2="a1 a2" order="1"/><bond id="b2" atomRefs2="a2 a3" order="1"/></bondArray></molecule></MChemicalStruct></MDocument>
</cml>

and

<?xml version='1.0' encoding='US-ASCII'?><cml xmlns="http://www.chemaxon.com" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.chemaxon.com/marvin/schema/mrvSchema_20_20_0.xsd" version="ChemAxon file format v20.20.0, generated by v21.14.0">
<MDocument><MChemicalStruct><molecule molID="m2"><atomArray atomID="a1 a2 a3 a4" elementType="C C C C"/><bondArray><bond id="b1" atomRefs2="a1 a2" order="1"/><bond id="b2" atomRefs2="a2 a3" order="1"/><bond id="b3" atomRefs2="a3 a4" order="1"/></bondArray></molecule></MChemicalStruct></MDocument>
</cml>

I am experimenting with the code below, but it does not look very elegant. Is there a better way to achieve this?

from copy import deepcopy

from lxml import etree
starting_xml_string = '''<?xml version='1.0' encoding='US-ASCII'?><cml xmlns="http://www.chemaxon.com" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.chemaxon.com/marvin/schema/mrvSchema_20_20_0.xsd" version="ChemAxon file format v20.20.0, generated by v21.14.0">
<MDocument><MChemicalStruct><molecule molID="m1"><atomArray atomID="a1 a2 a3" elementType="C C O"/><bondArray><bond id="b1" atomRefs2="a1 a2" order="1"/><bond id="b2" atomRefs2="a2 a3" order="1"/></bondArray></molecule></MChemicalStruct></MDocument>
<MDocument><MChemicalStruct><molecule molID="m2"><atomArray atomID="a1 a2 a3 a4" elementType="C C C C"/><bondArray><bond id="b1" atomRefs2="a1 a2" order="1"/><bond id="b2" atomRefs2="a2 a3" order="1"/><bond id="b3" atomRefs2="a3 a4" order="1"/></bondArray></molecule></MChemicalStruct></MDocument>
</cml>'''
root = etree.fromstring(starting_xml_string.encode('utf-8'))

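# build an empty envelope: a copy of the root with all children removed
envelope = deepcopy(root)
for mol in list(envelope):  # iterate over a snapshot; removing while iterating can skip elements
    envelope.remove(mol)
fragments = []
for fragment in list(root):  # getchildren() is deprecated in lxml
    tmp = deepcopy(envelope)
    tmp.append(deepcopy(fragment))  # append a copy so the original tree stays intact
    tmp = etree.tostring(tmp, xml_declaration=True, encoding=root.getroottree().docinfo.encoding).decode('utf-8')
    fragments.append(tmp)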

Thank you very much for your help.

Answer 1

I would approach it as follows. Note that you have to account for namespaces, which the code reflects:

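# assumes etree is imported, and starting_xml_string and root (the freshly
# parsed document) are defined as in the question

# declare namespaces; the prefix "xx" is arbitrary and maps to the document's default namespace
ns = {"xx": "http://www.chemaxon.com"}

# define a helper function
def cleanup(mol_id):
    # re-parse so that each output starts from the complete document
    root = etree.fromstring(starting_xml_string.encode('utf-8'))
    # define an xpath expression selecting the MDocument that holds this molecule
    exp = f'//xx:MDocument[.//xx:molecule[@molID="{mol_id}"]]'
    target = root.xpath(exp, namespaces=ns)[0]
    target.getparent().remove(target)
    fn = f"myfile_without_{mol_id}.xml"
    with open(fn, 'w') as doc:
        final = etree.tostring(root, xml_declaration=True, pretty_print=True)
        doc.write(final.decode())

# get your target molecule ids
ids = root.xpath('//xx:MDocument//xx:molecule/@molID', namespaces=ns)
for mol_id in ids:
    cleanup(mol_id)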
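
Note that the cleanup helper above writes one file per molecule with that molecule removed, so each file holds exactly one molecule only when the input has two. For inputs with more molecules, a variant that keeps a single MDocument per output may be closer to what the question asks for. The following is only a minimal sketch under that assumption: the function name split_cml and the constant NS are hypothetical, and the envelope is rebuilt from the root's tag, attributes, and namespace map rather than copied.

```python
from copy import deepcopy

from lxml import etree

# The prefix "xx" is arbitrary; it maps to the document's default namespace.
NS = {"xx": "http://www.chemaxon.com"}


def split_cml(xml_bytes):
    """Return one serialized <cml> document (as bytes) per <MDocument> child."""
    root = etree.fromstring(xml_bytes)
    # Reuse the encoding declared in the input, falling back to UTF-8.
    encoding = root.getroottree().docinfo.encoding or "UTF-8"
    pieces = []
    for doc in root.xpath("xx:MDocument", namespaces=NS):
        # Fresh root with the same tag, attributes, and namespace declarations.
        envelope = etree.Element(root.tag, attrib=dict(root.attrib), nsmap=root.nsmap)
        envelope.text = root.text
        # Append a copy so the parsed input tree is not modified.
        envelope.append(deepcopy(doc))
        pieces.append(etree.tostring(envelope, xml_declaration=True, encoding=encoding))
    return pieces
```

With the question's input, split_cml(starting_xml_string.encode('utf-8')) should return two byte strings, each of which can be written to its own file opened in binary mode ('wb').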
