How can we split an XML document with lxml?
I'm looking for a good way to split the following XML
<?xml version='1.0' encoding='US-ASCII'?><cml xmlns="http://www.chemaxon.com" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.chemaxon.com/marvin/schema/mrvSchema_20_20_0.xsd" version="ChemAxon file format v20.20.0, generated by v21.14.0">
<MDocument><MChemicalStruct><molecule molID="m1"><atomArray atomID="a1 a2 a3" elementType="C C O"/><bondArray><bond id="b1" atomRefs2="a1 a2" order="1"/><bond id="b2" atomRefs2="a2 a3" order="1"/></bondArray></molecule></MChemicalStruct></MDocument>
<MDocument><MChemicalStruct><molecule molID="m2"><atomArray atomID="a1 a2 a3 a4" elementType="C C C C"/><bondArray><bond id="b1" atomRefs2="a1 a2" order="1"/><bond id="b2" atomRefs2="a2 a3" order="1"/><bond id="b3" atomRefs2="a3 a4" order="1"/></bondArray></molecule></MChemicalStruct></MDocument>
</cml>
into pieces (two in this case):
<?xml version='1.0' encoding='US-ASCII'?><cml xmlns="http://www.chemaxon.com" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.chemaxon.com/marvin/schema/mrvSchema_20_20_0.xsd" version="ChemAxon file format v20.20.0, generated by v21.14.0">
<MDocument><MChemicalStruct><molecule molID="m1"><atomArray atomID="a1 a2 a3" elementType="C C O"/><bondArray><bond id="b1" atomRefs2="a1 a2" order="1"/><bond id="b2" atomRefs2="a2 a3" order="1"/></bondArray></molecule></MChemicalStruct></MDocument>
</cml>
and
<?xml version='1.0' encoding='US-ASCII'?><cml xmlns="http://www.chemaxon.com" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.chemaxon.com/marvin/schema/mrvSchema_20_20_0.xsd" version="ChemAxon file format v20.20.0, generated by v21.14.0">
<MDocument><MChemicalStruct><molecule molID="m2"><atomArray atomID="a1 a2 a3 a4" elementType="C C C C"/><bondArray><bond id="b1" atomRefs2="a1 a2" order="1"/><bond id="b2" atomRefs2="a2 a3" order="1"/><bond id="b3" atomRefs2="a3 a4" order="1"/></bondArray></molecule></MChemicalStruct></MDocument>
</cml>
I'm experimenting with the code below, but it doesn't look very elegant. Is there a better way to achieve this?
from copy import deepcopy
from lxml import etree
starting_xml_string = '''<?xml version='1.0' encoding='US-ASCII'?><cml xmlns="http://www.chemaxon.com" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.chemaxon.com/marvin/schema/mrvSchema_20_20_0.xsd" version="ChemAxon file format v20.20.0, generated by v21.14.0">
<MDocument><MChemicalStruct><molecule molID="m1"><atomArray atomID="a1 a2 a3" elementType="C C O"/><bondArray><bond id="b1" atomRefs2="a1 a2" order="1"/><bond id="b2" atomRefs2="a2 a3" order="1"/></bondArray></molecule></MChemicalStruct></MDocument>
<MDocument><MChemicalStruct><molecule molID="m2"><atomArray atomID="a1 a2 a3 a4" elementType="C C C C"/><bondArray><bond id="b1" atomRefs2="a1 a2" order="1"/><bond id="b2" atomRefs2="a2 a3" order="1"/><bond id="b3" atomRefs2="a3 a4" order="1"/></bondArray></molecule></MChemicalStruct></MDocument>
</cml>'''
root = etree.fromstring(starting_xml_string.encode('utf-8'))
# build an empty envelope: a copy of the root with all children removed
envelope = deepcopy(root)
for mol in list(envelope):
    envelope.remove(mol)
fragments = []
for fragment in list(root):
    tmp = deepcopy(envelope)
    # append() moves the fragment out of root and into the envelope copy
    tmp.append(fragment)
    tmp = etree.tostring(tmp, xml_declaration=True, encoding=root.getroottree().docinfo.encoding).decode('utf-8')
    fragments.append(tmp)
Thank you very much for your help.
I would handle it as follows. Note that you have to account for namespaces, which the code reflects:
# define a helper function
def cleanup(id):
    # re-parse so that every call starts from the complete document
    root = etree.fromstring(starting_xml_string.encode('utf-8'))
    # define an xpath expression selecting the MDocument that holds this molecule
    exp = f'//xx:MDocument[.//xx:molecule[@molID="{id}"]]'
    target = root.xpath(exp, namespaces=ns)[0]
    # remove it; the written file keeps every other MDocument
    target.getparent().remove(target)
    fn = f"myfile_without_{id}.xml"
    with open(fn, 'w') as doc:
        final = etree.tostring(root, xml_declaration=True, pretty_print=True)
        doc.write(final.decode())

# declare namespaces
ns = {"xx": "http://www.chemaxon.com"}
# get your target molecule ids
root = etree.fromstring(starting_xml_string.encode('utf-8'))
ids = root.xpath('//xx:MDocument//xx:molecule/@molID', namespaces=ns)
for id in ids:
    cleanup(id)
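Note that the helper above removes one MDocument per file, so with more than two molecules each output would still contain several of them. If exactly one MDocument per piece is wanted, a possible variation is to build a fresh root for every child instead. The following is only a minimal sketch under that assumption; the function name `split_documents`, the `NS` mapping, and the prefix `c` are illustrative choices and do not come from the question or the answer.

```python
from copy import deepcopy
from lxml import etree

# The prefix "c" is an arbitrary label for the document's default namespace.
NS = {"c": "http://www.chemaxon.com"}


def split_documents(xml_bytes):
    """Return one serialized <cml> document (bytes) per <MDocument> child."""
    root = etree.fromstring(xml_bytes)
    # Reuse the encoding declared in the source, falling back to UTF-8.
    encoding = root.getroottree().docinfo.encoding or "UTF-8"
    pieces = []
    for doc in root.findall("c:MDocument", NS):
        # New, empty root with the same tag, attributes and namespace map.
        envelope = etree.Element(root.tag, attrib=dict(root.attrib), nsmap=root.nsmap)
        # Copy the child so the parsed source tree is left untouched.
        envelope.append(deepcopy(doc))
        pieces.append(etree.tostring(envelope, xml_declaration=True, encoding=encoding))
    return pieces
```

Called as `split_documents(starting_xml_string.encode())`, this should give one byte string per molecule, which can then be written to files opened in binary mode. Building the envelope with `etree.Element` avoids copying the whole tree only to empty it again.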