Create CSV from XML/JSON using Python Pandas
I am trying to parse an XML file into several different files.
Sample XML
<integration-outbound:IntegrationEntity
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
<integrationEntityHeader>
<integrationTrackingNumber>281#963-4c1d-9d26-877ba40a4b4b#1583507840354</integrationTrackingNumber>
<referenceCodeForEntity>25428</referenceCodeForEntity>
<attachments>
<attachment>
<id>d6esd1d518b06019e01</id>
<name>durance.pdf</name>
<size>0</size>
</attachment>
<attachment>
<id>182e60164ddd4236b5bd96109</id>
<name>ssds</name>
<size>0</size>
</attachment>
</attachments>
<source>SIM</source>
<entity>SUPPLIER</entity>
<action>CREATE</action>
<timestampUTC>20200306T151721</timestampUTC>
<zDocBaseVersion>2.0</zDocBaseVersion>
<zDocCustomVersion>0</zDocCustomVersion>
</integrationEntityHeader>
<integrationEntityDetails>
<supplier>
<requestId>2614352</requestId>
<controlBlock>
<dataProcessingInfo>
<key>MODE</key>
<value>Onboarding</value>
</dataProcessingInfo>
<dataProcessingInfo>
<key>Supplier_Type</key>
<value>Operational</value>
</dataProcessingInfo>
</controlBlock>
<id>1647059</id>
<facilityCode>0001</facilityCode>
<systemCode>1</systemCode>
<supplierType>Operational</supplierType>
<systemFacilityDetails>
<systemFacilityDetail>
<facilityCode>0001</facilityCode>
<systemCode>1</systemCode>
<FacilityStatus>ACTIVE</FacilityStatus>
</systemFacilityDetail>
</systemFacilityDetails>
<status>ACTIVE</status>
<companyDetails>
<displayGSID>254232128</displayGSID>
<legalCompanyName>asdasdsads</legalCompanyName>
<dunsNumber>03-175-2493</dunsNumber>
<legalStructure>1</legalStructure>
<website>www.aaadistributor.com</website>
<noEmp>25</noEmp>
<companyIndicator1099>No</companyIndicator1099>
<taxidAndWxformRequired>NO</taxidAndWxformRequired>
<taxidFormat>Fed. Tax</taxidFormat>
<wxForm>182e601649ade4c38cd4236b5bd96109</wxForm>
<taxid>27-2204474</taxid>
<companyTypeFix>SUPPLIER</companyTypeFix>
<fields>
<field>
<id>LOW_CUURENT_SERV</id>
<value>1</value>
</field>
<field>
<id>LOW_COI</id>
<value>USA</value>
</field>
<field>
<id>LOW_STATE_INCO</id>
<value>US-PA</value>
</field>
<field>
<id>CERT_INSURANCE</id>
<value>d6e6e460fe8958564c1d518b06019e01</value>
</field>
<field>
<id>COMP_DBA</id>
<value>asdadas</value>
</field>
<field>
<id>LOW_AREUDIVE</id>
<value>N</value>
</field>
<field>
<id>LOW_BU_SIZE1</id>
<value>SMLBUS</value>
</field>
<field>
<id>EDI_CAP</id>
<value>Y</value>
</field>
<field>
<id>EDI_WEB</id>
<value>N</value>
</field>
<field>
<id>EDI_TRAD</id>
<value>N</value>
</field>
</fields>
</companyDetails>
<allLocations>
<location>
<addressInternalid>1704342</addressInternalid>
<isDelete>false</isDelete>
<internalSupplierid>1647059</internalSupplierid>
<acctGrpid>HQ</acctGrpid>
<address1>2501 GRANT AVE</address1>
<country>USA</country>
<state>US-PA</state>
<city>PHILADELPHIA</city>
<zip>19114</zip>
<phone>(215) 745-7900</phone>
</location>
</allLocations>
<contactDetails>
<contactDetail>
<contactInternalid>12232</contactInternalid>
<isDelete>false</isDelete>
<addressInternalid>1704312142</addressInternalid>
<contactType>Main</contactType>
<firstName>Raf</firstName>
<lastName>jas</lastName>
<title>Admin</title>
<email>abcd@gmail.com</email>
<phoneNo>123-42-23-23</phoneNo>
<createPortalLogin>yes</createPortalLogin>
<allowedPortalSideProducts>SIM,iSource,iContract</allowedPortalSideProducts>
</contactDetail>
<contactDetail>
<contactInternalid>1944938</contactInternalid>
<isDelete>false</isDelete>
<addressInternalid>1704342</addressInternalid>
<contactType>Rad</contactType>
<firstName>AVs</firstName>
<lastName>asd</lastName>
<title>Founder</title>
<email>as@sds.com</email>
<phoneNo>21521-2112-7900</phoneNo>
<createPortalLogin>yes</createPortalLogin>
<allowedPortalSideProducts>SIM,iContract,iSource</allowedPortalSideProducts>
</contactDetail>
</contactDetails>
<myLocation>
<addresses>
<myLocationsInternalid>1704342</myLocationsInternalid>
<isDelete>false</isDelete>
<addressInternalid>1704342</addressInternalid>
<usedAt>N</usedAt>
</addresses>
</myLocation>
<bankDetails>
<fields>
<field>
<id>LOW_BANK_KEY</id>
<value>123213</value>
</field>
<field>
<id>LOW_EFT</id>
<value>123123</value>
</field>
</fields>
</bankDetails>
<forms>
<form>
<id>CATEGORY_PRODSER</id>
<records>
<record>
<Internalid>24348</Internalid>
<isDelete>false</isDelete>
<fields>
<field>
<id>CATEGOR_LEVEL_1</id>
<value>MR</value>
</field>
<field>
<id>LOW_PRODSERV</id>
<value>RES</value>
</field>
<field>
<id>LOW_LEVEL_2</id>
<value>keylevel221</value>
</field>
<field>
<id>LOW_LEVEL_3</id>
<value>keylevel3127</value>
</field>
<field>
<id>LOW_LEVEL_4</id>
<value>keylevel4434</value>
</field>
<field>
<id>LOW_LEVEL_5</id>
<value>keylevel5545</value>
</field>
</fields>
</record>
<record>
<Internalid>24349</Internalid>
<isDelete>false</isDelete>
<fields>
<field>
<id>CATEGOR_LEVEL_1</id>
<value>MR</value>
</field>
<field>
<id>LOW_PRODSERV</id>
<value>RES</value>
</field>
<field>
<id>LOW_LEVEL_2</id>
<value>keylevel221</value>
</field>
<field>
<id>LOW_LEVEL_3</id>
<value>keylevel3125</value>
</field>
<field>
<id>LOW_LEVEL_4</id>
<value>keylevel4268</value>
</field>
<field>
<id>LOW_LEVEL_5</id>
<value>keylevel5418</value>
</field>
</fields>
</record>
<record>
<Internalid>24350</Internalid>
<isDelete>false</isDelete>
<fields>
<field>
<id>CATEGOR_LEVEL_1</id>
<value>MR</value>
</field>
<field>
<id>LOW_PRODSERV</id>
<value>RES</value>
</field>
<field>
<id>LOW_LEVEL_2</id>
<value>keylevel221</value>
</field>
<field>
<id>LOW_LEVEL_3</id>
<value>keylevel3122</value>
</field>
<field>
<id>LOW_LEVEL_4</id>
<value>keylevel425</value>
</field>
<field>
<id>LOW_LEVEL_5</id>
<value>keylevel5221</value>
</field>
</fields>
</record>
</records>
</form>
<form>
<id>OTHER_INFOR</id>
<records>
<record>
<isDelete>false</isDelete>
<fields>
<field>
<id>S_EAST</id>
<value>N</value>
</field>
<field>
<id>W_EST</id>
<value>N</value>
</field>
<field>
<id>M_WEST</id>
<value>N</value>
</field>
<field>
<id>N_EAST</id>
<value>N</value>
</field>
<field>
<id>LOW_AREYOU_ASSET</id>
<value>-1</value>
</field>
<field>
<id>LOW_SWART_PROG</id>
<value>-1</value>
</field>
</fields>
</record>
</records>
</form>
<form>
<id>ABDCEDF</id>
<records>
<record>
<isDelete>false</isDelete>
<fields>
<field>
<id>LOW_COD_CONDUCT</id>
<value>-1</value>
</field>
</fields>
</record>
</records>
</form>
<form>
<id>CODDUC</id>
<records>
<record>
<isDelete>false</isDelete>
<fields>
<field>
<id>LOW_SUPPLIER_TYPE</id>
<value>2</value>
</field>
<field>
<id>LOW_DO_INT_BOTH</id>
<value>1</value>
</field>
</fields>
</record>
</records>
</form>
</forms>
</supplier>
</integrationEntityDetails>
</integration-outbound:IntegrationEntity>
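The goal is a generic XML-to-CSV conversion. Based on an input config file, the XML should be flattened, split into multiple CSV files, and stored.
The inputs are the XML above and the config CSV file below. Three CSV files need to be created, using the corresponding XPATH entries listed in the config file.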
XPATH,ColumName,CSV_File_Name,ParentKey
/integration-outbound:IntegrationEntity/integrationEntityHeader/integrationTrackingNumber,integrationTrackingNumber,integrationEntityHeader.csv,
/integration-outbound:IntegrationEntity/integrationEntityHeader/referenceCodeForEntity,referenceCodeForEntity,integrationEntityHeader.csv,
/integration-outbound:IntegrationEntity/integrationEntityHeader/attachments/attachment[]/id,id,integrationEntityHeader.csv,
/integration-outbound:IntegrationEntity/integrationEntityHeader/attachments/attachment[]/name,name,integrationEntityHeader.csv,
/integration-outbound:IntegrationEntity/integrationEntityHeader/attachments/attachment[]/size,size,integrationEntityHeader.csv,
/integration-outbound:IntegrationEntity/integrationEntityHeader/source,source,integrationEntityHeader.csv,
/integration-outbound:IntegrationEntity/integrationEntityHeader/entity,entity,integrationEntityHeader.csv,
/integration-outbound:IntegrationEntity/integrationEntityHeader/action,action,integrationEntityHeader.csv,
/integration-outbound:IntegrationEntity/integrationEntityHeader/timestampUTC,timestampUTC,integrationEntityHeader.csv,
/integration-outbound:IntegrationEntity/integrationEntityHeader/zDocBaseVersion,zDocBaseVersion,integrationEntityHeader.csv,
/integration-outbound:IntegrationEntity/integrationEntityHeader/zDocCustomVersion,zDocCustomVersion,integrationEntityHeader.csv,
/integration-outbound:IntegrationEntity/integrationEntityHeader/integrationTrackingNumber,integrationTrackingNumber,integrationEntityDetailsControlBlock.csv,Y
/integration-outbound:IntegrationEntity/integrationEntityHeader/referenceCodeForEntity,referenceCodeForEntity,integrationEntityDetailsControlBlock.csv,Y
/integration-outbound:IntegrationEntity/integrationEntityDetails/supplier/requestId,requestId,integrationEntityDetailsControlBlock.csv,
/integration-outbound:IntegrationEntity/integrationEntityDetails/supplier/controlBlock/dataProcessingInfo[]/key,key,integrationEntityDetailsControlBlock.csv,
/integration-outbound:IntegrationEntity/integrationEntityDetails/supplier/controlBlock/dataProcessingInfo[]/value,value,integrationEntityDetailsControlBlock.csv,
/integration-outbound:IntegrationEntity/integrationEntityDetails/supplier/id,supplier_id,integrationEntityDetailsControlBlock.csv,
/integration-outbound:IntegrationEntity/integrationEntityDetails/supplier/forms/form[]/id,id,integrationEntityDetailsForms.csv,
/integration-outbound:IntegrationEntity/integrationEntityDetails/supplier/forms/form[]/records/record[]/Internalid,Internalid,integrationEntityDetailsForms.csv,
/integration-outbound:IntegrationEntity/integrationEntityDetails/supplier/forms/form[]/records/record[]/isDelete,FormId,integrationEntityDetailsForms.csv,
/integration-outbound:IntegrationEntity/integrationEntityDetails/supplier/forms/form[]/records/record[]/fields/field[]/id,SupplierFormRecordFieldId,integrationEntityDetailsForms.csv,
/integration-outbound:IntegrationEntity/integrationEntityDetails/supplier/forms/form[]/records/record[]/fields/field[]/value,SupplierFormRecordFieldValue,integrationEntityDetailsForms.csv,
/integration-outbound:IntegrationEntity/integrationEntityHeader/integrationTrackingNumber,integrationTrackingNumber,integrationEntityDetailsForms.csv,Y
/integration-outbound:IntegrationEntity/integrationEntityHeader/referenceCodeForEntity,referenceCodeForEntity,integrationEntityDetailsForms.csv,Y
/integration-outbound:IntegrationEntity/integrationEntityDetails/supplier/requestId,requestId,integrationEntityDetailsForms.csv,Y
/integration-outbound:IntegrationEntity/integrationEntityDetails/supplier/id,supplier_id,integrationEntityDetailsForms.csv,Y
I need to create 3 CSV files as output.
The design is to take each CSV file in turn, collect its XPaths, and pick the corresponding values out of the XML.
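The design described above starts by grouping the config rows by target file. A minimal sketch of that grouping, using only the standard library; the shortened config text and the helper name `group_config` are illustrative assumptions, not part of the original code:

```python
import csv
import io
from collections import defaultdict

# Hypothetical, shortened stand-in for the config file shown above.
CONFIG_TEXT = """XPATH,ColumName,CSV_File_Name,ParentKey
/root/header/trackingNumber,trackingNumber,header.csv,
/root/header/attachments/attachment[]/id,id,header.csv,
/root/header/trackingNumber,trackingNumber,forms.csv,Y
/root/details/forms/form[]/id,form_id,forms.csv,
"""


def group_config(config_text):
    """Return {csv_file_name: [(xpath, column_name, is_parent_key), ...]}."""
    plan = defaultdict(list)
    for row in csv.DictReader(io.StringIO(config_text)):
        plan[row["CSV_File_Name"]].append(
            (row["XPATH"], row["ColumName"], row["ParentKey"] == "Y")
        )
    return dict(plan)


plan = group_config(CONFIG_TEXT)
for file_name, columns in plan.items():
    # Print each target file with the output column names it needs
    print(file_name, [name for _, name, _ in columns])
```

Each entry of `plan` then describes one output file, so the extraction logic can be written once and run per file.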
Step 1 - Convert the XML to JSON -
import json
import pandas as pd
import xmltodict

# Parse the XML file into a nested dict
with open("/home/s0998hws/test.xml") as xml_file:
    data_dict = xmltodict.parse(xml_file.read())

# Serialize the dict to JSON and write it to the output file
json_data = json.dumps(data_dict)
with open("data.json", "w") as json_file:
    json_file.write(json_data)

# Load the JSON back as a Python object
with open('data.json') as f:
    d = json.load(f)
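Step 2 - Normalize using the pandas json_normalize function -
Convert each XPath by turning / into . and treating [] as a further delimiter, then build the columns to extract from the JSON. That is, the code looks up /integration-outbound:IntegrationEntity/integrationEntityHeader/integrationTrackingNumber and converts it to
.integrationEntityHeader.integrationTrackingNumber, and the column at the first [] gets exploded: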
df_1=pd.json_normalize(data=d['integration-outbound:IntegrationEntity'])
df_2=df_1[['integrationEntityHeader.integrationTrackingNumber','integrationEntityDetails.supplier.requestId','integrationEntityHeader.referenceCodeForEntity','integrationEntityDetails.supplier.id','integrationEntityDetails.supplier.forms.form']]
df_3=df_2.explode('integrationEntityDetails.supplier.forms.form')
df_3['integrationEntityDetails.supplier.forms.form.id']=df_3['integrationEntityDetails.supplier.forms.form'].apply(lambda x: x.get('id'))
df_3['integrationEntityDetails.supplier.forms.form.records']=df_3['integrationEntityDetails.supplier.forms.form'].apply(lambda x: x.get('records'))
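The XPath-to-column conversion described in Step 2 can be sketched as a pair of small helpers. The names `xpath_to_column` and `first_explode_column` are hypothetical; the sketch assumes, as in the code above, that the root key has already been indexed away before `json_normalize` is called:

```python
def xpath_to_column(xpath, root):
    """Turn a config XPATH into the dotted column name json_normalize
    produces once the root key has been stripped (hypothetical helper)."""
    parts = xpath.strip("/").split("/")
    if parts and parts[0] == root:
        parts = parts[1:]
    return ".".join(parts).replace("[]", "")


def first_explode_column(xpath, root):
    """Return the dotted path up to the first [] marker, or None if the
    XPATH has no [] marker at all."""
    if "[]" not in xpath:
        return None
    head = xpath.split("[]", 1)[0]
    return xpath_to_column(head, root)


ROOT = "integration-outbound:IntegrationEntity"
xp = ("/integration-outbound:IntegrationEntity/integrationEntityDetails/"
      "supplier/forms/form[]/records/record[]/Internalid")
column = xpath_to_column(xp, ROOT)
explode_at = first_explode_column(xp, ROOT)
print(column)
print(explode_at)
```

The second value is the column that would be passed to `DataFrame.explode` first; deeper `[]` markers need the same treatment after each level has been unpacked.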
I tried to drive the processing from the metadata in the config CSV file, but the challenge is that
df_3['integrationEntityDetails.supplier.forms.form.records.record.Internalid']=df_3['integrationEntityDetails.supplier.forms.form.records.record'].apply(lambda x: x.get('Internalid'))
fails with the error -
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/local/lib64/python3.6/site-packages/pandas/core/series.py", line 3848, in apply
mapped = lib.map_infer(values, f, convert=convert_dtype)
File "pandas/_libs/lib.pyx", line 2327, in pandas._libs.lib.map_infer
File "<stdin>", line 1, in <lambda>
AttributeError: 'list' object has no attribute 'get'
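The reason is that some cells in the pandas DataFrame hold a list rather than a dict, so they cannot be read with the approach above.
Below is the output generated so far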
integrationEntityHeader.integrationTrackingNumber integrationEntityDetails.supplier.requestId integrationEntityHeader.referenceCodeForEntity integrationEntityDetails.supplier.id integrationEntityDetails.supplier.forms.form integrationEntityDetails.supplier.forms.form.id integrationEntityDetails.supplier.forms.form.records
0 281#999eb16e-242c-4239-b33e-ae6f5296fb15#10c7338c-ab63-4c1d-9d26-877ba40a4b4b#1583507840354 2614352 25428 1647059 {'id': 'CATEGORY_PRODSER', 'records': {'record': [{'Internalid': '24348', 'isDelete': 'false', 'fields': {'field': [{'id': 'CATEGOR_LEVEL_1', 'value': 'MR'}, {'id': 'LOW_PRODSERV', 'value': 'RES'}, {'id': 'LOW_LEVEL_2', 'value': 'keylevel221'}, {'id': 'LOW_LEVEL_3', 'value': 'keylevel3127'}, {'id': 'LOW_LEVEL_4', 'value': 'keylevel4434'}, {'id': 'LOW_LEVEL_5', 'value': 'keylevel5545'}]}}, {'Internalid': '24349', 'isDelete': 'false', 'fields': {'field': [{'id': 'CATEGOR_LEVEL_1', 'value': 'MR'}, {'id': 'LOW_PRODSERV', 'value': 'RES'}, {'id': 'LOW_LEVEL_2', 'value': 'keylevel221'}, {'id': 'LOW_LEVEL_3', 'value': 'keylevel3125'}, {'id': 'LOW_LEVEL_4', 'value': 'keylevel4268'}, {'id': 'LOW_LEVEL_5', 'value': 'keylevel5418'}]}}, {'Internalid': '24350', 'isDelete': 'false', 'fields': {'field': [{'id': 'CATEGOR_LEVEL_1', 'value': 'MR'}, {'id': 'LOW_PRODSERV', 'value': 'RES'}, {'id': 'LOW_LEVEL_2', 'value': 'keylevel221'}, {'id': 'LOW_LEVEL_3', 'value': 'keylevel3122'}, {'id': 'LOW_LEVEL_4', 'value': 'keylevel425'}, {'id': 'LOW_LEVEL_5', 'value': 'keylevel5221'}]}}]}} CATEGORY_PRODSER {'record': [{'Internalid': '24348', 'isDelete': 'false', 'fields': {'field': [{'id': 'CATEGOR_LEVEL_1', 'value': 'MR'}, {'id': 'LOW_PRODSERV', 'value': 'RES'}, {'id': 'LOW_LEVEL_2', 'value': 'keylevel221'}, {'id': 'LOW_LEVEL_3', 'value': 'keylevel3127'}, {'id': 'LOW_LEVEL_4', 'value': 'keylevel4434'}, {'id': 'LOW_LEVEL_5', 'value': 'keylevel5545'}]}}, {'Internalid': '24349', 'isDelete': 'false', 'fields': {'field': [{'id': 'CATEGOR_LEVEL_1', 'value': 'MR'}, {'id': 'LOW_PRODSERV', 'value': 'RES'}, {'id': 'LOW_LEVEL_2', 'value': 'keylevel221'}, {'id': 'LOW_LEVEL_3', 'value': 'keylevel3125'}, {'id': 'LOW_LEVEL_4', 'value': 'keylevel4268'}, {'id': 'LOW_LEVEL_5', 'value': 'keylevel5418'}]}}, {'Internalid': '24350', 'isDelete': 'false', 'fields': {'field': [{'id': 'CATEGOR_LEVEL_1', 
'value': 'MR'}, {'id': 'LOW_PRODSERV', 'value': 'RES'}, {'id': 'LOW_LEVEL_2', 'value': 'keylevel221'}, {'id': 'LOW_LEVEL_3', 'value': 'keylevel3122'}, {'id': 'LOW_LEVEL_4', 'value': 'keylevel425'}, {'id': 'LOW_LEVEL_5', 'value': 'keylevel5221'}]}}]}
0 281#999eb16e-242c-4239-b33e-ae6f5296fb15#10c7338c-ab63-4c1d-9d26-877ba40a4b4b#1583507840354 2614352 25428 1647059 {'id': 'OTHER_INFOR', 'records': {'record': {'isDelete': 'false', 'fields': {'field': [{'id': 'S_EAST', 'value': 'N'}, {'id': 'W_EST', 'value': 'N'}, {'id': 'M_WEST', 'value': 'N'}, {'id': 'N_EAST', 'value': 'N'}, {'id': 'LOW_AREYOU_ASSET', 'value': '-1'}, {'id': 'LOW_SWART_PROG', 'value': '-1'}]}}}} OTHER_INFOR {'record': {'isDelete': 'false', 'fields': {'field': [{'id': 'S_EAST', 'value': 'N'}, {'id': 'W_EST', 'value': 'N'}, {'id': 'M_WEST', 'value': 'N'}, {'id': 'N_EAST', 'value': 'N'}, {'id': 'LOW_AREYOU_ASSET', 'value': '-1'}, {'id': 'LOW_SWART_PROG', 'value': '-1'}]}}}
0 281#999eb16e-242c-4239-b33e-ae6f5296fb15#10c7338c-ab63-4c1d-9d26-877ba40a4b4b#1583507840354 2614352 25428 1647059 {'id': 'CORPORATESUSTAINABILITY', 'records': {'record': {'isDelete': 'false', 'fields': {'field': {'id': 'LOW_COD_CONDUCT', 'value': '-1'}}}}} CORPORATESUSTAINABILITY {'record': {'isDelete': 'false', 'fields': {'field': {'id': 'LOW_COD_CONDUCT', 'value': '-1'}}}}
0 281#999eb16e-242c-4239-b33e-ae6f5296fb15#10c7338c-ab63-4c1d-9d26-877ba40a4b4b#1583507840354 2614352 25428 1647059 {'id': 'PRODUCTSERVICES', 'records': {'record': {'isDelete': 'false', 'fields': {'field': [{'id': 'LOW_SUPPLIER_TYPE', 'value': '2'}, {'id': 'LOW_DO_INT_BOTH', 'value': '1'}]}}}} PRODUCTSERVICES {'record': {'isDelete': 'false', 'fields': {'field': [{'id': 'LOW_SUPPLIER_TYPE', 'value': '2'}, {'id': 'LOW_DO_INT_BOTH', 'value': '1'}]}}}
Expected output
integrationEntityDetailsForms.csv
integrationTrackingNumber requestId referenceCodeForEntity supplier.id integrationEntityDetails.supplier.forms.form.id InternalId isDelete SupplierFormRecordFieldId SupplierFormRecordFieldValue
281#963-4c1d-9d26-877ba40a4b4b#1583507840354 2614352 25428 1647059 CATEGORY_PRODSER 24348 FALSE CATEGOR_LEVEL_1 MR
281#963-4c1d-9d26-877ba40a4b4b#1583507840354 2614352 25428 1647059 CATEGORY_PRODSER 24348 FALSE LOW_PRODSERV RES
281#963-4c1d-9d26-877ba40a4b4b#1583507840354 2614352 25428 1647059 CATEGORY_PRODSER 24348 FALSE LOW_LEVEL_2 keylevel221
281#963-4c1d-9d26-877ba40a4b4b#1583507840354 2614352 25428 1647059 CATEGORY_PRODSER 24348 FALSE LOW_LEVEL_3 keylevel3127
281#963-4c1d-9d26-877ba40a4b4b#1583507840354 2614352 25428 1647059 CATEGORY_PRODSER 24348 FALSE LOW_LEVEL_4 keylevel4434
281#963-4c1d-9d26-877ba40a4b4b#1583507840354 2614352 25428 1647059 CATEGORY_PRODSER 24348 FALSE LOW_LEVEL_5 keylevel5545
281#963-4c1d-9d26-877ba40a4b4b#1583507840354 2614352 25428 1647059 CATEGORY_PRODSER 24350 FALSE CATEGOR_LEVEL_1 MR
281#963-4c1d-9d26-877ba40a4b4b#1583507840354 2614352 25428 1647059 CATEGORY_PRODSER 24350 FALSE LOW_PRODSERV RES
281#963-4c1d-9d26-877ba40a4b4b#1583507840354 2614352 25428 1647059 CATEGORY_PRODSER 24350 FALSE LOW_LEVEL_2 keylevel221
281#963-4c1d-9d26-877ba40a4b4b#1583507840354 2614352 25428 1647059 CATEGORY_PRODSER 24350 FALSE LOW_LEVEL_3 keylevel3122
281#963-4c1d-9d26-877ba40a4b4b#1583507840354 2614352 25428 1647059 CATEGORY_PRODSER 24350 FALSE LOW_LEVEL_4 keylevel425
281#963-4c1d-9d26-877ba40a4b4b#1583507840354 2614352 25428 1647059 CATEGORY_PRODSER 24350 FALSE LOW_LEVEL_5 keylevel5221
281#963-4c1d-9d26-877ba40a4b4b#1583507840354 2614352 25428 1647059 OTHER_INFOR FALSE S_EAST N
281#963-4c1d-9d26-877ba40a4b4b#1583507840354 2614352 25428 1647059 OTHER_INFOR FALSE W_EST N
281#963-4c1d-9d26-877ba40a4b4b#1583507840354 2614352 25428 1647059 OTHER_INFOR FALSE M_WEST N
281#963-4c1d-9d26-877ba40a4b4b#1583507840354 2614352 25428 1647059 OTHER_INFOR FALSE N_EAST N
281#963-4c1d-9d26-877ba40a4b4b#1583507840354 2614352 25428 1647059 OTHER_INFOR FALSE LOW_AREYOU_ASSET -1
281#963-4c1d-9d26-877ba40a4b4b#1583507840354 2614352 25428 1647059 CORPORATESUSTAINABILITY FALSE LOW_SWART_PROG -1
281#963-4c1d-9d26-877ba40a4b4b#1583507840354 2614352 25428 1647059 CORPORATESUSTAINABILITY FALSE LOW_COD_CONDUCT -1
281#963-4c1d-9d26-877ba40a4b4b#1583507840354 2614352 25428 1647059 PRODUCTSERVICES FALSE LOW_SUPPLIER_TYPE 2
281#963-4c1d-9d26-877ba40a4b4b#1583507840354 2614352 25428 1647059 PRODUCTSERVICES FALSE LOW_DO_INT_BOTH 1
I think this line is missing from the question:
df_3['integrationEntityDetails.supplier.forms.form.records.record'] = (
    df_3['integrationEntityDetails.supplier.forms.form.records'].apply(
        lambda x: x.get('record')
    )
)
Then, for Internalid, you can do this:
df_3['integrationEntityDetails.supplier.forms.form.records.record.Internalid'] = (
    df_3['integrationEntityDetails.supplier.forms.form.records.record'].apply(
        # A repeated <record> arrives as a list, a single one as a dict
        lambda x: x[0].get('Internalid') if isinstance(x, list) else x.get('Internalid')
    )
)
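Note that indexing with `x[0]` reads only the first record when a form has several. A minimal sketch of one way to keep every record and field, assuming xmltodict-style data where a single child is a dict and repeated children are a list; `as_list`, `flatten_forms` and the shortened `sample` data are illustrative, not taken from the question:

```python
def as_list(value):
    """Wrap a lone dict in a list so single and repeated children
    can be iterated the same way; None becomes an empty list."""
    if value is None:
        return []
    return value if isinstance(value, list) else [value]


def flatten_forms(forms):
    """Produce one row per form / record / field combination."""
    rows = []
    for form in as_list(forms):
        for record in as_list((form.get("records") or {}).get("record")):
            for field in as_list((record.get("fields") or {}).get("field")):
                rows.append({
                    "FormId": form.get("id"),
                    "Internalid": record.get("Internalid"),
                    "isDelete": record.get("isDelete"),
                    "SupplierFormRecordFieldId": field.get("id"),
                    "SupplierFormRecordFieldValue": field.get("value"),
                })
    return rows


# Shortened sample in the shape shown in the generated output above
sample = [
    {"id": "CATEGORY_PRODSER", "records": {"record": [
        {"Internalid": "24348", "isDelete": "false",
         "fields": {"field": [{"id": "CATEGOR_LEVEL_1", "value": "MR"},
                              {"id": "LOW_PRODSERV", "value": "RES"}]}},
        {"Internalid": "24349", "isDelete": "false",
         "fields": {"field": [{"id": "CATEGOR_LEVEL_1", "value": "MR"}]}},
    ]}},
    {"id": "ABDCEDF", "records": {"record": {
        "isDelete": "false",
        "fields": {"field": {"id": "LOW_COD_CONDUCT", "value": "-1"}}}}},
]
rows = flatten_forms(sample)
for row in rows:
    print(row)
```

The resulting list of dicts can be passed to `pd.DataFrame(rows)` and joined with the parent key columns.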
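Consider XSLT, the special-purpose language designed to transform XML files, for example by flattening them at certain sections. Python's third-party module, lxml, can run XSLT 1.0 scripts and XPath 1.0 expressions.
Specifically, XSLT can handle your XPath extractions. Then, from the single transformed result tree, build the three required data frames. For well-formedness, the code below assumes the following root element and data structure: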
<integration-outbound:IntegrationEntity
xmlns:integration-outbound="http://example.com"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
...same content...
</integration-outbound:IntegrationEntity>
XSLT (save as a .xsl file, which is a special kind of .xml file)
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
xmlns:integration-outbound="http://example.com"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
<xsl:output method="xml" omit-xml-declaration="yes" indent="yes"/>
<xsl:strip-space elements="*"/>
<xsl:template match="integration-outbound:IntegrationEntity">
<data>
<xsl:apply-templates select="integrationEntityHeader/descendant::attachment"/>
<xsl:apply-templates select="integrationEntityDetails/descendant::dataProcessingInfo"/>
<xsl:apply-templates select="integrationEntityDetails/descendant::forms/descendant::field"/>
</data>
</xsl:template>
<xsl:template match="attachment">
<integrationEntityHeader>
<xsl:copy-of select="ancestor::integrationEntityHeader/*[name()!='attachments']"/>
<xsl:copy-of select="*"/>
</integrationEntityHeader>
</xsl:template>
<xsl:template match="dataProcessingInfo">
<integrationEntityDetailsControlBlock>
<xsl:copy-of select="ancestor::integration-outbound:IntegrationEntity/integrationEntityHeader/*[position() &lt;= 2]"/>
<requestId><xsl:value-of select="ancestor::supplier/requestId"/></requestId>
<supplier_id><xsl:value-of select="ancestor::supplier/id"/></supplier_id>
<xsl:copy-of select="*"/>
</integrationEntityDetailsControlBlock>
</xsl:template>
<xsl:template match="field">
<integrationEntityDetailsForms>
<form_id><xsl:value-of select="ancestor::form/id"/></form_id>
<xsl:copy-of select="ancestor::record/*[name()!='fields']"/>
<SupplierFormRecordFieldId><xsl:value-of select="id"/></SupplierFormRecordFieldId>
<SupplierFormRecordFieldValue><xsl:value-of select="value"/></SupplierFormRecordFieldValue>
<xsl:copy-of select="ancestor::integration-outbound:IntegrationEntity/integrationEntityHeader/*[position() &lt;= 2]"/>
<requestId><xsl:value-of select="ancestor::supplier/requestId"/></requestId>
<supplier_id><xsl:value-of select="ancestor::supplier/id"/></supplier_id>
</integrationEntityDetailsForms>
</xsl:template>
</xsl:stylesheet>
Python
import lxml.etree as et
import pandas as pd
# LOAD XML AND XSL
doc = et.parse('Input.xml')
style = et.parse('Script.xsl')
# INITIALIZE AND RUN TRANSFORMATION
transformer = et.XSLT(style)
flat_doc = transformer(doc)
# BUILD THREE DATA FRAMES
df_header = pd.DataFrame([{i.tag:i.text for i in el}
for el in flat_doc.xpath('integrationEntityHeader')])
df_detailsControlBlock = pd.DataFrame([{i.tag:i.text for i in el}
for el in flat_doc.xpath('integrationEntityDetailsControlBlock')])
df_detailsForms = pd.DataFrame([{i.tag:i.text for i in el}
for el in flat_doc.xpath('integrationEntityDetailsForms')])
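The row-building pattern above (one dict per flattened element, child tag mapped to child text) can be tried without lxml or pandas. The following is a standard-library stand-in, not the answer's method; `FLAT_XML` is a hypothetical fragment shaped like the XSLT output, and `rows_for` and `to_csv_text` are illustrative names:

```python
import csv
import io
import xml.etree.ElementTree as ET

# Hypothetical flattened result tree, shaped like the XSLT output above.
FLAT_XML = """<data>
  <integrationEntityDetailsControlBlock>
    <requestId>2614352</requestId>
    <supplier_id>1647059</supplier_id>
    <key>MODE</key>
    <value>Onboarding</value>
  </integrationEntityDetailsControlBlock>
  <integrationEntityDetailsControlBlock>
    <requestId>2614352</requestId>
    <supplier_id>1647059</supplier_id>
    <key>Supplier_Type</key>
    <value>Operational</value>
  </integrationEntityDetailsControlBlock>
</data>"""


def rows_for(root, tag):
    """One dict per <tag> element: child tag -> child text."""
    return [{child.tag: child.text for child in el} for el in root.findall(tag)]


def to_csv_text(rows):
    """Serialize row dicts as CSV, using the first row's keys as the header."""
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=list(rows[0]), lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return out.getvalue()


root = ET.fromstring(FLAT_XML)
rows = rows_for(root, "integrationEntityDetailsControlBlock")
csv_text = to_csv_text(rows)
print(csv_text)
```

With the pandas frames built above, the equivalent final step is a call such as `df_header.to_csv('integrationEntityHeader.csv', index=False)` for each of the three frames.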
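Convert the XML to a dict and then write the parsing logic, since JSON can be handled the same way. Stack Overflow was very helpful, and this solution is built from the responses in all those links. For simplicity I created a 3-level nested XML. This works on Python 3.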
<?xml version="1.0"?><Company><Employee><FirstName>Hal</FirstName><LastName>Thanos</LastName><ContactNo>122131</ContactNo><Email>hal.thanos@xyz.com</Email><Addresses><Address><City>Bangalore</City><State>Karnataka</State><Zip>560212</Zip><forms><form><id>ID1</id><value>LIC</value></form><form><id>ID2</id><value>PAS</value></form></forms></Address></Addresses></Employee><Employee><FirstName>Iron</FirstName><LastName>Man</LastName><ContactNo>12324</ContactNo><Email>iron.man@xyz.com</Email><Addresses><Address><type>Permanent</type><City>Bangalore</City><State>Karnataka</State><Zip>560212</Zip><forms><form><id>ID3</id><value>LIC</value></form></forms></Address><Address><type>Temporary</type><City>Concord</City><State>NC</State><Zip>28027</Zip><forms><form><id>ID1</id><value>LIC</value></form><form><id>ID2</id><value>PAS</value></form><form><id>ID3</id><value>SSN</value></form><form><id>ID2</id><value>CC</value></form></forms></Address></Addresses></Employee></Company>
<?xml version="1.0"?><Company><Employee><FirstName>Captain</FirstName><LastName>America</LastName><ContactNo>13322</ContactNo><Email>captain.america@xyz.com</Email><Addresses><Address><City>Trivandrum</City><State>Kerala</State><Zip>28115</Zip><forms><form><id>ID1</id><value>LIC</value></form><form><id>ID2</id><value>PAS</value></form></forms></Address></Addresses></Employee><Employee><FirstName>Sword</FirstName><LastName>Man</LastName><ContactNo>12324</ContactNo><Email>sword.man@xyz.com</Email><Addresses><Address><type>Permanent</type><City>Bangalore</City><State>Karnataka</State><Zip>560212</Zip><forms><form><id>ID3</id><value>LIC</value></form></forms></Address><Address><type>Temporary</type><City>Concord</City><State>NC</State><Zip>28027</Zip><forms><form><id>ID1</id><value>LIC</value></form><form><id>ID2</id><value>PAS</value></form><form><id>ID3</id><value>SSN</value></form><form><id>ID2</id><value>CC</value></form></forms></Address></Addresses></Employee></Company>
<?xml version="1.0"?><Company><Employee><FirstName>Thor</FirstName><LastName>Odison</LastName><ContactNo>156565</ContactNo><Email>thor.odison@xyz.com</Email><Addresses><Address><City>Tirunelveli</City><State>TamilNadu</State><Zip>36595</Zip><forms><form><id>ID1</id><value>LIC</value></form><form><id>ID2</id><value>PAS</value></form></forms></Address></Addresses></Employee><Employee><FirstName>Spider</FirstName><LastName>Man</LastName><ContactNo>12324</ContactNo><Email>spider.man@xyz.com</Email><Addresses><Address><type>Permanent</type><City>Bangalore</City><State>Karnataka</State><Zip>560212</Zip><forms><form><id>ID3</id><value>LIC</value></form></forms></Address><Address><type>Temporary</type><City>Concord</City><State>NC</State><Zip>28027</Zip><forms><form><id>ID1</id><value>LIC</value></form><form><id>ID2</id><value>PAS</value></form><form><id>ID3</id><value>SSN</value></form><form><id>ID2</id><value>CC</value></form></forms></Address></Addresses></Employee></Company>
<?xml version="1.0"?><Company><Employee><FirstName>Black</FirstName><LastName>Widow</LastName><ContactNo>16767</ContactNo><Email>black.widow@xyz.com</Email><Addresses><Address><City>Mysore</City><State>Karnataka</State><Zip>12478</Zip><forms><form><id>ID1</id><value>LIC</value></form></forms></Address></Addresses></Employee><Employee><FirstName>White</FirstName><LastName>Man</LastName><ContactNo>5634</ContactNo><Email>white.man@xyz.com</Email><Addresses><Address><type>Permanent</type><City>Bangalore</City><State>Karnataka</State><Zip>560212</Zip><forms><form><id>ID3</id><value>LIC</value></form></forms></Address><Address><type>Temporary</type><City>Concord</City><State>NC</State><Zip>28027</Zip><forms><form><id>ID1</id><value>LIC</value></form><form><id>ID2</id><value>PAS</value></form><form><id>ID3</id><value>SSN</value></form><form><id>ID2</id><value>CC</value></form></forms></Address></Addresses></Employee></Company>
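In the config file for this XML, every column that can be an array, is nested across multiple levels, or must be exploded should be marked with []. The header row is required, because the code refers to the columns by name.
Change the variables to match where your files are stored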
process_config_csv = 'config.csv'
xml_file_name = 'test.xml'
XPATH,ColumName,CSV_File_Name
/Company/Employee[]/FirstName,FirstName,Name.csv
/Company/Employee[]/LastName,LastName,Name.csv
/Company/Employee[]/ContactNo,ContactNo,Name.csv
/Company/Employee[]/Email,Email,Name.csv
/Company/Employee[]/FirstName,FirstName,Address.csv
/Company/Employee[]/LastName,LastName,Address.csv
/Company/Employee[]/ContactNo,ContactNo,Address.csv
/Company/Employee[]/Email,Email,Address.csv
/Company/Employee[]/Addresses/Address[]/City,City,Address.csv
/Company/Employee[]/Addresses/Address[]/State,State,Address.csv
/Company/Employee[]/Addresses/Address[]/Zip,Zip,Address.csv
/Company/Employee[]/Addresses/Address[]/type,type,Address.csv
/Company/Employee[]/FirstName,FirstName,Form.csv
/Company/Employee[]/LastName,LastName,Form.csv
/Company/Employee[]/ContactNo,ContactNo,Form.csv
/Company/Employee[]/Email,Email,Form.csv
/Company/Employee[]/Addresses/Address[]/type,type,Form.csv
/Company/Employee[]/Addresses/Address[]/forms/form[]/id,id,Form.csv
/Company/Employee[]/Addresses/Address[]/forms/form[]/value,value,Form.csv
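The `[]` convention in the config above determines which columns must be exploded, and in which order. A minimal sketch of deriving those levels from the XPATH column; `explode_levels` is a hypothetical helper, and it records every `[]` marker along a path, which is an assumption made here for clarity:

```python
from collections import defaultdict


def explode_levels(xpaths):
    """Map nesting depth -> sorted dotted columns to explode at that depth."""
    levels = defaultdict(set)
    for xpath in xpaths:
        # Splitting on [] leaves one segment per marker plus the trailing leaf
        parts = xpath.strip("/").replace("/", ".").split("[]")
        for depth in range(1, len(parts)):
            levels[depth].add("".join(parts[:depth]))
    return {depth: sorted(cols) for depth, cols in sorted(levels.items())}


xpaths = [
    "/Company/Employee[]/FirstName",
    "/Company/Employee[]/Addresses/Address[]/type",
    "/Company/Employee[]/Addresses/Address[]/forms/form[]/id",
]
levels = explode_levels(xpaths)
for depth, cols in levels.items():
    print(depth, cols)
```

For Form.csv this gives three levels, so the frame is exploded on Employee first, then on Address, then on form.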
The code that creates multiple CSV files based on the config file is
import json
import xmltodict
import json
import os
import csv
import numpy as np
import pandas as pd
import sys
from collections import defaultdict
import numpy as np
def getMatches(L1, L2):
R = set()
for elm in L1:
for pat in L2:
if elm.find(pat) != -1:
if elm.find('.', len(pat)+1) != -1:
R.add(elm[:elm.find('.', len(pat)+1)])
else:
R.add(elm)
return list(R)
def xml_parse(xml_file_name):
try:
process_xml_file = xml_file_name
with open(process_xml_file) as xml_file:
for xml_string in xml_file:
"""Converting the xml to Dict"""
data_dict = xmltodict.parse(xml_string)
"""Converting the dict to Pandas DF"""
df_processing = pd.json_normalize(data_dict)
xml_parse_loop(df_processing)
xml_file.close()
except Exception as e:
s = str(e)
print(s)
def xml_parse_loop(df_processing_input):
CSV_File_Name = []
"""Getting the list of csv Files to be created"""
with open(process_config_csv, newline='') as csvfile:
DataCaptured = csv.DictReader(csvfile)
for row in DataCaptured:
if row['CSV_File_Name'] not in CSV_File_Name:
CSV_File_Name.append(row['CSV_File_Name'])
"""Iterating the list of CSV"""
for items in CSV_File_Name:
df_processing = df_processing_input
df_subset_process = []
df_subset_list_all_cols = []
df_process_sub_explode_Level = []
df_final_column_name = []
print('Parsing the xml file for creating the file - ' + str(items))
"""Fetching the field list for processs from the confic File"""
with open(process_config_csv, newline='') as csvfile:
DataCaptured = csv.DictReader(csvfile)
for row in DataCaptured:
if row['CSV_File_Name'] in items:
df_final_column_name.append(row['ColumName'])
"""Getting the columns until the first [] """
df_subset_process.append(row['XPATH'].strip('/').replace("/",".").split('[]')[0])
"""Getting the All the columnnames"""
df_subset_list_all_cols.append(row['XPATH'].strip('/').replace("/",".").replace("[]",""))
"""Getting the All the Columns to explode"""
df_process_sub_explode_Level.append(row['XPATH'].strip('/').replace('/', '.').split('[]'))
explode_ld = defaultdict(set)
"""Putting Level of explode and column names"""
for x in df_process_sub_explode_Level:
if len(x) > 1:
            explode_ld[len(x) - 1].add(''.join(x[: -1]))
        explode_ld = {k: list(v) for k, v in explode_ld.items()}
        #print(' The All column list is for the file ' + items + " is " + str(df_subset_list_all_cols))
        #print(' The first processing for the file ' + items + " is " + str(df_subset_process))
        #print('The explode level of attributes for the file ' + items + " is " + str(explode_ld))
        # Remove duplicate columns while keeping their order
        df_subset_process = list(dict.fromkeys(df_subset_process))
        for col in df_subset_process:
            if col not in df_processing.columns:
                df_processing[col] = np.nan
        df_processing = df_processing[df_subset_process]
        df_processing_col_list = df_processing.columns.tolist()
        print('The total levels to be exploded : %d' % len(explode_ld))
        level = len(explode_ld)
        for i in range(level):
            print(' Exploding the Level : %d' % i)
            df_processing_col_list = df_processing.columns.tolist()
            list_of_explode = set(df_processing_col_list) & set(explode_ld[i + 1])
            #print('List to explode' + str(list_of_explode))
            # Explode only the rows that hold a list; in some XML the same
            # column is a single value, and those rows are kept unchanged
            for c in list_of_explode:
                print(' There are columns present which need to be exploded - ' + str(c))
                is_list = [isinstance(item, list) for item in df_processing[c]]
                not_list = [not flag for flag in is_list]
                df_processing = pd.concat((df_processing.iloc[is_list].explode(c), df_processing.iloc[not_list]))
            print(' Finding the columns that need to be fetched ')
            # From the overall column list, fetch the attributes below the exploded columns
            next_level_pro_lst = getMatches(df_subset_list_all_cols, explode_ld[i + 1])
            #print(next_level_pro_lst)
            df_processing_col_list = df_processing.columns.tolist()
            for nex in next_level_pro_lst:
                #print ("Fetching " + nex.rsplit('.', 1)[1] + ' from ' + nex.rsplit('.', 1)[0] + ' from ' + nex )
                parent_col = nex.rsplit('.', 1)[0]
                child_col = nex.rsplit('.', 1)[1]
                #print(parent_col)
                #print(df_processing_col_list)
                if parent_col not in df_processing_col_list:
                    df_processing[parent_col] = ""
                try:
                    df_processing[nex] = df_processing[parent_col].apply(lambda x: x.get(child_col))
                except AttributeError:
                    # The parent value is not a dict, so there is nothing to extract
                    df_processing[nex] = ""
                df_processing_col_list = df_processing.columns.tolist()
            if i == level - 1:
                print('Last Level nothing to be done')
            else:
                # Extract all columns until the next explode column list is found
                while len(set(df_processing_col_list) & set(explode_ld[i + 2])) == 0:
                    next_level_pro_lst = getMatches(df_subset_list_all_cols, next_level_pro_lst)
                    #print(next_level_pro_lst)
                    for nextval in next_level_pro_lst:
                        if nextval not in df_processing_col_list:
                            #print("Fetching " + nextval.rsplit('.', 1)[1] + ' from ' + nextval.rsplit('.', 1)[0] + ' from ' + nextval)
                            if nextval.rsplit('.', 1)[0] not in df_processing.columns:
                                df_processing[nextval.rsplit('.', 1)[0]] = ""
                            try:
                                df_processing[nextval] = df_processing[nextval.rsplit('.', 1)[0]].apply(lambda x: x.get(nextval.rsplit('.', 1)[1]))
                            except AttributeError:
                                df_processing[nextval] = ""
                            df_processing_col_list = df_processing.columns.tolist()
        df_processing = df_processing[df_subset_list_all_cols]
        df_processing.columns = df_final_column_name
        # If the file does not exist, write it with a header row
        if not os.path.isfile(items):
            print("The file does not exist, so writing a new one")
            df_processing.to_csv('{}'.format(items), header=True, index=False)
        else:  # otherwise append without writing the header again
            print("The file exists, so appending")
            df_processing.to_csv('{}'.format(items), mode='a', header=False, index=False)
from datetime import datetime

startTime = datetime.now().strftime("%Y%m%d_%H%M%S")
startTime = str(os.getpid()) + "_" + startTime
process_task_name = ''
process_config_csv = 'config.csv'
xml_file_name = 'test.xml'

# Wrap the built-in print so that every message carries a timestamp prefix
old_print = print

def timestamped_print(*args, **kwargs):
    now = datetime.now().strftime("%Y-%m-%d %H:%M:%S.%f")
    printheader = now + " xml_parser " + " " + process_task_name + " - "
    old_print(printheader, *args, **kwargs)

print = timestamped_print
xml_parse(xml_file_name)
The output created is:
$ cat Name.csv
FirstName,LastName,ContactNo,Email
Hal,Thanos,122131,hal.thanos@xyz.com
Iron,Man,12324,iron.man@xyz.com
Captain,America,13322,captain.america@xyz.com
Sword,Man,12324,sword.man@xyz.com
Thor,Odison,156565,thor.odison@xyz.com
Spider,Man,12324,spider.man@xyz.com
Black,Widow,16767,black.widow@xyz.com
White,Man,5634,white.man@xyz.com
$ cat Address.csv
FirstName,LastName,ContactNo,Email,City,State,Zip,type
Iron,Man,12324,iron.man@xyz.com,Bangalore,Karnataka,560212,Permanent
Iron,Man,12324,iron.man@xyz.com,Concord,NC,28027,Temporary
Hal,Thanos,122131,hal.thanos@xyz.com,Bangalore,Karnataka,560212,
Sword,Man,12324,sword.man@xyz.com,Bangalore,Karnataka,560212,Permanent
Sword,Man,12324,sword.man@xyz.com,Concord,NC,28027,Temporary
Captain,America,13322,captain.america@xyz.com,Trivandrum,Kerala,28115,
Spider,Man,12324,spider.man@xyz.com,Bangalore,Karnataka,560212,Permanent
Spider,Man,12324,spider.man@xyz.com,Concord,NC,28027,Temporary
Thor,Odison,156565,thor.odison@xyz.com,Tirunelveli,TamilNadu,36595,
White,Man,5634,white.man@xyz.com,Bangalore,Karnataka,560212,Permanent
White,Man,5634,white.man@xyz.com,Concord,NC,28027,Temporary
Black,Widow,16767,black.widow@xyz.com,Mysore,Karnataka,12478,
$ cat Form.csv
FirstName,LastName,ContactNo,Email,type,id,value
Iron,Man,12324,iron.man@xyz.com,Temporary,ID1,LIC
Iron,Man,12324,iron.man@xyz.com,Temporary,ID2,PAS
Iron,Man,12324,iron.man@xyz.com,Temporary,ID3,SSN
Iron,Man,12324,iron.man@xyz.com,Temporary,ID2,CC
Hal,Thanos,122131,hal.thanos@xyz.com,,ID1,LIC
Hal,Thanos,122131,hal.thanos@xyz.com,,ID2,PAS
Iron,Man,12324,iron.man@xyz.com,Permanent,ID3,LIC
Sword,Man,12324,sword.man@xyz.com,Temporary,ID1,LIC
Sword,Man,12324,sword.man@xyz.com,Temporary,ID2,PAS
Sword,Man,12324,sword.man@xyz.com,Temporary,ID3,SSN
Sword,Man,12324,sword.man@xyz.com,Temporary,ID2,CC
Captain,America,13322,captain.america@xyz.com,,ID1,LIC
Captain,America,13322,captain.america@xyz.com,,ID2,PAS
Sword,Man,12324,sword.man@xyz.com,Permanent,ID3,LIC
Spider,Man,12324,spider.man@xyz.com,Temporary,ID1,LIC
Spider,Man,12324,spider.man@xyz.com,Temporary,ID2,PAS
Spider,Man,12324,spider.man@xyz.com,Temporary,ID3,SSN
Spider,Man,12324,spider.man@xyz.com,Temporary,ID2,CC
Thor,Odison,156565,thor.odison@xyz.com,,ID1,LIC
Thor,Odison,156565,thor.odison@xyz.com,,ID2,PAS
Spider,Man,12324,spider.man@xyz.com,Permanent,ID3,LIC
White,Man,5634,white.man@xyz.com,Temporary,ID1,LIC
White,Man,5634,white.man@xyz.com,Temporary,ID2,PAS
White,Man,5634,white.man@xyz.com,Temporary,ID3,SSN
White,Man,5634,white.man@xyz.com,Temporary,ID2,CC
White,Man,5634,white.man@xyz.com,Permanent,ID3,LIC
Black,Widow,16767,black.widow@xyz.com,,ID1,LIC
These snippets and answers were pulled from different threads; thanks to
@Mark Tolonen @Mandy007 @deadshot
Create a dict of list using python from csv
https://whosebug.com/questions/62837949/extract-a-list-from-a-list
How to explode Panda column with data having different dict and list of dict
This can definitely be made shorter and more performant, and it can be enhanced further.
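As one possible direction for a shorter version, here is a minimal plain-Python sketch (it is not the script above) that walks the dict produced by xmltodict directly and fans out at every step marked with []. The helper names walk, as_list and rows are hypothetical, and the sketch assumes that all [] columns of one CSV file lie on a single nesting chain, as they do in the sample config.

```python
def walk(node, steps):
    """Follow plain dict keys; return None as soon as a key is missing."""
    for step in steps:
        if not isinstance(node, dict):
            return None
        node = node.get(step)
    return node


def as_list(value):
    """xmltodict gives a dict for one child and a list for several; unify them."""
    if value is None:
        return []
    return value if isinstance(value, list) else [value]


def rows(node, columns):
    """Yield one dict per output row for columns given as (name, steps) pairs."""
    scalars = {n: walk(node, s) for n, s in columns
               if not any(p.endswith("[]") for p in s)}
    nested = [(n, s) for n, s in columns if any(p.endswith("[]") for p in s)]
    if not nested:
        yield scalars
        return
    # Assumption: every nested column passes through the same first [] step.
    first = nested[0][1]
    cut = next(i for i, p in enumerate(first) if p.endswith("[]"))
    prefix = first[:cut] + [first[cut][:-2]]
    deeper = [(n, s[cut + 1:]) for n, s in nested]
    for item in as_list(walk(node, prefix)):
        for sub in rows(item, deeper):
            yield {**scalars, **sub}


# A few (XPATH, ColumName) pairs in the style of the config file
config = [
    ("/Company/Employee[]/FirstName", "FirstName"),
    ("/Company/Employee[]/Addresses/Address[]/City", "City"),
    ("/Company/Employee[]/Addresses/Address[]/type", "type"),
]
columns = [(name, xpath.strip("/").split("/")) for xpath, name in config]

# Hand-written stand-in for xmltodict output: one Address is a dict, one a list
data = {"Company": {"Employee": [
    {"FirstName": "Hal", "Addresses": {"Address": {"City": "Bangalore"}}},
    {"FirstName": "Iron", "Addresses": {"Address": [
        {"type": "Permanent", "City": "Bangalore"},
        {"type": "Temporary", "City": "Concord"},
    ]}},
]}}

result = list(rows(data, columns))
for row in result:
    print(row)
```

Each resulting dict can be handed to csv.DictWriter or pandas.DataFrame. Elements that are missing entirely produce no rows in this sketch, which may or may not be what you want.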
Spark to the rescue!
The following code is written in Scala, but it can easily be converted to Python if you prefer.
Databricks' XML library makes XML processing easy.
val headers = spark.read.format("xml").option("rowTag", "integrationEntityHeader").load("WhosebugRafaXML.xml")
headers.write.csv(<headerFilename>) // Create CSV from the header file
val details = spark.read.format("xml").option("rowTag", "integrationEntityDetails").load("WhosebugRafaXML.xml")
// The details need further unnesting. To get suppliers, for instance, you can do
val supplier = spark.read.format("xml").option("rowTag", "supplier").load("WhosebugRafaXML.xml")
supplier.show
+--------------------+--------------------+--------------------+--------------------+--------------------+------------+--------------------+-------+--------------------+---------+------+------------+----------+---------------------+
| allLocations| bankDetails| companyDetails| contactDetails| controlBlock|facilityCode| forms| id| myLocation|requestId|status|supplierType|systemCode|systemFacilityDetails|
+--------------------+--------------------+--------------------+--------------------+--------------------+------------+--------------------+-------+--------------------+---------+------+------------+----------+---------------------+
|[[HQ, 2501 GRANT ...|[[[[LOW_BANK_KEY,...|[No, SUPPLIER, 25...|[[[1704312142, SI...|[[[MODE, Onboardi...| 1|[[[CATEGORY_PRODS...|1647059|[[1704342, false,...| 2614352|ACTIVE| Operational| 1| [[ACTIVE, 1, 1]]|
+--------------------+--------------------+--------------------+--------------------+--------------------+------------+--------------------+-------+--------------------+---------+------+------------+----------+---------------------+
The format of the XML is a bit unfamiliar to me.
Have you tried pandas_read_xml?
pip install pandas_read_xml
You can do this:
import pandas_read_xml as pdx
df = pdx.read_xml('filename.xml')
To flatten, you can use
df = pdx.flatten(df)
or
df = pdx.fully_flatten(df)
I am trying to parse an XML file into multiple different files.
Sample XML
<integration-outbound:IntegrationEntity
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
<integrationEntityHeader>
<integrationTrackingNumber>281#963-4c1d-9d26-877ba40a4b4b#1583507840354</integrationTrackingNumber>
<referenceCodeForEntity>25428</referenceCodeForEntity>
<attachments>
<attachment>
<id>d6esd1d518b06019e01</id>
<name>durance.pdf</name>
<size>0</size>
</attachment>
<attachment>
<id>182e60164ddd4236b5bd96109</id>
<name>ssds</name>
<size>0</size>
</attachment>
</attachments>
<source>SIM</source>
<entity>SUPPLIER</entity>
<action>CREATE</action>
<timestampUTC>20200306T151721</timestampUTC>
<zDocBaseVersion>2.0</zDocBaseVersion>
<zDocCustomVersion>0</zDocCustomVersion>
</integrationEntityHeader>
<integrationEntityDetails>
<supplier>
<requestId>2614352</requestId>
<controlBlock>
<dataProcessingInfo>
<key>MODE</key>
<value>Onboarding</value>
</dataProcessingInfo>
<dataProcessingInfo>
<key>Supplier_Type</key>
<value>Operational</value>
</dataProcessingInfo>
</controlBlock>
<id>1647059</id>
<facilityCode>0001</facilityCode>
<systemCode>1</systemCode>
<supplierType>Operational</supplierType>
<systemFacilityDetails>
<systemFacilityDetail>
<facilityCode>0001</facilityCode>
<systemCode>1</systemCode>
<FacilityStatus>ACTIVE</FacilityStatus>
</systemFacilityDetail>
</systemFacilityDetails>
<status>ACTIVE</status>
<companyDetails>
<displayGSID>254232128</displayGSID>
<legalCompanyName>asdasdsads</legalCompanyName>
<dunsNumber>03-175-2493</dunsNumber>
<legalStructure>1</legalStructure>
<website>www.aaadistributor.com</website>
<noEmp>25</noEmp>
<companyIndicator1099>No</companyIndicator1099>
<taxidAndWxformRequired>NO</taxidAndWxformRequired>
<taxidFormat>Fed. Tax</taxidFormat>
<wxForm>182e601649ade4c38cd4236b5bd96109</wxForm>
<taxid>27-2204474</taxid>
<companyTypeFix>SUPPLIER</companyTypeFix>
<fields>
<field>
<id>LOW_CUURENT_SERV</id>
<value>1</value>
</field>
<field>
<id>LOW_COI</id>
<value>USA</value>
</field>
<field>
<id>LOW_STATE_INCO</id>
<value>US-PA</value>
</field>
<field>
<id>CERT_INSURANCE</id>
<value>d6e6e460fe8958564c1d518b06019e01</value>
</field>
<field>
<id>COMP_DBA</id>
<value>asdadas</value>
</field>
<field>
<id>LOW_AREUDIVE</id>
<value>N</value>
</field>
<field>
<id>LOW_BU_SIZE1</id>
<value>SMLBUS</value>
</field>
<field>
<id>EDI_CAP</id>
<value>Y</value>
</field>
<field>
<id>EDI_WEB</id>
<value>N</value>
</field>
<field>
<id>EDI_TRAD</id>
<value>N</value>
</field>
</fields>
</companyDetails>
<allLocations>
<location>
<addressInternalid>1704342</addressInternalid>
<isDelete>false</isDelete>
<internalSupplierid>1647059</internalSupplierid>
<acctGrpid>HQ</acctGrpid>
<address1>2501 GRANT AVE</address1>
<country>USA</country>
<state>US-PA</state>
<city>PHILADELPHIA</city>
<zip>19114</zip>
<phone>(215) 745-7900</phone>
</location>
</allLocations>
<contactDetails>
<contactDetail>
<contactInternalid>12232</contactInternalid>
<isDelete>false</isDelete>
<addressInternalid>1704312142</addressInternalid>
<contactType>Main</contactType>
<firstName>Raf</firstName>
<lastName>jas</lastName>
<title>Admin</title>
<email>abcd@gmail.com</email>
<phoneNo>123-42-23-23</phoneNo>
<createPortalLogin>yes</createPortalLogin>
<allowedPortalSideProducts>SIM,iSource,iContract</allowedPortalSideProducts>
</contactDetail>
<contactDetail>
<contactInternalid>1944938</contactInternalid>
<isDelete>false</isDelete>
<addressInternalid>1704342</addressInternalid>
<contactType>Rad</contactType>
<firstName>AVs</firstName>
<lastName>asd</lastName>
<title>Founder</title>
<email>as@sds.com</email>
<phoneNo>21521-2112-7900</phoneNo>
<createPortalLogin>yes</createPortalLogin>
<allowedPortalSideProducts>SIM,iContract,iSource</allowedPortalSideProducts>
</contactDetail>
</contactDetails>
<myLocation>
<addresses>
<myLocationsInternalid>1704342</myLocationsInternalid>
<isDelete>false</isDelete>
<addressInternalid>1704342</addressInternalid>
<usedAt>N</usedAt>
</addresses>
</myLocation>
<bankDetails>
<fields>
<field>
<id>LOW_BANK_KEY</id>
<value>123213</value>
</field>
<field>
<id>LOW_EFT</id>
<value>123123</value>
</field>
</fields>
</bankDetails>
<forms>
<form>
<id>CATEGORY_PRODSER</id>
<records>
<record>
<Internalid>24348</Internalid>
<isDelete>false</isDelete>
<fields>
<field>
<id>CATEGOR_LEVEL_1</id>
<value>MR</value>
</field>
<field>
<id>LOW_PRODSERV</id>
<value>RES</value>
</field>
<field>
<id>LOW_LEVEL_2</id>
<value>keylevel221</value>
</field>
<field>
<id>LOW_LEVEL_3</id>
<value>keylevel3127</value>
</field>
<field>
<id>LOW_LEVEL_4</id>
<value>keylevel4434</value>
</field>
<field>
<id>LOW_LEVEL_5</id>
<value>keylevel5545</value>
</field>
</fields>
</record>
<record>
<Internalid>24349</Internalid>
<isDelete>false</isDelete>
<fields>
<field>
<id>CATEGOR_LEVEL_1</id>
<value>MR</value>
</field>
<field>
<id>LOW_PRODSERV</id>
<value>RES</value>
</field>
<field>
<id>LOW_LEVEL_2</id>
<value>keylevel221</value>
</field>
<field>
<id>LOW_LEVEL_3</id>
<value>keylevel3125</value>
</field>
<field>
<id>LOW_LEVEL_4</id>
<value>keylevel4268</value>
</field>
<field>
<id>LOW_LEVEL_5</id>
<value>keylevel5418</value>
</field>
</fields>
</record>
<record>
<Internalid>24350</Internalid>
<isDelete>false</isDelete>
<fields>
<field>
<id>CATEGOR_LEVEL_1</id>
<value>MR</value>
</field>
<field>
<id>LOW_PRODSERV</id>
<value>RES</value>
</field>
<field>
<id>LOW_LEVEL_2</id>
<value>keylevel221</value>
</field>
<field>
<id>LOW_LEVEL_3</id>
<value>keylevel3122</value>
</field>
<field>
<id>LOW_LEVEL_4</id>
<value>keylevel425</value>
</field>
<field>
<id>LOW_LEVEL_5</id>
<value>keylevel5221</value>
</field>
</fields>
</record>
</records>
</form>
<form>
<id>OTHER_INFOR</id>
<records>
<record>
<isDelete>false</isDelete>
<fields>
<field>
<id>S_EAST</id>
<value>N</value>
</field>
<field>
<id>W_EST</id>
<value>N</value>
</field>
<field>
<id>M_WEST</id>
<value>N</value>
</field>
<field>
<id>N_EAST</id>
<value>N</value>
</field>
<field>
<id>LOW_AREYOU_ASSET</id>
<value>-1</value>
</field>
<field>
<id>LOW_SWART_PROG</id>
<value>-1</value>
</field>
</fields>
</record>
</records>
</form>
<form>
<id>ABDCEDF</id>
<records>
<record>
<isDelete>false</isDelete>
<fields>
<field>
<id>LOW_COD_CONDUCT</id>
<value>-1</value>
</field>
</fields>
</record>
</records>
</form>
<form>
<id>CODDUC</id>
<records>
<record>
<isDelete>false</isDelete>
<fields>
<field>
<id>LOW_SUPPLIER_TYPE</id>
<value>2</value>
</field>
<field>
<id>LOW_DO_INT_BOTH</id>
<value>1</value>
</field>
</fields>
</record>
</records>
</form>
</forms>
</supplier>
</integrationEntityDetails>
</integration-outbound:IntegrationEntity>
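The goal is a generic XML-to-CSV conversion. Based on the input file, the XML should be flattened, split into multiple CSV files, and stored.
The input is the XML above and the config CSV file below. Three CSV files need to be created, using the corresponding XPATHs listed in the file.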
XPATH,ColumName,CSV_File_Name,ParentKey
/integration-outbound:IntegrationEntity/integrationEntityHeader/integrationTrackingNumber,integrationTrackingNumber,integrationEntityHeader.csv,
/integration-outbound:IntegrationEntity/integrationEntityHeader/referenceCodeForEntity,referenceCodeForEntity,integrationEntityHeader.csv,
/integration-outbound:IntegrationEntity/integrationEntityHeader/attachments/attachment[]/id,id,integrationEntityHeader.csv,
/integration-outbound:IntegrationEntity/integrationEntityHeader/attachments/attachment[]/name,name,integrationEntityHeader.csv,
/integration-outbound:IntegrationEntity/integrationEntityHeader/attachments/attachment[]/size,size,integrationEntityHeader.csv,
/integration-outbound:IntegrationEntity/integrationEntityHeader/source,source,integrationEntityHeader.csv,
/integration-outbound:IntegrationEntity/integrationEntityHeader/entity,entity,integrationEntityHeader.csv,
/integration-outbound:IntegrationEntity/integrationEntityHeader/action,action,integrationEntityHeader.csv,
/integration-outbound:IntegrationEntity/integrationEntityHeader/timestampUTC,timestampUTC,integrationEntityHeader.csv,
/integration-outbound:IntegrationEntity/integrationEntityHeader/zDocBaseVersion,zDocBaseVersion,integrationEntityHeader.csv,
/integration-outbound:IntegrationEntity/integrationEntityHeader/zDocCustomVersion,zDocCustomVersion,integrationEntityHeader.csv,
/integration-outbound:IntegrationEntity/integrationEntityHeader/integrationTrackingNumber,integrationTrackingNumber,integrationEntityDetailsControlBlock.csv,Y
/integration-outbound:IntegrationEntity/integrationEntityHeader/referenceCodeForEntity,referenceCodeForEntity,integrationEntityDetailsControlBlock.csv,Y
/integration-outbound:IntegrationEntity/integrationEntityDetails/supplier/requestId,requestId,integrationEntityDetailsControlBlock.csv,
/integration-outbound:IntegrationEntity/integrationEntityDetails/supplier/controlBlock/dataProcessingInfo[]/key,key,integrationEntityDetailsControlBlock.csv,
/integration-outbound:IntegrationEntity/integrationEntityDetails/supplier/controlBlock/dataProcessingInfo[]/value,value,integrationEntityDetailsControlBlock.csv,
/integration-outbound:IntegrationEntity/integrationEntityDetails/supplier/id,supplier_id,integrationEntityDetailsControlBlock.csv,
/integration-outbound:IntegrationEntity/integrationEntityDetails/supplier/forms/form[]/id,id,integrationEntityDetailsForms.csv,
/integration-outbound:IntegrationEntity/integrationEntityDetails/supplier/forms/form[]/records/record[]/Internalid,Internalid,integrationEntityDetailsForms.csv,
/integration-outbound:IntegrationEntity/integrationEntityDetails/supplier/forms/form[]/records/record[]/isDelete,FormId,integrationEntityDetailsForms.csv,
/integration-outbound:IntegrationEntity/integrationEntityDetails/supplier/forms/form[]/records/record[]/fields/field[]/id,SupplierFormRecordFieldId,integrationEntityDetailsForms.csv,
/integration-outbound:IntegrationEntity/integrationEntityDetails/supplier/forms/form[]/records/record[]/fields/field[]/value,SupplierFormRecordFieldValue,integrationEntityDetailsForms.csv,
/integration-outbound:IntegrationEntity/integrationEntityHeader/integrationTrackingNumber,integrationTrackingNumber,integrationEntityDetailsForms.csv,Y
/integration-outbound:IntegrationEntity/integrationEntityHeader/referenceCodeForEntity,referenceCodeForEntity,integrationEntityDetailsForms.csv,Y
/integration-outbound:IntegrationEntity/integrationEntityDetails/supplier/requestId,requestId,integrationEntityDetailsForms.csv,Y
/integration-outbound:IntegrationEntity/integrationEntityDetails/supplier/id,supplier_id,integrationEntityDetailsForms.csv,Y
I need to create 3 CSV files as output.
The design is to take each CSV file in turn, get its XPaths, and pick the corresponding values from the XML.
Step 1 - Convert the XML to JSON -
import json
import pandas as pd
import xmltodict

with open("/home/s0998hws/test.xml") as xml_file:
    data_dict = xmltodict.parse(xml_file.read())

# generate the object using json.dumps()
# corresponding to json data
json_data = json.dumps(data_dict)

# Write the json data to output
# json file
with open("data.json", "w") as json_file:
    json_file.write(json_data)

with open('data.json') as f:
    d = json.load(f)
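Step 2 - Normalize using the pandas json_normalize function. The XPath separator / is converted to . and [] is treated as the marker for a list, and from that the columns to extract from the JSON are built. That is, the code looks up /integration-outbound:IntegrationEntity/integrationEntityHeader/integrationTrackingNumber and converts it to .integrationEntityHeader.integrationTrackingNumber, and the first [] is exploded: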
df_1=pd.json_normalize(data=d['integration-outbound:IntegrationEntity'])
df_2=df_1[['integrationEntityHeader.integrationTrackingNumber','integrationEntityDetails.supplier.requestId','integrationEntityHeader.referenceCodeForEntity','integrationEntityDetails.supplier.id','integrationEntityDetails.supplier.forms.form']]
df_3=df_2.explode('integrationEntityDetails.supplier.forms.form')
df_3['integrationEntityDetails.supplier.forms.form.id']=df_3['integrationEntityDetails.supplier.forms.form'].apply(lambda x: x.get('id'))
df_3['integrationEntityDetails.supplier.forms.form.records']=df_3['integrationEntityDetails.supplier.forms.form'].apply(lambda x: x.get('records'))
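The XPath-to-column conversion described in Step 2 could be sketched roughly as follows. The function name xpath_to_column is hypothetical. The sketch assumes the root element is dropped (because json_normalize is called on the dict below it) and that column names carry no leading dot, matching the column names used in the code above. It returns the dotted column name together with the list of columns that would need an explode.

```python
def xpath_to_column(xpath, root):
    """Convert a config XPATH into (dotted column name, columns to explode)."""
    steps = [step for step in xpath.split("/") if step]
    # Drop the root element, since normalization starts below it
    if steps and steps[0] == root:
        steps = steps[1:]
    names = []
    explode = []
    for step in steps:
        is_list = step.endswith("[]")
        names.append(step[:-2] if is_list else step)
        if is_list:
            # Every [] marks a column that holds a list at this depth
            explode.append(".".join(names))
    return ".".join(names), explode


ROOT = "integration-outbound:IntegrationEntity"
column, to_explode = xpath_to_column(
    "/integration-outbound:IntegrationEntity/integrationEntityDetails"
    "/supplier/forms/form[]/records/record[]/Internalid",
    ROOT,
)
print(column)
print(to_explode)
```

The explode list is ordered from the outermost list to the innermost one, which is the order in which the columns would have to be exploded.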
I am trying to drive this from the metadata in the CSV file, but the challenge is that
df_3['integrationEntityDetails.supplier.forms.form.records.record.Internalid']=df_3['integrationEntityDetails.supplier.forms.form.records.record'].apply(lambda x: x.get('Internalid'))
fails with the error -
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/local/lib64/python3.6/site-packages/pandas/core/series.py", line 3848, in apply
mapped = lib.map_infer(values, f, convert=convert_dtype)
File "pandas/_libs/lib.pyx", line 2327, in pandas._libs.lib.map_infer
File "<stdin>", line 1, in <lambda>
AttributeError: 'list' object has no attribute 'get'
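The reason is that this column of the pandas DataFrame holds a list in some rows and a single dict in others, so the values cannot be fetched with the method above.
Below is the output generated: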
integrationEntityHeader.integrationTrackingNumber integrationEntityDetails.supplier.requestId integrationEntityHeader.referenceCodeForEntity integrationEntityDetails.supplier.id integrationEntityDetails.supplier.forms.form integrationEntityDetails.supplier.forms.form.id integrationEntityDetails.supplier.forms.form.records
0 281#999eb16e-242c-4239-b33e-ae6f5296fb15#10c7338c-ab63-4c1d-9d26-877ba40a4b4b#1583507840354 2614352 25428 1647059 {'id': 'CATEGORY_PRODSER', 'records': {'record': [{'Internalid': '24348', 'isDelete': 'false', 'fields': {'field': [{'id': 'CATEGOR_LEVEL_1', 'value': 'MR'}, {'id': 'LOW_PRODSERV', 'value': 'RES'}, {'id': 'LOW_LEVEL_2', 'value': 'keylevel221'}, {'id': 'LOW_LEVEL_3', 'value': 'keylevel3127'}, {'id': 'LOW_LEVEL_4', 'value': 'keylevel4434'}, {'id': 'LOW_LEVEL_5', 'value': 'keylevel5545'}]}}, {'Internalid': '24349', 'isDelete': 'false', 'fields': {'field': [{'id': 'CATEGOR_LEVEL_1', 'value': 'MR'}, {'id': 'LOW_PRODSERV', 'value': 'RES'}, {'id': 'LOW_LEVEL_2', 'value': 'keylevel221'}, {'id': 'LOW_LEVEL_3', 'value': 'keylevel3125'}, {'id': 'LOW_LEVEL_4', 'value': 'keylevel4268'}, {'id': 'LOW_LEVEL_5', 'value': 'keylevel5418'}]}}, {'Internalid': '24350', 'isDelete': 'false', 'fields': {'field': [{'id': 'CATEGOR_LEVEL_1', 'value': 'MR'}, {'id': 'LOW_PRODSERV', 'value': 'RES'}, {'id': 'LOW_LEVEL_2', 'value': 'keylevel221'}, {'id': 'LOW_LEVEL_3', 'value': 'keylevel3122'}, {'id': 'LOW_LEVEL_4', 'value': 'keylevel425'}, {'id': 'LOW_LEVEL_5', 'value': 'keylevel5221'}]}}]}} CATEGORY_PRODSER {'record': [{'Internalid': '24348', 'isDelete': 'false', 'fields': {'field': [{'id': 'CATEGOR_LEVEL_1', 'value': 'MR'}, {'id': 'LOW_PRODSERV', 'value': 'RES'}, {'id': 'LOW_LEVEL_2', 'value': 'keylevel221'}, {'id': 'LOW_LEVEL_3', 'value': 'keylevel3127'}, {'id': 'LOW_LEVEL_4', 'value': 'keylevel4434'}, {'id': 'LOW_LEVEL_5', 'value': 'keylevel5545'}]}}, {'Internalid': '24349', 'isDelete': 'false', 'fields': {'field': [{'id': 'CATEGOR_LEVEL_1', 'value': 'MR'}, {'id': 'LOW_PRODSERV', 'value': 'RES'}, {'id': 'LOW_LEVEL_2', 'value': 'keylevel221'}, {'id': 'LOW_LEVEL_3', 'value': 'keylevel3125'}, {'id': 'LOW_LEVEL_4', 'value': 'keylevel4268'}, {'id': 'LOW_LEVEL_5', 'value': 'keylevel5418'}]}}, {'Internalid': '24350', 'isDelete': 'false', 'fields': {'field': [{'id': 'CATEGOR_LEVEL_1', 
'value': 'MR'}, {'id': 'LOW_PRODSERV', 'value': 'RES'}, {'id': 'LOW_LEVEL_2', 'value': 'keylevel221'}, {'id': 'LOW_LEVEL_3', 'value': 'keylevel3122'}, {'id': 'LOW_LEVEL_4', 'value': 'keylevel425'}, {'id': 'LOW_LEVEL_5', 'value': 'keylevel5221'}]}}]}
0 281#999eb16e-242c-4239-b33e-ae6f5296fb15#10c7338c-ab63-4c1d-9d26-877ba40a4b4b#1583507840354 2614352 25428 1647059 {'id': 'OTHER_INFOR', 'records': {'record': {'isDelete': 'false', 'fields': {'field': [{'id': 'S_EAST', 'value': 'N'}, {'id': 'W_EST', 'value': 'N'}, {'id': 'M_WEST', 'value': 'N'}, {'id': 'N_EAST', 'value': 'N'}, {'id': 'LOW_AREYOU_ASSET', 'value': '-1'}, {'id': 'LOW_SWART_PROG', 'value': '-1'}]}}}} OTHER_INFOR {'record': {'isDelete': 'false', 'fields': {'field': [{'id': 'S_EAST', 'value': 'N'}, {'id': 'W_EST', 'value': 'N'}, {'id': 'M_WEST', 'value': 'N'}, {'id': 'N_EAST', 'value': 'N'}, {'id': 'LOW_AREYOU_ASSET', 'value': '-1'}, {'id': 'LOW_SWART_PROG', 'value': '-1'}]}}}
0 281#999eb16e-242c-4239-b33e-ae6f5296fb15#10c7338c-ab63-4c1d-9d26-877ba40a4b4b#1583507840354 2614352 25428 1647059 {'id': 'CORPORATESUSTAINABILITY', 'records': {'record': {'isDelete': 'false', 'fields': {'field': {'id': 'LOW_COD_CONDUCT', 'value': '-1'}}}}} CORPORATESUSTAINABILITY {'record': {'isDelete': 'false', 'fields': {'field': {'id': 'LOW_COD_CONDUCT', 'value': '-1'}}}}
0 281#999eb16e-242c-4239-b33e-ae6f5296fb15#10c7338c-ab63-4c1d-9d26-877ba40a4b4b#1583507840354 2614352 25428 1647059 {'id': 'PRODUCTSERVICES', 'records': {'record': {'isDelete': 'false', 'fields': {'field': [{'id': 'LOW_SUPPLIER_TYPE', 'value': '2'}, {'id': 'LOW_DO_INT_BOTH', 'value': '1'}]}}}} PRODUCTSERVICES {'record': {'isDelete': 'false', 'fields': {'field': [{'id': 'LOW_SUPPLIER_TYPE', 'value': '2'}, {'id': 'LOW_DO_INT_BOTH', 'value': '1'}]}}}
Expected output: integrationEntityDetailsForms.csv
integrationTrackingNumber requestId referenceCodeForEntity supplier.id integrationEntityDetails.supplier.forms.form.id InternalId isDelete SupplierFormRecordFieldId SupplierFormRecordFieldValue
281#963-4c1d-9d26-877ba40a4b4b#1583507840354 2614352 25428 1647059 CATEGORY_PRODSER 24348 FALSE CATEGOR_LEVEL_1 MR
281#963-4c1d-9d26-877ba40a4b4b#1583507840354 2614352 25428 1647059 CATEGORY_PRODSER 24348 FALSE LOW_PRODSERV RES
281#963-4c1d-9d26-877ba40a4b4b#1583507840354 2614352 25428 1647059 CATEGORY_PRODSER 24348 FALSE LOW_LEVEL_2 keylevel221
281#963-4c1d-9d26-877ba40a4b4b#1583507840354 2614352 25428 1647059 CATEGORY_PRODSER 24348 FALSE LOW_LEVEL_3 keylevel3127
281#963-4c1d-9d26-877ba40a4b4b#1583507840354 2614352 25428 1647059 CATEGORY_PRODSER 24348 FALSE LOW_LEVEL_4 keylevel4434
281#963-4c1d-9d26-877ba40a4b4b#1583507840354 2614352 25428 1647059 CATEGORY_PRODSER 24348 FALSE LOW_LEVEL_5 keylevel5545
281#963-4c1d-9d26-877ba40a4b4b#1583507840354 2614352 25428 1647059 CATEGORY_PRODSER 24350 FALSE CATEGOR_LEVEL_1 MR
281#963-4c1d-9d26-877ba40a4b4b#1583507840354 2614352 25428 1647059 CATEGORY_PRODSER 24350 FALSE LOW_PRODSERV RES
281#963-4c1d-9d26-877ba40a4b4b#1583507840354 2614352 25428 1647059 CATEGORY_PRODSER 24350 FALSE LOW_LEVEL_2 keylevel221
281#963-4c1d-9d26-877ba40a4b4b#1583507840354 2614352 25428 1647059 CATEGORY_PRODSER 24350 FALSE LOW_LEVEL_3 keylevel3122
281#963-4c1d-9d26-877ba40a4b4b#1583507840354 2614352 25428 1647059 CATEGORY_PRODSER 24350 FALSE LOW_LEVEL_4 keylevel425
281#963-4c1d-9d26-877ba40a4b4b#1583507840354 2614352 25428 1647059 CATEGORY_PRODSER 24350 FALSE LOW_LEVEL_5 keylevel5221
281#963-4c1d-9d26-877ba40a4b4b#1583507840354 2614352 25428 1647059 OTHER_INFOR FALSE S_EAST N
281#963-4c1d-9d26-877ba40a4b4b#1583507840354 2614352 25428 1647059 OTHER_INFOR FALSE W_EST N
281#963-4c1d-9d26-877ba40a4b4b#1583507840354 2614352 25428 1647059 OTHER_INFOR FALSE M_WEST N
281#963-4c1d-9d26-877ba40a4b4b#1583507840354 2614352 25428 1647059 OTHER_INFOR FALSE N_EAST N
281#963-4c1d-9d26-877ba40a4b4b#1583507840354 2614352 25428 1647059 OTHER_INFOR FALSE LOW_AREYOU_ASSET -1
281#963-4c1d-9d26-877ba40a4b4b#1583507840354 2614352 25428 1647059 CORPORATESUSTAINABILITY FALSE LOW_SWART_PROG -1
281#963-4c1d-9d26-877ba40a4b4b#1583507840354 2614352 25428 1647059 CORPORATESUSTAINABILITY FALSE LOW_COD_CONDUCT -1
281#963-4c1d-9d26-877ba40a4b4b#1583507840354 2614352 25428 1647059 PRODUCTSERVICES FALSE LOW_SUPPLIER_TYPE 2
281#963-4c1d-9d26-877ba40a4b4b#1583507840354 2614352 25428 1647059 PRODUCTSERVICES FALSE LOW_DO_INT_BOTH 1
I think this line is missing from the question:
df_3['integrationEntityDetails.supplier.forms.form.records.record'] = (
    df_3['integrationEntityDetails.supplier.forms.form.records'].apply(
        lambda x: x.get('record')
    )
)
Then, for Internalid, you can do this:
df_3['integrationEntityDetails.supplier.forms.form.records.record.Internalid'] = (
    df_3['integrationEntityDetails.supplier.forms.form.records.record'].apply(
        lambda x: x[0].get('Internalid') if isinstance(x, list) else x.get('Internalid')
    )
)
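Note that x[0] keeps only the first record when a form has several. A minimal plain-Python sketch that keeps every record and every field is shown below. It assumes the dict layout produced by xmltodict in the question; as_list and form_rows are hypothetical helper names, and the sample data is a trimmed copy of the question's forms.

```python
def as_list(value):
    """Return [] for None, a list unchanged, and anything else as a one-item list."""
    if value is None:
        return []
    return value if isinstance(value, list) else [value]


def form_rows(forms):
    """Yield (form id, Internalid, isDelete, field id, field value) tuples."""
    for form in as_list(forms):
        for record in as_list((form.get("records") or {}).get("record")):
            for field in as_list((record.get("fields") or {}).get("field")):
                yield (form.get("id"), record.get("Internalid"),
                       record.get("isDelete"), field.get("id"), field.get("value"))


# Two forms: in the first, 'record' is a list; in the second, it is a single dict
forms = [
    {"id": "CATEGORY_PRODSER", "records": {"record": [
        {"Internalid": "24348", "isDelete": "false",
         "fields": {"field": [{"id": "CATEGOR_LEVEL_1", "value": "MR"},
                              {"id": "LOW_PRODSERV", "value": "RES"}]}},
        {"Internalid": "24349", "isDelete": "false",
         "fields": {"field": {"id": "CATEGOR_LEVEL_1", "value": "MR"}}},
    ]}},
    {"id": "ABDCEDF", "records": {"record": {
        "isDelete": "false",
        "fields": {"field": {"id": "LOW_COD_CONDUCT", "value": "-1"}}}}},
]

result = list(form_rows(forms))
for row in result:
    print(row)
```

In pandas the same idea would be to apply a helper like as_list to the column and then call explode on it, so that every row holds exactly one dict before .get is used.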
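Consider XSLT, the special-purpose language designed to transform XML files, for example by flattening them at certain sections. Python's third-party module, lxml, can run XSLT 1.0 scripts and XPath 1.0 expressions.
Specifically, XSLT can handle your XPath extractions. Then, from the single transformed result tree, build the three data frames you need. For well-formedness, the code below assumes the following root and data structure: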
<integration-outbound:IntegrationEntity
xmlns:integration-outbound="http://example.com"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
...same content...
</integration-outbound:IntegrationEntity>
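To see why the extra namespace declaration is needed, here is a small check. It uses the standard library's ElementTree rather than lxml, purely to keep the sketch free of third-party dependencies, and http://example.com is the same placeholder URI as in the root element above.

```python
import xml.etree.ElementTree as ET

# Root element with a prefix that is never declared, as in the original sample
undeclared = ('<integration-outbound:IntegrationEntity><a>1</a>'
              '</integration-outbound:IntegrationEntity>')

# Same element with the prefix bound to a (placeholder) namespace URI
declared = ('<integration-outbound:IntegrationEntity '
            'xmlns:integration-outbound="http://example.com"><a>1</a>'
            '</integration-outbound:IntegrationEntity>')

try:
    ET.fromstring(undeclared)
    parsed_undeclared = True
except ET.ParseError:
    # The prefix is not bound to any namespace, so the parser rejects it
    parsed_undeclared = False

root = ET.fromstring(declared)
print(parsed_undeclared)
print(root.tag)
```

Once the prefix is declared, the root is exposed under its namespace URI, which is why the stylesheet has to declare the same URI for its match patterns to apply.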
XSLT (save as .xsl, a special .xml file)
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
xmlns:integration-outbound="http://example.com"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
<xsl:output method="xml" omit-xml-declaration="yes" indent="yes"/>
<xsl:strip-space elements="*"/>
<xsl:template match="integration-outbound:IntegrationEntity">
<data>
<xsl:apply-templates select="integrationEntityHeader/descendant::attachment"/>
<xsl:apply-templates select="integrationEntityDetails/descendant::dataProcessingInfo"/>
<xsl:apply-templates select="integrationEntityDetails/descendant::forms/descendant::field"/>
</data>
</xsl:template>
<xsl:template match="attachment">
<integrationEntityHeader>
<xsl:copy-of select="ancestor::integrationEntityHeader/*[name()!='attachments']"/>
<xsl:copy-of select="*"/>
</integrationEntityHeader>
</xsl:template>
<xsl:template match="dataProcessingInfo">
<integrationEntityDetailsControlBlock>
<xsl:copy-of select="ancestor::integration-outbound:IntegrationEntity/integrationEntityHeader/*[position() &lt;= 2]"/>
<requestId><xsl:value-of select="ancestor::supplier/requestId"/></requestId>
<supplier_id><xsl:value-of select="ancestor::supplier/id"/></supplier_id>
<xsl:copy-of select="*"/>
</integrationEntityDetailsControlBlock>
</xsl:template>
<xsl:template match="field">
<integrationEntityDetailsForms>
<form_id><xsl:value-of select="ancestor::form/id"/></form_id>
<xsl:copy-of select="ancestor::record/*[name()!='fields']"/>
<SupplierFormRecordFieldId><xsl:value-of select="id"/></SupplierFormRecordFieldId>
<SupplierFormRecordFieldValue><xsl:value-of select="value"/></SupplierFormRecordFieldValue>
<xsl:copy-of select="ancestor::integration-outbound:IntegrationEntity/integrationEntityHeader/*[position() &lt;= 2]"/>
<requestId><xsl:value-of select="ancestor::supplier/requestId"/></requestId>
<supplier_id><xsl:value-of select="ancestor::supplier/id"/></supplier_id>
</integrationEntityDetailsForms>
</xsl:template>
</xsl:stylesheet>
Python
import lxml.etree as et
import pandas as pd
# LOAD XML AND XSL
doc = et.parse('Input.xml')
style = et.parse('Script.xsl')
# INITIALIZE AND RUN TRANSFORMATION
transformer = et.XSLT(style)
flat_doc = transformer(doc)
# BUILD THREE DATA FRAMES
df_header = pd.DataFrame([{i.tag:i.text for i in el}
for el in flat_doc.xpath('integrationEntityHeader')])
df_detailsControlBlock = pd.DataFrame([{i.tag:i.text for i in el}
for el in flat_doc.xpath('integrationEntityDetailsControlBlock')])
df_detailsForms = pd.DataFrame([{i.tag:i.text for i in el}
for el in flat_doc.xpath('integrationEntityDetailsForms')])
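For comparison, the effect of the attachment template can be approximated without XSLT. The sketch below swaps in the standard library's ElementTree and csv modules for lxml and pandas, and runs on a trimmed copy of the header; it is an illustration of the flattening idea, not a replacement for the stylesheet.

```python
import csv
import io
import xml.etree.ElementTree as ET

xml_text = """<integrationEntityHeader>
  <referenceCodeForEntity>25428</referenceCodeForEntity>
  <attachments>
    <attachment><id>d6esd1d518b06019e01</id><name>durance.pdf</name><size>0</size></attachment>
    <attachment><id>182e60164ddd4236b5bd96109</id><name>ssds</name><size>0</size></attachment>
  </attachments>
  <source>SIM</source>
</integrationEntityHeader>"""

header = ET.fromstring(xml_text)

# Header-level values shared by every row (everything except <attachments>)
shared = {child.tag: child.text for child in header if child.tag != "attachments"}

# One row per <attachment>, with the shared header values copied in
rows = [{**shared, **{child.tag: child.text for child in attachment}}
        for attachment in header.iter("attachment")]

# Write the rows as CSV text; a real script would write to a file instead
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=list(rows[0]), lineterminator="\n")
writer.writeheader()
writer.writerows(rows)
csv_text = buffer.getvalue()
print(csv_text)
```

The XSLT route scales better once the three outputs need different ancestor lookups, since each template states its own context; the sketch only mirrors the simplest of the three.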
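Convert the XML to a dict and then write the parsing logic, since the same logic applies to JSON. Stack Overflow has been amazingly helpful, and this solution is built from the responses in all of these links. For simplicity I created a 3-level nested XML. This works on Python 3.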
<?xml version="1.0"?><Company><Employee><FirstName>Hal</FirstName><LastName>Thanos</LastName><ContactNo>122131</ContactNo><Email>hal.thanos@xyz.com</Email><Addresses><Address><City>Bangalore</City><State>Karnataka</State><Zip>560212</Zip><forms><form><id>ID1</id><value>LIC</value></form><form><id>ID2</id><value>PAS</value></form></forms></Address></Addresses></Employee><Employee><FirstName>Iron</FirstName><LastName>Man</LastName><ContactNo>12324</ContactNo><Email>iron.man@xyz.com</Email><Addresses><Address><type>Permanent</type><City>Bangalore</City><State>Karnataka</State><Zip>560212</Zip><forms><form><id>ID3</id><value>LIC</value></form></forms></Address><Address><type>Temporary</type><City>Concord</City><State>NC</State><Zip>28027</Zip><forms><form><id>ID1</id><value>LIC</value></form><form><id>ID2</id><value>PAS</value></form><form><id>ID3</id><value>SSN</value></form><form><id>ID2</id><value>CC</value></form></forms></Address></Addresses></Employee></Company>
<?xml version="1.0"?><Company><Employee><FirstName>Captain</FirstName><LastName>America</LastName><ContactNo>13322</ContactNo><Email>captain.america@xyz.com</Email><Addresses><Address><City>Trivandrum</City><State>Kerala</State><Zip>28115</Zip><forms><form><id>ID1</id><value>LIC</value></form><form><id>ID2</id><value>PAS</value></form></forms></Address></Addresses></Employee><Employee><FirstName>Sword</FirstName><LastName>Man</LastName><ContactNo>12324</ContactNo><Email>sword.man@xyz.com</Email><Addresses><Address><type>Permanent</type><City>Bangalore</City><State>Karnataka</State><Zip>560212</Zip><forms><form><id>ID3</id><value>LIC</value></form></forms></Address><Address><type>Temporary</type><City>Concord</City><State>NC</State><Zip>28027</Zip><forms><form><id>ID1</id><value>LIC</value></form><form><id>ID2</id><value>PAS</value></form><form><id>ID3</id><value>SSN</value></form><form><id>ID2</id><value>CC</value></form></forms></Address></Addresses></Employee></Company>
<?xml version="1.0"?><Company><Employee><FirstName>Thor</FirstName><LastName>Odison</LastName><ContactNo>156565</ContactNo><Email>thor.odison@xyz.com</Email><Addresses><Address><City>Tirunelveli</City><State>TamilNadu</State><Zip>36595</Zip><forms><form><id>ID1</id><value>LIC</value></form><form><id>ID2</id><value>PAS</value></form></forms></Address></Addresses></Employee><Employee><FirstName>Spider</FirstName><LastName>Man</LastName><ContactNo>12324</ContactNo><Email>spider.man@xyz.com</Email><Addresses><Address><type>Permanent</type><City>Bangalore</City><State>Karnataka</State><Zip>560212</Zip><forms><form><id>ID3</id><value>LIC</value></form></forms></Address><Address><type>Temporary</type><City>Concord</City><State>NC</State><Zip>28027</Zip><forms><form><id>ID1</id><value>LIC</value></form><form><id>ID2</id><value>PAS</value></form><form><id>ID3</id><value>SSN</value></form><form><id>ID2</id><value>CC</value></form></forms></Address></Addresses></Employee></Company>
<?xml version="1.0"?><Company><Employee><FirstName>Black</FirstName><LastName>Widow</LastName><ContactNo>16767</ContactNo><Email>black.widow@xyz.com</Email><Addresses><Address><City>Mysore</City><State>Karnataka</State><Zip>12478</Zip><forms><form><id>ID1</id><value>LIC</value></form></forms></Address></Addresses></Employee><Employee><FirstName>White</FirstName><LastName>Man</LastName><ContactNo>5634</ContactNo><Email>white.man@xyz.com</Email><Addresses><Address><type>Permanent</type><City>Bangalore</City><State>Karnataka</State><Zip>560212</Zip><forms><form><id>ID3</id><value>LIC</value></form></forms></Address><Address><type>Temporary</type><City>Concord</City><State>NC</State><Zip>28027</Zip><forms><form><id>ID1</id><value>LIC</value></form><form><id>ID2</id><value>PAS</value></form><form><id>ID3</id><value>SSN</value></form><form><id>ID2</id><value>CC</value></form></forms></Address></Addresses></Employee></Company>
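The config file for this XML is shown below. Every element that can repeat (an array, a nested level, anything that has to be exploded) must be marked with [] in the XPATH. The header row is required, because the code looks columns up by name.
Change these variables to match your setup: process_config_csv = 'config.csv' and xml_file_name = 'test.xml'. Note that test.xml holds one complete XML document per line, as in the sample above.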
XPATH,ColumName,CSV_File_Name
/Company/Employee[]/FirstName,FirstName,Name.csv
/Company/Employee[]/LastName,LastName,Name.csv
/Company/Employee[]/ContactNo,ContactNo,Name.csv
/Company/Employee[]/Email,Email,Name.csv
/Company/Employee[]/FirstName,FirstName,Address.csv
/Company/Employee[]/LastName,LastName,Address.csv
/Company/Employee[]/ContactNo,ContactNo,Address.csv
/Company/Employee[]/Email,Email,Address.csv
/Company/Employee[]/Addresses/Address[]/City,City,Address.csv
/Company/Employee[]/Addresses/Address[]/State,State,Address.csv
/Company/Employee[]/Addresses/Address[]/Zip,Zip,Address.csv
/Company/Employee[]/Addresses/Address[]/type,type,Address.csv
/Company/Employee[]/FirstName,FirstName,Form.csv
/Company/Employee[]/LastName,LastName,Form.csv
/Company/Employee[]/ContactNo,ContactNo,Form.csv
/Company/Employee[]/Email,Email,Form.csv
/Company/Employee[]/Addresses/Address[]/type,type,Form.csv
/Company/Employee[]/Addresses/Address[]/forms/form[]/id,id,Form.csv
/Company/Employee[]/Addresses/Address[]/forms/form[]/value,value,Form.csv
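To make the role of the [] markers concrete, here is a minimal sketch of how the config rows can be turned into dotted column names and explode levels. It mirrors the string handling used in the answer's code, but the function name explode_levels and the inline CONFIG sample are only illustrative assumptions, not part of the original script.

```python
import csv
import io
from collections import defaultdict

# Three rows copied from the config above (Form.csv only), used as sample input.
CONFIG = """XPATH,ColumName,CSV_File_Name
/Company/Employee[]/FirstName,FirstName,Form.csv
/Company/Employee[]/Addresses/Address[]/type,type,Form.csv
/Company/Employee[]/Addresses/Address[]/forms/form[]/id,id,Form.csv
"""


def explode_levels(config_text, csv_name):
    """Return (dotted column names, {level: [columns to explode]}) for one output file."""
    columns = []
    levels = defaultdict(set)
    for row in csv.DictReader(io.StringIO(config_text)):
        if row["CSV_File_Name"] != csv_name:
            continue
        dotted = row["XPATH"].strip("/").replace("/", ".")
        # Column name as pd.json_normalize would spell it, without the [] markers.
        columns.append(dotted.replace("[]", ""))
        parts = dotted.split("[]")
        if len(parts) > 1:
            # The number of [] markers is the nesting level; the text before
            # the last marker is the column that has to be exploded there.
            levels[len(parts) - 1].add("".join(parts[:-1]))
    return columns, {level: sorted(cols) for level, cols in levels.items()}


columns, levels = explode_levels(CONFIG, "Form.csv")
print(columns)
print(levels)
```

With this sample, level 1 is Company.Employee, level 2 is Company.Employee.Addresses.Address, and level 3 is Company.Employee.Addresses.Address.forms.form, which is the order in which the columns are exploded.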
The code that creates the multiple CSV files from the config file is:
import csv
import os
from collections import defaultdict

import numpy as np
import pandas as pd
import xmltodict
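

def getMatches(L1, L2):
    # For every name in L1 that contains a pattern from L2, keep the name
    # truncated to one path segment past the pattern.
    R = set()
    for elm in L1:
        for pat in L2:
            if elm.find(pat) != -1:
                if elm.find('.', len(pat)+1) != -1:
                    R.add(elm[:elm.find('.', len(pat)+1)])
                else:
                    R.add(elm)
    return list(R)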


def xml_parse(xml_file_name):
    try:
        process_xml_file = xml_file_name
        # The input file holds one complete XML document per line.
        with open(process_xml_file) as xml_file:
            for xml_string in xml_file:
                # Convert the XML to a dict
                data_dict = xmltodict.parse(xml_string)
                # Convert the dict to a pandas DataFrame
                df_processing = pd.json_normalize(data_dict)
                xml_parse_loop(df_processing)
    except Exception as e:
        s = str(e)
        print(s)


def xml_parse_loop(df_processing_input):
    CSV_File_Name = []
    # Get the list of CSV files to be created
    with open(process_config_csv, newline='') as csvfile:
        DataCaptured = csv.DictReader(csvfile)
        for row in DataCaptured:
            if row['CSV_File_Name'] not in CSV_File_Name:
                CSV_File_Name.append(row['CSV_File_Name'])
    # Iterate over the list of CSV files
    for items in CSV_File_Name:
        df_processing = df_processing_input
        df_subset_process = []
        df_subset_list_all_cols = []
        df_process_sub_explode_Level = []
        df_final_column_name = []
        print('Parsing the xml file for creating the file - ' + str(items))
        # Fetch the field list for this file from the config file
        with open(process_config_csv, newline='') as csvfile:
            DataCaptured = csv.DictReader(csvfile)
            for row in DataCaptured:
                if row['CSV_File_Name'] in items:
                    df_final_column_name.append(row['ColumName'])
                    # Column name up to the first []
                    df_subset_process.append(row['XPATH'].strip('/').replace("/", ".").split('[]')[0])
                    # Full column name
                    df_subset_list_all_cols.append(row['XPATH'].strip('/').replace("/", ".").replace("[]", ""))
                    # Path split at every [] (the columns to explode)
                    df_process_sub_explode_Level.append(row['XPATH'].strip('/').replace('/', '.').split('[]'))
        explode_ld = defaultdict(set)
        # Map each explode level to the column names exploded at that level
        for x in df_process_sub_explode_Level:
            if len(x) > 1:
                explode_ld[len(x) - 1].add(''.join(x[: -1]))
        explode_ld = {k: list(v) for k, v in explode_ld.items()}
        # Remove duplicate columns
        df_subset_process = list(dict.fromkeys(df_subset_process))
        for col in df_subset_process:
            if col not in df_processing.columns:
                df_processing[col] = np.nan
        df_processing = df_processing[df_subset_process]
        df_processing_col_list = df_processing.columns.tolist()
        print('The total levels to be exploded : %d' % len(explode_ld))
        level = len(explode_ld)
        for i in range(level):
            print(' Exploding the Level : %d' % i)
            df_processing_col_list = df_processing.columns.tolist()
            list_of_explode = set(df_processing_col_list) & set(explode_ld[i + 1])
            # Explode only the rows that hold a list: an element that occurs
            # once in the XML arrives as a plain dict, not a list.
            for c in list_of_explode:
                print(' There are column present which needs to be exploded - ' + str(c))
                is_list = [isinstance(item, list) for item in df_processing[c]]
                not_list = [not flag for flag in is_list]
                df_processing = pd.concat((df_processing.iloc[is_list].explode(c), df_processing.iloc[not_list]))
            print(' Finding the columns need to be fetched ')
            # From the overall column list, fetch the attributes under the exploded columns
            next_level_pro_lst = getMatches(df_subset_list_all_cols, explode_ld[i + 1])
            df_processing_col_list = df_processing.columns.tolist()
            for nex in next_level_pro_lst:
                parent_col = nex.rsplit('.', 1)[0]
                child_col = nex.rsplit('.', 1)[1]
                if parent_col not in df_processing_col_list:
                    df_processing[parent_col] = ""
                try:
                    df_processing[nex] = df_processing[parent_col].apply(lambda x: x.get(child_col))
                except AttributeError:
                    df_processing[nex] = ""
            df_processing_col_list = df_processing.columns.tolist()
            if i == level-1:
                print('Last Level nothing to be done')
            else:
                # Extract columns until the next column to explode is found
                while len(set(df_processing_col_list) & set(explode_ld[i + 2])) == 0:
                    next_level_pro_lst = getMatches(df_subset_list_all_cols, next_level_pro_lst)
                    for nextval in next_level_pro_lst:
                        if nextval not in df_processing_col_list:
                            if nextval.rsplit('.', 1)[0] not in df_processing.columns:
                                df_processing[nextval.rsplit('.', 1)[0]] = ""
                            try:
                                df_processing[nextval] = df_processing[nextval.rsplit('.', 1)[0]].apply(lambda x: x.get(nextval.rsplit('.', 1)[1]))
                            except AttributeError:
                                df_processing[nextval] = ""
                    df_processing_col_list = df_processing.columns.tolist()
        df_processing = df_processing[df_subset_list_all_cols]
        df_processing.columns = df_final_column_name
        # If the file does not exist, write it with a header
        if not os.path.isfile(items):
            print("The file does not exist, so writing a new one")
            df_processing.to_csv('{}'.format(items), header=True, index=None)
        else:  # otherwise append without writing the header
            print("The file exists, so appending")
            df_processing.to_csv('{}'.format(items), mode='a', header=False, index=None)


from datetime import datetime

startTime = datetime.now().strftime("%Y%m%d_%H%M%S")
startTime = str(os.getpid()) + "_" + startTime
process_task_name = ''
process_config_csv = 'config.csv'
xml_file_name = 'test.xml'
old_print = print


def timestamped_print(*args, **kwargs):
    # Prefix every log line with a timestamp
    now = datetime.now().strftime("%Y-%m-%d %H:%M:%S.%f")
    printheader = now + " xml_parser " + " " + process_task_name + " - "
    old_print(printheader, *args, **kwargs)


print = timestamped_print
xml_parse(xml_file_name)
The output created is:
$ cat Name.csv
FirstName,LastName,ContactNo,Email
Hal,Thanos,122131,hal.thanos@xyz.com
Iron,Man,12324,iron.man@xyz.com
Captain,America,13322,captain.america@xyz.com
Sword,Man,12324,sword.man@xyz.com
Thor,Odison,156565,thor.odison@xyz.com
Spider,Man,12324,spider.man@xyz.com
Black,Widow,16767,black.widow@xyz.com
White,Man,5634,white.man@xyz.com
$ cat Address.csv
FirstName,LastName,ContactNo,Email,City,State,Zip,type
Iron,Man,12324,iron.man@xyz.com,Bangalore,Karnataka,560212,Permanent
Iron,Man,12324,iron.man@xyz.com,Concord,NC,28027,Temporary
Hal,Thanos,122131,hal.thanos@xyz.com,Bangalore,Karnataka,560212,
Sword,Man,12324,sword.man@xyz.com,Bangalore,Karnataka,560212,Permanent
Sword,Man,12324,sword.man@xyz.com,Concord,NC,28027,Temporary
Captain,America,13322,captain.america@xyz.com,Trivandrum,Kerala,28115,
Spider,Man,12324,spider.man@xyz.com,Bangalore,Karnataka,560212,Permanent
Spider,Man,12324,spider.man@xyz.com,Concord,NC,28027,Temporary
Thor,Odison,156565,thor.odison@xyz.com,Tirunelveli,TamilNadu,36595,
White,Man,5634,white.man@xyz.com,Bangalore,Karnataka,560212,Permanent
White,Man,5634,white.man@xyz.com,Concord,NC,28027,Temporary
Black,Widow,16767,black.widow@xyz.com,Mysore,Karnataka,12478,
$ cat Form.csv
FirstName,LastName,ContactNo,Email,type,id,value
Iron,Man,12324,iron.man@xyz.com,Temporary,ID1,LIC
Iron,Man,12324,iron.man@xyz.com,Temporary,ID2,PAS
Iron,Man,12324,iron.man@xyz.com,Temporary,ID3,SSN
Iron,Man,12324,iron.man@xyz.com,Temporary,ID2,CC
Hal,Thanos,122131,hal.thanos@xyz.com,,ID1,LIC
Hal,Thanos,122131,hal.thanos@xyz.com,,ID2,PAS
Iron,Man,12324,iron.man@xyz.com,Permanent,ID3,LIC
Sword,Man,12324,sword.man@xyz.com,Temporary,ID1,LIC
Sword,Man,12324,sword.man@xyz.com,Temporary,ID2,PAS
Sword,Man,12324,sword.man@xyz.com,Temporary,ID3,SSN
Sword,Man,12324,sword.man@xyz.com,Temporary,ID2,CC
Captain,America,13322,captain.america@xyz.com,,ID1,LIC
Captain,America,13322,captain.america@xyz.com,,ID2,PAS
Sword,Man,12324,sword.man@xyz.com,Permanent,ID3,LIC
Spider,Man,12324,spider.man@xyz.com,Temporary,ID1,LIC
Spider,Man,12324,spider.man@xyz.com,Temporary,ID2,PAS
Spider,Man,12324,spider.man@xyz.com,Temporary,ID3,SSN
Spider,Man,12324,spider.man@xyz.com,Temporary,ID2,CC
Thor,Odison,156565,thor.odison@xyz.com,,ID1,LIC
Thor,Odison,156565,thor.odison@xyz.com,,ID2,PAS
Spider,Man,12324,spider.man@xyz.com,Permanent,ID3,LIC
White,Man,5634,white.man@xyz.com,Temporary,ID1,LIC
White,Man,5634,white.man@xyz.com,Temporary,ID2,PAS
White,Man,5634,white.man@xyz.com,Temporary,ID3,SSN
White,Man,5634,white.man@xyz.com,Temporary,ID2,CC
White,Man,5634,white.man@xyz.com,Permanent,ID3,LIC
Black,Widow,16767,black.widow@xyz.com,,ID1,LIC
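The row order in the files above (Iron before Hal, for example) follows from how the code explodes a column: xmltodict gives a list when a tag repeats and a plain dict when it occurs once, so the rows holding lists are exploded first and the remaining rows are concatenated after them. A minimal sketch of that step, using a small hand-built DataFrame as assumed input rather than the real parsed XML:

```python
import pandas as pd

# Hypothetical frame shaped like xmltodict output: a dict for a single
# <Address>, a list of dicts for repeated <Address> elements.
df = pd.DataFrame({
    "FirstName": ["Hal", "Iron"],
    "Address": [
        {"City": "Bangalore"},
        [{"City": "Bangalore"}, {"City": "Concord"}],
    ],
})

# Explode only the rows that hold a list, then append the rows that do not.
is_list = df["Address"].apply(lambda v: isinstance(v, list))
exploded = pd.concat([df[is_list].explode("Address"), df[~is_list]])

# Every cell is now a single dict, so a child field can be pulled out.
exploded["City"] = exploded["Address"].apply(lambda d: d.get("City"))

result = exploded[["FirstName", "City"]].values.tolist()
print(result)
# → [['Iron', 'Bangalore'], ['Iron', 'Concord'], ['Hal', 'Bangalore']]
```

If the original record order matters, sorting on the index after the concat (for instance with sort_index(kind="stable")) is one possible adjustment; the answer's code does not do this.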
These snippets and answers were drawn from several different threads; thanks to @Mark Tolonen, @Mandy007, and @deadshot:
Create a dict of list using python from csv
https://whosebug.com/questions/62837949/extract-a-list-from-a-list
How to explode Panda column with data having different dict and list of dict
This can definitely be made shorter and more performant, and it can be enhanced further.
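As one possible direction for a shorter version, here is a sketch that produces the Form.csv rows directly with the standard library's xml.etree.ElementTree instead of xmltodict and pandas. This is a different technique from the config-driven code above: the paths are hard-coded, and the function name form_rows and the SAMPLE document are assumptions made for illustration.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Cut-down sample with the same shape as one line of test.xml.
SAMPLE = (
    '<?xml version="1.0"?><Company><Employee>'
    '<FirstName>Hal</FirstName><LastName>Thanos</LastName>'
    '<ContactNo>122131</ContactNo><Email>hal.thanos@xyz.com</Email>'
    '<Addresses><Address><City>Bangalore</City><forms>'
    '<form><id>ID1</id><value>LIC</value></form>'
    '<form><id>ID2</id><value>PAS</value></form>'
    '</forms></Address></Addresses></Employee></Company>'
)

PERSON_TAGS = ("FirstName", "LastName", "ContactNo", "Email")


def form_rows(xml_string):
    """Yield one flat row per <form>, repeating the employee and address fields."""
    root = ET.fromstring(xml_string)
    for emp in root.iter("Employee"):
        person = [emp.findtext(tag, default="") for tag in PERSON_TAGS]
        for addr in emp.iter("Address"):
            # A missing <type> becomes an empty cell, as in the output above.
            addr_type = addr.findtext("type", default="")
            for form in addr.iter("form"):
                yield person + [
                    addr_type,
                    form.findtext("id", default=""),
                    form.findtext("value", default=""),
                ]


rows = list(form_rows(SAMPLE))
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(list(PERSON_TAGS) + ["type", "id", "value"])
writer.writerows(rows)
print(buffer.getvalue())
```

The nested loops play the role of the explode levels: each loop repeats the parent fields for every child element, so no separate list-versus-dict handling is needed. The trade-off is that the layout is no longer driven by a config file.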
Spark to the rescue!
The following code is written in Scala, but you can easily convert it to Python if you prefer.
Databricks' XML library makes XML processing easy.
val headers = spark.read.format("xml").option("rowTag", "integrationEntityHeader").load("WhosebugRafaXML.xml")
headers.write.csv(<headerFilename>) // Create CSV from the header file
val details = spark.read.format("xml").option("rowTag", "integrationEntityDetails").load("WhosebugRafaXML.xml")
// The details need further unnesting. To get suppliers, for instance, you can do
val supplier = spark.read.format("xml").option("rowTag", "supplier").load("WhosebugRafaXML.xml")
supplier.show
+--------------------+--------------------+--------------------+--------------------+--------------------+------------+--------------------+-------+--------------------+---------+------+------------+----------+---------------------+
| allLocations| bankDetails| companyDetails| contactDetails| controlBlock|facilityCode| forms| id| myLocation|requestId|status|supplierType|systemCode|systemFacilityDetails|
+--------------------+--------------------+--------------------+--------------------+--------------------+------------+--------------------+-------+--------------------+---------+------+------------+----------+---------------------+
|[[HQ, 2501 GRANT ...|[[[[LOW_BANK_KEY,...|[No, SUPPLIER, 25...|[[[1704312142, SI...|[[[MODE, Onboardi...| 1|[[[CATEGORY_PRODS...|1647059|[[1704342, false,...| 2614352|ACTIVE| Operational| 1| [[ACTIVE, 1, 1]]|
+--------------------+--------------------+--------------------+--------------------+--------------------+------------+--------------------+-------+--------------------+---------+------+------------+----------+---------------------+
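For readers who prefer Python, the read step might look like the sketch below in PySpark. It assumes an existing SparkSession passed in as spark with an XML data source available (for example the Databricks spark-xml package on the classpath); the helper name read_row_tag is hypothetical and only wraps the same format/option/load chain as the Scala code.

```python
def read_row_tag(spark, path, row_tag):
    """Load every <row_tag> element of the XML file at `path` as one DataFrame row.

    `spark` is assumed to be a SparkSession that has an "xml" data source
    registered; this function does not create or configure it.
    """
    return spark.read.format("xml").option("rowTag", row_tag).load(path)


# Usage, given a running session (not executed here):
#   supplier = read_row_tag(spark, "WhosebugRafaXML.xml", "supplier")
#   supplier.show()
```

Nested columns such as companyDetails still need further unnesting before they can be written to CSV, as the comment in the Scala code notes.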
The format of this XML is a bit unfamiliar to me.
Have you tried pandas_read_xml?
pip install pandas_read_xml
You can do something like this:
import pandas_read_xml as pdx
df = pdx.read_xml('filename.xml')
To flatten the result, you can use
df = pdx.flatten(df)
or
df = pdx.fully_flatten(df)