Getting text of parent, child and their child

Question

<avis>
<numeroseao>1331795</numeroseao>
<numero>61628-3435560</numero>
<organisme>Ville de Québec</organisme>
<fournisseurs>
  <fournisseur>
    <nomorganisation>APEL ASSOCIATION POUT DU LA MARAISNORD</nomorganisation>
    <adjudicataire>1</adjudicataire>
    <montantsoumis>0.000000</montantsoumis>
    <montantssoumisunite>0</montantssoumisunite>
    <montantcontrat>89732.240000</montantcontrat>
    <montanttotalcontrat>0.000000</montanttotalcontrat>
  </fournisseur>
</fournisseurs>
</avis>

So there is avis, avis has fournisseurs, and fournisseurs has further nodes. How can I get these values into a DataFrame?

I am using the code below

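import xml.etree.ElementTree as ET
# Raw string, so that "\t" in "\temp2" is not read as a tab character
element_tree = ET.parse(r'D:\python_script\temp2\AvisRevisions_20200201_20200229.xml')
root = element_tree.getroot()
for child in root.findall('.//avis/*/*/*'):
    print(child.tag, child.text)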

or

for child in root.findall('.//avis/*'):
    print(child.tag, child.text)

But this gives me either the parent nodes or the child nodes, not all of them.

Answer 1

Because your data is not flat, importing the XML directly into pandas can cause problems. In that case, a library such as pandas_read_xml may be useful:

import pandas_read_xml as pdx

df = pdx.read_xml(xml)
df = pdx.fully_flatten(df)  # this should get you the structure you want

In the lines above, the xml variable is the path to your "AvisRevisions_20200201_20200229.xml" file.

For a flatter structure, you can use pandas (1.3 or later) directly:

import pandas as pd

# Select each <fournisseur>; its child elements become the columns
df = pd.read_xml(xml, xpath="//fournisseur")

If you want the whole "avis" section, you can replace the xpath argument with:

df = pd.read_xml(xml, xpath="//avis")

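With this, pandas should create a DataFrame with the appropriate columns. Note that read_xml only reads the immediate children and attributes of the matched nodes, so nested elements are not expanded into columns. See the pandas docs for read_xml for details.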

Answer 2

Try the approach below

import xml.etree.ElementTree as ET
import pandas as pd

xml = '''<avis>
<numeroseao>1331795</numeroseao>
<numero>61628-3435560</numero>
<organisme>Ville de Québec</organisme>
<fournisseurs>
  <fournisseur>
    <nomorganisation>APEL ASSOCIATION POUT DU LA MARAISNORD</nomorganisation>
    <adjudicataire>1</adjudicataire>
    <montantsoumis>0.000000</montantsoumis>
    <montantssoumisunite>0</montantssoumisunite>
    <montantcontrat>89732.240000</montantcontrat>
    <montanttotalcontrat>0.000000</montanttotalcontrat>
  </fournisseur>
</fournisseurs>
</avis>'''
root = ET.fromstring(xml)

data = []
fournisseur = root.find('.//fournisseur')
data.append({e.tag:e.text for e in fournisseur})
df = pd.DataFrame(data)
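
The snippet above reads only the first fournisseur and leaves out the avis-level fields. As a rough sketch of how the parent and child values could be combined into one row per fournisseur, the following uses only the standard library. The `<export>` wrapper element and the `avis_rows` helper are assumptions made for illustration; the real file's root element is not shown in the question.

```python
import xml.etree.ElementTree as ET

# Sample data from the question, wrapped in a hypothetical <export> root
# (assumption: the real file holds many <avis> elements under one root).
XML = """<export>
<avis>
  <numeroseao>1331795</numeroseao>
  <numero>61628-3435560</numero>
  <organisme>Ville de Québec</organisme>
  <fournisseurs>
    <fournisseur>
      <nomorganisation>APEL ASSOCIATION POUT DU LA MARAISNORD</nomorganisation>
      <adjudicataire>1</adjudicataire>
      <montantcontrat>89732.240000</montantcontrat>
    </fournisseur>
  </fournisseurs>
</avis>
</export>"""


def avis_rows(root):
    """Return one dict per <fournisseur>, merged with its parent <avis> fields."""
    rows = []
    # iter() also yields the root itself if the root is an <avis>
    for avis in root.iter('avis'):
        # Children of <avis> without sub-elements are the parent-level fields
        parent = {e.tag: e.text for e in avis if len(e) == 0}
        for fournisseur in avis.iterfind('./fournisseurs/fournisseur'):
            row = dict(parent)
            row.update({e.tag: e.text for e in fournisseur})
            rows.append(row)
    return rows


rows = avis_rows(ET.fromstring(XML))
print(rows)
```

Passing the result to `pd.DataFrame(rows)` should then give one row per fournisseur with the avis columns repeated. All values are strings at this point, so numeric columns such as montantcontrat would still need converting, and an avis without any fournisseur produces no row in this sketch.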

Tags: python, xml, elementtree, python-3.x