How to extract node information conditional on the information of a sibling node using Python?
I have a list of personId values of interest:
agents = {'id': ['20','32','12']}
Then I have an XML file of households:
<households>
    <household id="980921">
        <members>
            <personId refId="5"/>
            <personId refId="15"/>
            <personId refId="20"/>
        </members>
        <income currency="CHF" period="month">
            8000.0
        </income>
        <attributes>
            <attribute name="numberOfCars" class="java.lang.String">2</attribute>
        </attributes>
    </household>
    <household id="980976">
        <members>
            <personId refId="2891"/>
            <personId refId="100"/>
            <personId refId="2044"/>
        </members>
        <income currency="CHF" period="month">
            8000.0
        </income>
        <attributes>
            <attribute name="numberOfCars" class="java.lang.String">1</attribute>
        </attributes>
    </household>
    <household id="980983">
        <members>
            <personId refId="11110"/>
            <personId refId="32"/>
            <personId refId="34"/>
        </members>
        <income currency="CHF" period="month">
            10000.0
        </income>
        <attributes>
            <attribute name="numberOfCars" class="java.lang.String">0</attribute>
        </attributes>
    </household>
</households>
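What I want is a dataframe that shows the income of the households in which a member belongs to the agents list of interest. Something like this (a plus would be an extra column giving the number of members in the household where the person of interest lives):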
personId income
20 8000.0
32 10000.0
My approach didn't get very far. I struggle to filter on members and then access the information from the "sibling" node. My output is an empty dataframe.
import xml.etree.ElementTree as ET
import pandas as pd

with open(xml) as fd:
    root = ET.parse(fd).getroot()

xpath_fmt = 'household/members/personId[@refId="{}"]/income'
rows = []
for pid in agents['id']:
    xpath = xpath_fmt.format(pid)
    r = root.findall(xpath)
    for res in r:
        rows.append([pid, res.text])

d = pd.DataFrame(rows, columns=['personId', 'income'])
Thanks a lot for your help!
As mentioned in the comments, here is a solution using BeautifulSoup (xml_txt is the XML text from the question):
import pandas as pd
from bs4 import BeautifulSoup

agents = {'id': ['20','32','12']}

soup = BeautifulSoup(xml_txt, 'xml')  # xml_txt is your XML text from the question
css_selector = ','.join('household > members > personId[refId="{}"]'.format(i) for i in agents['id'])

data = {'personId': [], 'income': []}
for person in soup.select(css_selector):
    data['personId'].append(person['refId'])
    data['income'].append(person.find_parent('household').find('income').get_text(strip=True))

df = pd.DataFrame(data)
print(df)
Prints:
personId income
0 20 8000.0
1 32 10000.0
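For comparison, here is a minimal sketch that stays with xml.etree.ElementTree. The attempt in the question returns nothing because income is a child of household, not of personId, so a path ending in personId[...]/income has nothing to match. One way around that is to loop over each household, collect its members, and read the income from the same household element. The householdSize column name is my own choice for the "plus" column, and the XML below is a trimmed copy of the sample with the attributes blocks left out.

```python
import xml.etree.ElementTree as ET
import pandas as pd

# Trimmed copy of the XML from the question (attributes blocks omitted).
xml_txt = """
<households>
    <household id="980921">
        <members>
            <personId refId="5"/>
            <personId refId="15"/>
            <personId refId="20"/>
        </members>
        <income currency="CHF" period="month">
            8000.0
        </income>
    </household>
    <household id="980976">
        <members>
            <personId refId="2891"/>
            <personId refId="100"/>
            <personId refId="2044"/>
        </members>
        <income currency="CHF" period="month">
            8000.0
        </income>
    </household>
    <household id="980983">
        <members>
            <personId refId="11110"/>
            <personId refId="32"/>
            <personId refId="34"/>
        </members>
        <income currency="CHF" period="month">
            10000.0
        </income>
    </household>
</households>
"""

agents = {'id': ['20', '32', '12']}
wanted = set(agents['id'])  # set gives fast membership checks

root = ET.fromstring(xml_txt)
rows = []
for household in root.iter('household'):
    # All member ids of this household, used for filtering and for the size.
    members = [p.get('refId') for p in household.findall('members/personId')]
    # income is a direct child of household; strip the surrounding whitespace.
    income = float(household.findtext('income', default='nan').strip())
    for pid in members:
        if pid in wanted:
            rows.append({
                'personId': pid,
                'income': income,
                'householdSize': len(members),  # hypothetical column name
            })

df = pd.DataFrame(rows, columns=['personId', 'income', 'householdSize'])
print(df)
```

With the sample data this yields two rows, for personId 20 and 32, each with a household size of 3. Agent 12 does not appear in any household, so it is simply absent from the result.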