Python: Parsing XML to a DataFrame with Absent Elements
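I want to parse an XML file in which some elements are absent for some employees. In the example below, not all employees have employment data.
Here is a sample file: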
<Employees
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="https://www.portalvs.sk/regzam/employees.xsd"
generated="2020-09-14">
<Employee Id="87912190">
<GivenName>John</GivenName>
<FamilyName>Test1</FamilyName>
</Employee>
<Employee Id="40286072">
<GivenName>Josh</GivenName>
<FamilyName>Test2</FamilyName>
</Employee>
<Employee Id="79273304">
<GivenName>Mark</GivenName>
<FamilyName>Test3</FamilyName>
</Employee>
<Employee Id="62028843">
<GivenName>Cath</GivenName>
<FamilyName>Test4</FamilyName>
<Employment>
<Workplace Code="995000000">
<University Code="995000000">UniversityTest</University>
</Workplace>
<BeginDate>2013-11-01</BeginDate>
</Employment>
</Employee>
<Employee Id="24030368">
<GivenName>Becky</GivenName>
<FamilyName>Test5</FamilyName>
<Employment>
<Workplace Code="998000000">
<University Code="998000000">UniversityTest2</University>
</Workplace>
<BeginDate>2008-09-01</BeginDate>
</Employment>
</Employee>
</Employees>
I would like to create a DataFrame with employee_id, employee_first_name, employee_last_name, university_code and begin_date. For employees without employment data, I would like the university values to be missing.
employee_id employee_first_name employee_last_name university_code begin_date
87912190 John Test1 NaN NaN
40286072 Josh Test2 NaN NaN
79273304 Mark Test3 NaN NaN
62028843 Cath Test4 995000000 2013-11-01
24030368 Becky Test5 998000000 2008-09-01
Thank you for your help; I am relatively new to Python and completely new to XML parsing.
You can use beautifulsoup to parse the XML. Another option is, for example, lxml.
If txt contains your XML snippet, then this code:
import pandas as pd
from bs4 import BeautifulSoup

# html.parser lowercases tag and attribute names, hence 'employee', 'id', 'code'
soup = BeautifulSoup(txt, 'html.parser')

all_data = []
for e in soup.select('employee'):
    all_data.append({
        'employee_id': e['id'],
        'employee_first_name': e.givenname.text,
        'employee_last_name': e.familyname.text,
        # Employees without <Employment> have no <University> or <BeginDate>
        'university_code': e.university['code'] if e.university else None,
        'begin_date': e.begindate.text if e.begindate else None
    })

df = pd.DataFrame(all_data)
print(df)
creates this DataFrame:
employee_id employee_first_name employee_last_name university_code begin_date
0 87912190 John Test1 None None
1 40286072 Josh Test2 None None
2 79273304 Mark Test3 None None
3 62028843 Cath Test4 995000000 2013-11-01
4 24030368 Becky Test5 998000000 2008-09-01
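As an alternative sketch that needs no third-party parser, the same rows can probably be collected with the standard library's xml.etree.ElementTree. This is not part of the original answer: the helper name parse_employees is hypothetical, the sample is shortened to two employees, and the code assumes the element names and nesting match the sample file exactly (ElementTree is case-sensitive, unlike html.parser).

```python
import xml.etree.ElementTree as ET

# Shortened copy of the sample file from the question (two employees only).
txt = """<Employees
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:schemaLocation="https://www.portalvs.sk/regzam/employees.xsd"
    generated="2020-09-14">
  <Employee Id="87912190">
    <GivenName>John</GivenName>
    <FamilyName>Test1</FamilyName>
  </Employee>
  <Employee Id="62028843">
    <GivenName>Cath</GivenName>
    <FamilyName>Test4</FamilyName>
    <Employment>
      <Workplace Code="995000000">
        <University Code="995000000">UniversityTest</University>
      </Workplace>
      <BeginDate>2013-11-01</BeginDate>
    </Employment>
  </Employee>
</Employees>"""


def parse_employees(xml_text):
    """Return one dict per <Employee>, using None for absent elements.

    Hypothetical helper for illustration; not from the original answer.
    """
    root = ET.fromstring(xml_text)
    rows = []
    for emp in root.iter("Employee"):
        # find() returns None when no matching element exists.
        university = emp.find("Employment/Workplace/University")
        rows.append({
            "employee_id": emp.get("Id"),
            "employee_first_name": emp.findtext("GivenName"),
            "employee_last_name": emp.findtext("FamilyName"),
            "university_code": (
                university.get("Code") if university is not None else None
            ),
            # findtext() also returns None when the path is absent.
            "begin_date": emp.findtext("Employment/BeginDate"),
        })
    return rows


rows = parse_employees(txt)
for row in rows:
    print(row)
```

Passing rows to pd.DataFrame(rows) should give the same columns as above. The absent values are stored as None, which pandas treats as missing in isna().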