Parse XML with Python lxml
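I am trying to parse XML using the Python library lxml and would like the result in a data frame. I am fairly new to Python and to parsing, so please bear with me as I outline the problem. The original XML I am trying to parse is available here
I am interested in a few of the tags found inside "invstOrSec". Below is a snapshot of one "invstOrSec" instance, in which the text attached to the "curCd" tag is USD.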
<?xml version="1.0" encoding="UTF-8"?>
<invstOrSec>
    <name>NIPPON LIFE INSURANCE</name>
    <lei>549300Y0HHMFW3EVWY08</lei>
    <curCd>USD</curCd>
</invstOrSec>
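This is relatively straightforward. My current approach is to first define the relevant tags in a dictionary and then coerce them into a data frame in a loop.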
import os
import pandas as pd
from lxml import etree
# Declare directory
os.chdir('C:/Users/A1610222/Desktop/Form NPORT/pkg/sec-edgar-filings/0001548717/NPORT-P/0001752724-20-040624')
# Set root
xmlfile = "filing-details.xml"
tree = etree.parse(xmlfile)
root = tree.getroot()
# Remove namespace prefixes (iter() replaces the deprecated getiterator())
for elem in root.iter():
    elem.tag = etree.QName(elem).localname
# Remove unused namespace declarations
etree.cleanup_namespaces(root)
# Set path
invstOrSec = root.xpath('//invstOrSec')
# Define tags to extract
vars = {'invstOrSec': {'name', 'lei', 'curCd'}}
# Extract holdings data: collect one dict per holding, then build the data frame once
rows = []
for one in invstOrSec:
    row = {}
    for two in one:
        if two.tag in vars['invstOrSec']:
            row[two.tag] = two.text
    rows.append(row)
sec_info = pd.DataFrame(rows)
Here are the first three rows of sec_info:
name | lei | curCd |
---|---|---|
NIPPON LIFE INSURANCE | 549300Y0HHMFW3EVWY08 | USD |
Lloyds Banking Group PLC | 549300PPXHEU2JF0AM85 | USD |
Enbridge Inc | 98TPTUM4IVMFCZBCUR27 | USD |
However, when the currency is not USD, the XML follows a slightly different structure. See the example below.
<?xml version="1.0" encoding="UTF-8"?>
<invstOrSec>
    <name>ACHMEA BV</name>
    <lei>7245007QUMI1FHIQV531</lei>
    <currencyConditional curCd="EUR" exchangeRt="0.89150400"/>
</invstOrSec>
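This time curCd is replaced by a different tag, currencyConditional, which carries the currency code as an attribute rather than as text. I am struggling to account for these cases while keeping my code as general as possible. I hope I have managed to illustrate the problem. Again, apologies if this is too elementary. Any help would be much appreciated.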
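To make the difference between the two layouts concrete, here is a minimal lxml-only sketch. It assumes namespaces have already been stripped as in the code above, and it uses two inline sample records copied from the snippets in the question rather than the real filing. The helper name `currency_of` and the wrapper element `invstOrSecs` in the sample are illustrative choices, not part of the original code.

```python
import pandas as pd
from lxml import etree

# Two inline sample records mirroring the snippets above (not the real filing).
SAMPLE = b"""<invstOrSecs>
  <invstOrSec>
    <name>NIPPON LIFE INSURANCE</name>
    <lei>549300Y0HHMFW3EVWY08</lei>
    <curCd>USD</curCd>
  </invstOrSec>
  <invstOrSec>
    <name>ACHMEA BV</name>
    <lei>7245007QUMI1FHIQV531</lei>
    <currencyConditional curCd="EUR" exchangeRt="0.89150400"/>
  </invstOrSec>
</invstOrSecs>"""


def currency_of(holding):
    """Return the currency code from either layout, or None if absent."""
    # USD layout: the code is the text of a <curCd> child element.
    code = holding.findtext('curCd')
    if code is None:
        # Non-USD layout: the code is the curCd attribute of <currencyConditional>.
        cond = holding.find('currencyConditional')
        if cond is not None:
            code = cond.get('curCd')
    return code


root = etree.fromstring(SAMPLE)
rows = [
    {
        'name': holding.findtext('name'),
        'lei': holding.findtext('lei'),
        'curCd': currency_of(holding),
    }
    for holding in root.iter('invstOrSec')
]
sec_info = pd.DataFrame(rows)
print(sec_info)
```

Trying the element text first and falling back to the attribute keeps a single curCd column regardless of which layout a holding uses.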
This is a case where you shouldn't reinvent the wheel; use tools that others have developed...
import pandas as pd
import pandas_read_xml as pdx
url = 'https://www.sec.gov/Archives/edgar/data/1548717/000175272420040624/primary_doc.xml'
df = pdx.read_xml(url, ['edgarSubmission', 'formData', 'invstOrSecs', 'invstOrSec'])
# Because of the non-USD currency column, you have to apply one more contortion:
df['currencyConditional'] = df['currencyConditional'].apply(lambda x: x.get('@curCd') if not isinstance(x, float) else "NA")
df[['name', 'lei', 'curCd', 'currencyConditional']]
Output (partial, obviously) - note the last row:
168 BNP PARIBAS R0MUWSFPU8MPRO8K5P83 USD NA
169 Societe Generale O2RNE8IBXP4R0TD8PU41 USD NA
170 BARCLAYS PLC 213800LBQA1Y9L22JB70 NaN GBP
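If a single currency column is preferred, the two columns can be merged afterwards. The following is only a sketch: it builds a small hypothetical frame shaped like the partial output above instead of downloading the filing, and it assumes missing curCd values are NaN, as shown in the last row.

```python
import numpy as np
import pandas as pd

# Hypothetical frame shaped like the partial output above (not fetched from EDGAR).
df = pd.DataFrame({
    'name': ['BNP PARIBAS', 'Societe Generale', 'BARCLAYS PLC'],
    'lei': ['R0MUWSFPU8MPRO8K5P83', 'O2RNE8IBXP4R0TD8PU41', '213800LBQA1Y9L22JB70'],
    'curCd': ['USD', 'USD', np.nan],
    'currencyConditional': ['NA', 'NA', 'GBP'],
})

# Fill missing curCd values from currencyConditional, then drop the helper column.
df['curCd'] = df['curCd'].fillna(df['currencyConditional'])
df = df.drop(columns='currencyConditional')
print(df)
```

Because `fillna` only touches rows where curCd is missing, the "NA" placeholders in currencyConditional never overwrite an existing USD value.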