Complicated XML parsing to CSV (Python 3.x) (Interactive Brokers)
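Beyond the simple XML parsing covered in this link:
parsing interactive broker fundamental data
I have run into a harder case of XML parsing.
The two main errors are:
string indices must be integers
list indices must be integers or slices, not str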
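The code that raised these errors is not shown, so the following is only a sketch of how the two messages usually come about: both are Python TypeErrors raised when a str or a list is indexed with a string key, which tends to happen when a parsed XML node turns out to be plain text or a list of repeated elements rather than a mapping. The helper `describe_failure` and the sample values are hypothetical and only serve the illustration.

```python
def describe_failure(container, key):
    """Index container with key and report the TypeError text, if any."""
    try:
        return container[key]
    except TypeError as exc:
        return f"{type(container).__name__}: {exc}"


# Hypothetical parse results: one node came back as text, one as a list.
company_as_text = "HSBC Holdings plc (Hong Kong)"
ratios_as_list = [
    {"FieldName": "NPRICE", "value": "74.75000"},
    {"FieldName": "MKTCAP", "value": "1493871.00000"},
]

# Indexing a str with a string key raises the first error.
str_error = describe_failure(company_as_text, "Type")
# Indexing a list with a string key raises the second error.
list_error = describe_failure(ratios_as_list, "FieldName")
print(str_error)
print(list_error)

# A repeated element has to be iterated before its keys are looked up.
by_field = {item["FieldName"]: item["value"] for item in ratios_as_list}
print(by_field["NPRICE"])
```

Checking `type(node)` before indexing is a quick way to see which of the two cases applies at a given point in the tree.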
XML =
<ReportSnapshot Major="1" Minor="0" Revision="1">
<CoIDs>
<CoID Type="RepNo">AC317</CoID>
<CoID Type="CompanyName">HSBC Holdings plc (Hong Kong)</CoID>
</CoIDs>
<Issues>
<Issue ID="1" Type="C" Desc="Common Stock" Order="1">
<IssueID Type="Name">Ordinary Shares</IssueID>
<IssueID Type="Ticker">5</IssueID>
<IssueID Type="CUSIP">G4634U169</IssueID>
<IssueID Type="ISIN">GB0005405286</IssueID>
<IssueID Type="RIC">0005.HK</IssueID>
<IssueID Type="SEDOL">6158163</IssueID>
<IssueID Type="DisplayRIC">0005.HK</IssueID>
<IssueID Type="InstrumentPI">312270</IssueID>
<IssueID Type="QuotePI">1049324</IssueID>
<Exchange Code="HKG" Country="HKG">Hong Kong Stock Exchange</Exchange>
<MostRecentSplit Date="2009-03-12">1.14753</MostRecentSplit>
</Issue>
</Issues>
<CoGeneralInfo>
<CoStatus Code="1">Active</CoStatus>
<CoType Code="EQU">Equity Issue</CoType>
<LastModified>2018-07-20</LastModified>
<LatestAvailableAnnual>2017-12-31</LatestAvailableAnnual>
<LatestAvailableInterim>2018-03-31</LatestAvailableInterim>
<Employees LastUpdated="2018-03-31">228899</Employees>
<SharesOut Date="2018-07-25" TotalFloat="19880413090.0">19949959451.0</SharesOut>
<ReportingCurrency Code="USD">U.S. Dollars</ReportingCurrency>
<MostRecentExchange Date="2018-07-25">1.0</MostRecentExchange>
</CoGeneralInfo>
<peerInfo lastUpdated="2018-07-20T09:20:26">
<IndustryInfo>
<Industry type="TRBC" order="1" reported="0" code="5510101010" mnem="">Banks - NEC</Industry>
<Industry type="NAICS" order="1" reported="0" code="52211" mnem="">Commercial Banking</Industry>
<Industry type="NAICS" order="2" reported="0" code="52393" mnem="">Investment Advice</Industry>
<Industry type="NAICS" order="3" reported="0" code="52392" mnem="">Portfolio Management</Industry>
<Industry type="SIC" order="0" reported="1" code="6035" mnem="">Federal Savings Institutions</Industry>
<Industry type="SIC" order="1" reported="0" code="6029" mnem="">Commercial Banks, Nec</Industry>
<Industry type="SIC" order="2" reported="0" code="6282" mnem="">Investment Advice</Industry>
</IndustryInfo>
</peerInfo>
<Ratios PriceCurrency="HKD" ReportingCurrency="USD" ExchangeRate="7.84530" LatestAvailableDate="2017-12-31">
<Group ID="Price and Volume">
<Ratio FieldName="NPRICE" Type="N">74.75000</Ratio>
<Ratio FieldName="NHIG" Type="N">86.00000</Ratio>
<Ratio FieldName="NLOW" Type="N">71.45000</Ratio>
<Ratio FieldName="PDATE" Type="D">2018-07-26T00:00:00</Ratio>
<Ratio FieldName="VOL10DAVG" Type="N">12.85415</Ratio>
<Ratio FieldName="EV" Type="N">2455297.00000</Ratio>
</Group>
<Group ID="Income Statement">
<Ratio FieldName="MKTCAP" Type="N">1493871.00000</Ratio>
<Ratio FieldName="AREV" Type="N">321618.10000</Ratio>
<Ratio FieldName="AEBITD" Type="N">177727.40000</Ratio>
<Ratio FieldName="ANIAC" Type="N">86070.79000</Ratio>
</Group>
</Ratios>
</ReportSnapshot>
I would like to convert this information into CSV format, laid out as follows:
CompanyName | Ticker | Industry type="TRBC" | Industry type="NAICS" | LastModified | ReportingCurrency | NPRICE | MKTCAP
HSBC Holdings plc (Hong Kong) | 5 | Banks - NEC | Commercial Banking | 2018-07-20 | USD | 74.75000 | 1493871.00000
For writing CSV files, Python has the built-in csv module. For parsing the XML, I recommend BeautifulSoup, which makes the problem easy:
xml_data = """<ReportSnapshot Major="1" Minor="0" Revision="1">
<CoIDs>
<CoID Type="RepNo">AC317</CoID>
<CoID Type="CompanyName">HSBC Holdings plc (Hong Kong)</CoID>
</CoIDs>
<Issues>
<Issue ID="1" Type="C" Desc="Common Stock" Order="1">
<IssueID Type="Name">Ordinary Shares</IssueID>
<IssueID Type="Ticker">5</IssueID>
<IssueID Type="CUSIP">G4634U169</IssueID>
<IssueID Type="ISIN">GB0005405286</IssueID>
<IssueID Type="RIC">0005.HK</IssueID>
<IssueID Type="SEDOL">6158163</IssueID>
<IssueID Type="DisplayRIC">0005.HK</IssueID>
<IssueID Type="InstrumentPI">312270</IssueID>
<IssueID Type="QuotePI">1049324</IssueID>
<Exchange Code="HKG" Country="HKG">Hong Kong Stock Exchange</Exchange>
<MostRecentSplit Date="2009-03-12">1.14753</MostRecentSplit>
</Issue>
</Issues>
<CoGeneralInfo>
<CoStatus Code="1">Active</CoStatus>
<CoType Code="EQU">Equity Issue</CoType>
<LastModified>2018-07-20</LastModified>
<LatestAvailableAnnual>2017-12-31</LatestAvailableAnnual>
<LatestAvailableInterim>2018-03-31</LatestAvailableInterim>
<Employees LastUpdated="2018-03-31">228899</Employees>
<SharesOut Date="2018-07-25" TotalFloat="19880413090.0">19949959451.0</SharesOut>
<ReportingCurrency Code="USD">U.S. Dollars</ReportingCurrency>
<MostRecentExchange Date="2018-07-25">1.0</MostRecentExchange>
</CoGeneralInfo>
<peerInfo lastUpdated="2018-07-20T09:20:26">
<IndustryInfo>
<Industry type="TRBC" order="1" reported="0" code="5510101010" mnem="">Banks - NEC</Industry>
<Industry type="NAICS" order="1" reported="0" code="52211" mnem="">Commercial Banking</Industry>
<Industry type="NAICS" order="2" reported="0" code="52393" mnem="">Investment Advice</Industry>
<Industry type="NAICS" order="3" reported="0" code="52392" mnem="">Portfolio Management</Industry>
<Industry type="SIC" order="0" reported="1" code="6035" mnem="">Federal Savings Institutions</Industry>
<Industry type="SIC" order="1" reported="0" code="6029" mnem="">Commercial Banks, Nec</Industry>
<Industry type="SIC" order="2" reported="0" code="6282" mnem="">Investment Advice</Industry>
</IndustryInfo>
</peerInfo>
<Ratios PriceCurrency="HKD" ReportingCurrency="USD" ExchangeRate="7.84530" LatestAvailableDate="2017-12-31">
<Group ID="Price and Volume">
<Ratio FieldName="NPRICE" Type="N">74.75000</Ratio>
<Ratio FieldName="NHIG" Type="N">86.00000</Ratio>
<Ratio FieldName="NLOW" Type="N">71.45000</Ratio>
<Ratio FieldName="PDATE" Type="D">2018-07-26T00:00:00</Ratio>
<Ratio FieldName="VOL10DAVG" Type="N">12.85415</Ratio>
<Ratio FieldName="EV" Type="N">2455297.00000</Ratio>
</Group>
<Group ID="Income Statement">
<Ratio FieldName="MKTCAP" Type="N">1493871.00000</Ratio>
<Ratio FieldName="AREV" Type="N">321618.10000</Ratio>
<Ratio FieldName="AEBITD" Type="N">177727.40000</Ratio>
<Ratio FieldName="ANIAC" Type="N">86070.79000</Ratio>
</Group>
</Ratios>
</ReportSnapshot>"""
from bs4 import BeautifulSoup
import csv

# The 'xml' parser requires the lxml package to be installed.
soup = BeautifulSoup(xml_data, 'xml')

headers = ['CompanyName',
           'Ticker',
           'Industry type="TRBC"',
           'Industry type="NAICS"',
           'LastModified',
           'ReportingCurrency',
           'NPRICE',
           'MKTCAP']

with open('data.csv', 'w', newline='') as csvfile:
    csvwriter = csv.writer(csvfile, delimiter=',', quotechar='"')
    csvwriter.writerow(headers)

    # select_one() returns the first match, e.g. the first NAICS industry.
    row = []
    row.append(soup.select_one('CoID[Type="CompanyName"]').text)
    row.append(soup.select_one('IssueID[Type="Ticker"]').text)
    row.append(soup.select_one('Industry[type="TRBC"]').text)
    row.append(soup.select_one('Industry[type="NAICS"]').text)
    row.append(soup.select_one('LastModified').text)
    # Read the Code attribute ("USD"), not the element text ("U.S. Dollars").
    row.append(soup.select_one('ReportingCurrency[Code]')['Code'])
    row.append(soup.select_one('Ratio[FieldName="NPRICE"]').text)
    row.append(soup.select_one('Ratio[FieldName="MKTCAP"]').text)
    csvwriter.writerow(row)
The result is written to the data.csv file (screenshot from LibreOffice).
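If installing BeautifulSoup and lxml is not an option, roughly the same row can be built with the standard library alone. The sketch below is one possible variation, not the method from the answer: it uses `xml.etree.ElementTree` with its limited XPath support and `csv.DictWriter`. The names `SAMPLE`, `FIELDS`, `snapshot_to_row` and `rows_to_csv` are made up for this example, `SAMPLE` is a trimmed copy of the XML from the question, and writing an empty string for a missing element is an assumption about the desired behaviour.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Trimmed sample using the same element names as the XML in the question.
SAMPLE = """<ReportSnapshot Major="1" Minor="0" Revision="1">
<CoIDs><CoID Type="CompanyName">HSBC Holdings plc (Hong Kong)</CoID></CoIDs>
<Issues><Issue ID="1"><IssueID Type="Ticker">5</IssueID></Issue></Issues>
<CoGeneralInfo>
<LastModified>2018-07-20</LastModified>
<ReportingCurrency Code="USD">U.S. Dollars</ReportingCurrency>
</CoGeneralInfo>
<peerInfo><IndustryInfo>
<Industry type="TRBC" order="1">Banks - NEC</Industry>
<Industry type="NAICS" order="1">Commercial Banking</Industry>
<Industry type="NAICS" order="2">Investment Advice</Industry>
</IndustryInfo></peerInfo>
<Ratios>
<Group ID="Price and Volume"><Ratio FieldName="NPRICE" Type="N">74.75000</Ratio></Group>
<Group ID="Income Statement"><Ratio FieldName="MKTCAP" Type="N">1493871.00000</Ratio></Group>
</Ratios>
</ReportSnapshot>"""

# CSV column -> (XPath, attribute to read, or None to read the element text).
FIELDS = {
    "CompanyName": (".//CoID[@Type='CompanyName']", None),
    "Ticker": (".//IssueID[@Type='Ticker']", None),
    'Industry type="TRBC"': (".//Industry[@type='TRBC']", None),
    'Industry type="NAICS"': (".//Industry[@type='NAICS']", None),
    "LastModified": (".//LastModified", None),
    "ReportingCurrency": (".//CoGeneralInfo/ReportingCurrency", "Code"),
    "NPRICE": (".//Ratio[@FieldName='NPRICE']", None),
    "MKTCAP": (".//Ratio[@FieldName='MKTCAP']", None),
}


def snapshot_to_row(xml_text):
    """Return one dict per report; missing elements become empty strings."""
    root = ET.fromstring(xml_text)
    row = {}
    for column, (path, attr) in FIELDS.items():
        node = root.find(path)  # first match only, like select_one()
        if node is None:
            row[column] = ""
        elif attr is None:
            row[column] = (node.text or "").strip()
        else:
            row[column] = node.get(attr, "")
    return row


def rows_to_csv(rows):
    """Serialise the rows to CSV text, header first."""
    buffer = io.StringIO(newline="")
    writer = csv.DictWriter(buffer, fieldnames=list(FIELDS))
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


csv_text = rows_to_csv([snapshot_to_row(SAMPLE)])
print(csv_text)
```

Keeping the column-to-path mapping in one dict means a new column is a one-line change, and passing several reports to `rows_to_csv` produces one CSV line per company. To write a file instead of a string, the same `DictWriter` calls can target a handle opened with `open('data.csv', 'w', newline='')`.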