Complicated XML parsing to CSV (Python 3.x) (Interactive Broker)

Beyond the simple XML parsing covered in the following link:

parsing interactive broker fundamental data

I have run into a more difficult XML parsing case:

Two main errors:

XML =

<ReportSnapshot Major="1" Minor="0" Revision="1">
    <CoIDs>
       <CoID Type="RepNo">AC317</CoID>
        <CoID Type="CompanyName">HSBC Holdings plc (Hong Kong)</CoID>
    </CoIDs>
    <Issues>
        <Issue ID="1" Type="C" Desc="Common Stock" Order="1">
            <IssueID Type="Name">Ordinary Shares</IssueID>
            <IssueID Type="Ticker">5</IssueID>
            <IssueID Type="CUSIP">G4634U169</IssueID>
            <IssueID Type="ISIN">GB0005405286</IssueID>
            <IssueID Type="RIC">0005.HK</IssueID>
            <IssueID Type="SEDOL">6158163</IssueID>
            <IssueID Type="DisplayRIC">0005.HK</IssueID>
            <IssueID Type="InstrumentPI">312270</IssueID>
            <IssueID Type="QuotePI">1049324</IssueID>
            <Exchange Code="HKG" Country="HKG">Hong Kong Stock Exchange</Exchange>
            <MostRecentSplit Date="2009-03-12">1.14753</MostRecentSplit>
        </Issue>
    </Issues>

    <CoGeneralInfo>
        <CoStatus Code="1">Active</CoStatus>
        <CoType Code="EQU">Equity Issue</CoType>
        <LastModified>2018-07-20</LastModified>
        <LatestAvailableAnnual>2017-12-31</LatestAvailableAnnual>
        <LatestAvailableInterim>2018-03-31</LatestAvailableInterim>
        <Employees LastUpdated="2018-03-31">228899</Employees>
        <SharesOut Date="2018-07-25" TotalFloat="19880413090.0">19949959451.0</SharesOut>
        <ReportingCurrency Code="USD">U.S. Dollars</ReportingCurrency>
        <MostRecentExchange Date="2018-07-25">1.0</MostRecentExchange>
    </CoGeneralInfo>
    <peerInfo lastUpdated="2018-07-20T09:20:26">
        <IndustryInfo>
            <Industry type="TRBC" order="1" reported="0" code="5510101010" mnem="">Banks - NEC</Industry>
            <Industry type="NAICS" order="1" reported="0" code="52211" mnem="">Commercial Banking</Industry>
            <Industry type="NAICS" order="2" reported="0" code="52393" mnem="">Investment Advice</Industry>
            <Industry type="NAICS" order="3" reported="0" code="52392" mnem="">Portfolio Management</Industry>
            <Industry type="SIC" order="0" reported="1" code="6035" mnem="">Federal Savings Institutions</Industry>
            <Industry type="SIC" order="1" reported="0" code="6029" mnem="">Commercial Banks, Nec</Industry>
            <Industry type="SIC" order="2" reported="0" code="6282" mnem="">Investment Advice</Industry>
        </IndustryInfo>
    </peerInfo>

    <Ratios PriceCurrency="HKD" ReportingCurrency="USD" ExchangeRate="7.84530" LatestAvailableDate="2017-12-31">
        <Group ID="Price and Volume">
            <Ratio FieldName="NPRICE" Type="N">74.75000</Ratio>
            <Ratio FieldName="NHIG" Type="N">86.00000</Ratio>
            <Ratio FieldName="NLOW" Type="N">71.45000</Ratio>
            <Ratio FieldName="PDATE" Type="D">2018-07-26T00:00:00</Ratio>
            <Ratio FieldName="VOL10DAVG" Type="N">12.85415</Ratio>
            <Ratio FieldName="EV" Type="N">2455297.00000</Ratio>
        </Group>
        <Group ID="Income Statement">
            <Ratio FieldName="MKTCAP" Type="N">1493871.00000</Ratio>
            <Ratio FieldName="AREV" Type="N">321618.10000</Ratio>
            <Ratio FieldName="AEBITD" Type="N">177727.40000</Ratio>
            <Ratio FieldName="ANIAC" Type="N">86070.79000</Ratio>
        </Group>
    </Ratios>
</ReportSnapshot>

I would like to convert this information to CSV, in the following format:

CompanyName                     Ticker   Industry type="TRBC" Industry type="NAICS"  LastModified   ReportingCurrency   NPRICE     MKTCAP
HSBC Holdings plc (Hong Kong)    5       Banks - NEC          Commercial Banking       2018-07-20      USD              74.75000   1493871.00000

For writing CSV files, Python has the built-in csv module. For parsing the XML, I recommend BeautifulSoup, which makes the problem easy:

xml_data = """<ReportSnapshot Major="1" Minor="0" Revision="1">
    <CoIDs>
       <CoID Type="RepNo">AC317</CoID>
        <CoID Type="CompanyName">HSBC Holdings plc (Hong Kong)</CoID>
    </CoIDs>
    <Issues>
        <Issue ID="1" Type="C" Desc="Common Stock" Order="1">
            <IssueID Type="Name">Ordinary Shares</IssueID>
            <IssueID Type="Ticker">5</IssueID>
            <IssueID Type="CUSIP">G4634U169</IssueID>
            <IssueID Type="ISIN">GB0005405286</IssueID>
            <IssueID Type="RIC">0005.HK</IssueID>
            <IssueID Type="SEDOL">6158163</IssueID>
            <IssueID Type="DisplayRIC">0005.HK</IssueID>
            <IssueID Type="InstrumentPI">312270</IssueID>
            <IssueID Type="QuotePI">1049324</IssueID>
            <Exchange Code="HKG" Country="HKG">Hong Kong Stock Exchange</Exchange>
            <MostRecentSplit Date="2009-03-12">1.14753</MostRecentSplit>
        </Issue>
    </Issues>

    <CoGeneralInfo>
        <CoStatus Code="1">Active</CoStatus>
        <CoType Code="EQU">Equity Issue</CoType>
        <LastModified>2018-07-20</LastModified>
        <LatestAvailableAnnual>2017-12-31</LatestAvailableAnnual>
        <LatestAvailableInterim>2018-03-31</LatestAvailableInterim>
        <Employees LastUpdated="2018-03-31">228899</Employees>
        <SharesOut Date="2018-07-25" TotalFloat="19880413090.0">19949959451.0</SharesOut>
        <ReportingCurrency Code="USD">U.S. Dollars</ReportingCurrency>
        <MostRecentExchange Date="2018-07-25">1.0</MostRecentExchange>
    </CoGeneralInfo>
    <peerInfo lastUpdated="2018-07-20T09:20:26">
        <IndustryInfo>
            <Industry type="TRBC" order="1" reported="0" code="5510101010" mnem="">Banks - NEC</Industry>
            <Industry type="NAICS" order="1" reported="0" code="52211" mnem="">Commercial Banking</Industry>
            <Industry type="NAICS" order="2" reported="0" code="52393" mnem="">Investment Advice</Industry>
            <Industry type="NAICS" order="3" reported="0" code="52392" mnem="">Portfolio Management</Industry>
            <Industry type="SIC" order="0" reported="1" code="6035" mnem="">Federal Savings Institutions</Industry>
            <Industry type="SIC" order="1" reported="0" code="6029" mnem="">Commercial Banks, Nec</Industry>
            <Industry type="SIC" order="2" reported="0" code="6282" mnem="">Investment Advice</Industry>
        </IndustryInfo>
    </peerInfo>

    <Ratios PriceCurrency="HKD" ReportingCurrency="USD" ExchangeRate="7.84530" LatestAvailableDate="2017-12-31">
        <Group ID="Price and Volume">
            <Ratio FieldName="NPRICE" Type="N">74.75000</Ratio>
            <Ratio FieldName="NHIG" Type="N">86.00000</Ratio>
            <Ratio FieldName="NLOW" Type="N">71.45000</Ratio>
            <Ratio FieldName="PDATE" Type="D">2018-07-26T00:00:00</Ratio>
            <Ratio FieldName="VOL10DAVG" Type="N">12.85415</Ratio>
            <Ratio FieldName="EV" Type="N">2455297.00000</Ratio>
        </Group>
        <Group ID="Income Statement">
            <Ratio FieldName="MKTCAP" Type="N">1493871.00000</Ratio>
            <Ratio FieldName="AREV" Type="N">321618.10000</Ratio>
            <Ratio FieldName="AEBITD" Type="N">177727.40000</Ratio>
            <Ratio FieldName="ANIAC" Type="N">86070.79000</Ratio>
        </Group>
    </Ratios>
</ReportSnapshot>"""

from bs4 import BeautifulSoup
import csv

# The 'xml' parser is backed by lxml, so lxml must be installed alongside beautifulsoup4.
soup = BeautifulSoup(xml_data, 'xml')

headers = ['CompanyName',
           'Ticker',
           'Industry type="TRBC"',
           'Industry type="NAICS"',
           'LastModified',
           'ReportingCurrency',
           'NPRICE',
           'MKTCAP']


with open('data.csv', 'w', newline='') as csvfile:
    csvwriter = csv.writer(csvfile, delimiter=',', quotechar='"')
    csvwriter.writerow(headers)
    row = []
    row.append(soup.select_one('CoID[Type="CompanyName"]').text)
    row.append(soup.select_one('IssueID[Type="Ticker"]').text)
    row.append(soup.select_one('Industry[type="TRBC"]').text)
    row.append(soup.select_one('Industry[type="NAICS"]').text)
    row.append(soup.select_one('LastModified').text)
    row.append(soup.select_one('ReportingCurrency[Code]')['Code'])
    row.append(soup.select_one('Ratio[FieldName="NPRICE"]').text)
    row.append(soup.select_one('Ratio[FieldName="MKTCAP"]').text)
    csvwriter.writerow(row)

The result in the data.csv file (screenshot from LibreOffice):
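
If installing BeautifulSoup and lxml is not an option, roughly the same extraction can be sketched with the standard library's xml.etree.ElementTree. This is a swapped-in technique, not the approach used above. The names SAMPLE_XML, FIELDS, snapshot_to_row and rows_to_csv are hypothetical and chosen only for illustration, and the sample is a trimmed-down copy of the report that keeps just the elements being read. The sketch also assumes that a missing element should become an empty cell rather than raise an error, which may or may not be what you want for real Interactive Brokers reports.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Trimmed-down sample using the same element and attribute names as the report.
SAMPLE_XML = """<ReportSnapshot Major="1" Minor="0" Revision="1">
    <CoIDs>
        <CoID Type="CompanyName">HSBC Holdings plc (Hong Kong)</CoID>
    </CoIDs>
    <Issues>
        <Issue ID="1" Type="C" Desc="Common Stock" Order="1">
            <IssueID Type="Ticker">5</IssueID>
        </Issue>
    </Issues>
    <CoGeneralInfo>
        <LastModified>2018-07-20</LastModified>
        <ReportingCurrency Code="USD">U.S. Dollars</ReportingCurrency>
    </CoGeneralInfo>
    <peerInfo lastUpdated="2018-07-20T09:20:26">
        <IndustryInfo>
            <Industry type="TRBC" order="1" code="5510101010">Banks - NEC</Industry>
            <Industry type="NAICS" order="1" code="52211">Commercial Banking</Industry>
            <Industry type="NAICS" order="2" code="52393">Investment Advice</Industry>
        </IndustryInfo>
    </peerInfo>
    <Ratios PriceCurrency="HKD" ReportingCurrency="USD">
        <Group ID="Price and Volume">
            <Ratio FieldName="NPRICE" Type="N">74.75000</Ratio>
        </Group>
        <Group ID="Income Statement">
            <Ratio FieldName="MKTCAP" Type="N">1493871.00000</Ratio>
        </Group>
    </Ratios>
</ReportSnapshot>"""

# Column header -> (ElementTree path, attribute to read; None means element text).
FIELDS = {
    'CompanyName': (".//CoID[@Type='CompanyName']", None),
    'Ticker': (".//IssueID[@Type='Ticker']", None),
    'Industry type="TRBC"': (".//Industry[@type='TRBC']", None),
    'Industry type="NAICS"': (".//Industry[@type='NAICS']", None),
    'LastModified': (".//LastModified", None),
    'ReportingCurrency': (".//ReportingCurrency", 'Code'),
    'NPRICE': (".//Ratio[@FieldName='NPRICE']", None),
    'MKTCAP': (".//Ratio[@FieldName='MKTCAP']", None),
}


def snapshot_to_row(xml_text):
    """Return one CSV row (list of strings) for a single ReportSnapshot document."""
    root = ET.fromstring(xml_text)
    row = []
    for path, attr in FIELDS.values():
        node = root.find(path)  # first match in document order, or None
        if node is None:
            row.append('')  # assumption: a missing field becomes an empty cell
        elif attr is None:
            row.append((node.text or '').strip())
        else:
            row.append(node.get(attr, ''))
    return row


def rows_to_csv(rows):
    """Serialize the header plus the given rows to CSV text."""
    buffer = io.StringIO(newline='')
    writer = csv.writer(buffer)
    writer.writerow(FIELDS.keys())
    writer.writerows(rows)
    return buffer.getvalue()


csv_text = rows_to_csv([snapshot_to_row(SAMPLE_XML)])
print(csv_text)
```

Keeping the column-to-path mapping in one dictionary means that adding another ratio, say NHIG, is a one-line change, and looping snapshot_to_row over several reports gives one row per company. To write a file instead of a string, the same csv.writer calls can be pointed at open('data.csv', 'w', newline='') as in the answer above.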