Pandas read xml not working properly for single tag xml

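I am using the pandas_read_xml package to read XML files and process them into pandas DataFrames. In the vast majority of cases the package works perfectly well for my purposes. However, when I read a URL whose XML contains only a single tag, the DataFrame output is a bit off. Let me illustrate this with the following two examples.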

# Import package
import pandas_read_xml as pdx
from pandas_read_xml import fully_flatten

# Example 1
url_1 = 'https://www.sec.gov/Archives/edgar/data/1279392/000114554921008161/primary_doc.xml'
df_1 = pdx.read_xml(url_1, ['edgarSubmission', 'formData', 'invstOrSecs', 'invstOrSec'])
df_1 = pdx.fully_flatten(df_1)

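The resulting df_1 contains 163 rows and 31 columns, where each row corresponds to a unique security. This is the result I want. However, the output is a bit odd when I try to read an XML in which the tag 'invstOrSec' occurs only once.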

# Example 2
url_2 = 'https://www.sec.gov/Archives/edgar/data/1279394/000114554921008162/primary_doc.xml'
df_2 = pdx.read_xml(url_2, ['edgarSubmission', 'formData', 'invstOrSecs', 'invstOrSec'])
df_2 = pdx.fully_flatten(df_2)

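The resulting df_2 contains 6 rows and 19 columns. I really cannot understand why it has 6 rows when it should have just 1. I have observed that this behavior only occurs when the tag 'invstOrSec' appears exactly once. Any help with this would be greatly appreciated. Please let me know if my question is unclear.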

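First of all, thanks for the feedback! I wrote pandas-read-xml because pandas did not have a pd.read_xml() implementation. You (and the rest of us) will be glad to know that read_xml is coming soon in the development version of pandas! (https://pandas.pydata.org/docs/dev/reference/api/pandas.read_xml.html)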

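As for your current predicament, this is a consequence of how XML is structured (and one of the many things I dislike about it). Unlike JSON, which can return a single element inside a list, an XML structure with only one occurrence of a tag is interpreted as a single value rather than as a list.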
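To make the ambiguity concrete, here is a minimal sketch that uses only the standard library. The `to_dict` helper is a hypothetical, simplified stand-in for the convention that xmltodict-style converters follow (repeated sibling tags become a list, a lone tag becomes a plain value); it is not the actual code of xmltodict or pandas_read_xml, and the tiny XML strings are made up for illustration.

```python
import xml.etree.ElementTree as ET


def to_dict(element):
    """Convert an element to nested dicts; repeated child tags become lists."""
    children = list(element)
    if not children:
        return element.text
    result = {}
    for child in children:
        value = to_dict(child)
        if child.tag in result:
            # Second (or later) occurrence: promote the entry to a list.
            if not isinstance(result[child.tag], list):
                result[child.tag] = [result[child.tag]]
            result[child.tag].append(value)
        else:
            # First occurrence: stored as a plain value, not a list.
            result[child.tag] = value
    return result


many = "<rows><row><name>A</name></row><row><name>B</name></row></rows>"
one = "<rows><row><name>A</name></row></rows>"

many_rows = to_dict(ET.fromstring(many))["row"]
one_row = to_dict(ET.fromstring(one))["row"]

print(type(many_rows).__name__, many_rows)  # a list of two records
print(type(one_row).__name__, one_row)      # a single dict, not a list
```

Nothing in the second document tells the converter that `row` was meant to be a one-element list, so downstream code sees a dictionary of fields instead of a list of records.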

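Basically, if there is only one "row" tag, then the "column" tags are now treated as the row tags... I'm not making much sense, am I? Let me explain using your example.

Here is how I suggest you use it: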

# Import package
import pandas_read_xml as pdx
from pandas_read_xml import fully_flatten

# Example 1
url_1 = 'https://www.sec.gov/Archives/edgar/data/1279392/000114554921008161/primary_doc.xml'
df_1 =  pdx.read_xml(url_1,['edgarSubmission', 'formData','invstOrSecs', 'invstOrSec']).pipe(fully_flatten)

# Example 2
url_2 = "https://www.sec.gov/Archives/edgar/data/1279394/000114554921008162/primary_doc.xml"
df_2  = pdx.read_xml(url_2,['edgarSubmission', 'formData', 'invstOrSecs'], transpose=True).pipe(fully_flatten)
df_2

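What is the difference?

In Example 1, you already expect multiple 'invstOrSec' tags inside the 'invstOrSecs' tag. So, behind the scenes, passing root_tag_list=['edgarSubmission', 'formData', 'invstOrSecs', 'invstOrSec'] returns a list. The fully_flatten process first explodes that list into rows.

In Example 2, if you use the same root_tag_list, pandas is not reading a list. Instead, it is reading a dictionary that corresponds to a single row. In effect, it treats the tags that were meant to be columns as rows. So I would pass the tag one level above as the root tag, transpose the result, and then apply fully_flatten.

Yes... I know... it is a workaround. But then again, I did not create pandas-read-xml hoping to solve every problem. It was always meant as a stopgap until pandas natively supports reading XML (which looks like it is coming soon).

Let me know how it goes!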

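Edit:

As for how to make the XML-to-DataFrame conversion adapt to whether the XML has a single "row" tag or several, I have the following two options.

In the multi-row case, the result is a DataFrame with an integer index (row numbers), whereas in the single-row case the DataFrame index consists of strings that were meant to be column names. So one strategy is to detect this and redo the read accordingly. (You could use a smarter approach to avoid downloading the file twice.)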

# Import package
import pandas as pd
import pandas_read_xml as pdx
from pandas_read_xml import fully_flatten

# Example 3

dfs = []
url_components = ['1279392/000114554921008161', '1279394/000114554921008162']

for url_component in url_components:
    url = f'https://www.sec.gov/Archives/edgar/data/{url_component}/primary_doc.xml'
    temp = pdx.read_xml(url, ['edgarSubmission', 'formData', 'invstOrSecs'])
    if 0 not in temp.index:
        temp = pdx.read_xml(url, ['edgarSubmission', 'formData', 'invstOrSecs'], transpose=True)
    else:
        temp = pdx.read_xml(url, ['edgarSubmission', 'formData', 'invstOrSecs', 'invstOrSec'])
    dfs.append(temp)

df = pd.concat(dfs, ignore_index=True).pipe(fully_flatten)

df
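
The parenthetical about avoiding the repeated download could be sketched roughly as follows: parse once, then normalize the parsed value so that it is always a list of row records. The `ensure_list` helper and the two hand-written dictionaries (standing in for what a parser might return for 'invstOrSecs') are assumptions for illustration, not part of pandas_read_xml.

```python
def ensure_list(value):
    """Return value unchanged if it is already a list, else wrap it in one."""
    if value is None:
        return []
    return value if isinstance(value, list) else [value]


# Hand-written stand-ins for parsed XML: many rows vs. a single row.
parsed_many = {"invstOrSec": [{"name": "Fund A"}, {"name": "Fund B"}]}
parsed_one = {"invstOrSec": {"name": "Fund C"}}

records = []
for parsed in (parsed_many, parsed_one):
    # After normalization both shapes can be handled identically.
    records.extend(ensure_list(parsed["invstOrSec"]))

print(records)
# A list of records like this can be passed to pd.DataFrame(records)
# and flattened afterwards, without reading the source a second time.
```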

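The other option is to use the underlying tools. There is no magic behind pandas_read_xml; it uses a package called xmltodict. Read the XML, convert it to a dictionary, then to pandas, then flatten. The only downside is that, because the tag name 'invstOrSec' is retained, it becomes a prefix of the column names. You should be able to remove it easily.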

# Import package
import pandas as pd
import pandas_read_xml as pdx
import xmltodict
from pandas_read_xml import fully_flatten

# Example 4

url_components = ['1279392/000114554921008161', '1279394/000114554921008162']
xmldicts = []

for url_component in url_components:
    url = f'https://www.sec.gov/Archives/edgar/data/{url_component}/primary_doc.xml'
    xml = pdx.read_xml_from_url(url)
    xmldicts.append(xmltodict.parse(xml)['edgarSubmission']['formData']['invstOrSecs'])
    
df = pd.DataFrame.from_dict(xmldicts).pipe(fully_flatten)

df
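
Removing the prefix is a matter of renaming columns. The sketch below works on a plain list of names so that it runs without any data; the `invstOrSec|` prefix and the `|` separator are assumptions about how the flattened columns are named, so check `df.columns` on your own output first.

```python
def strip_prefix(name, prefix):
    """Drop prefix from name if present; otherwise return name unchanged."""
    return name[len(prefix):] if name.startswith(prefix) else name


# Hypothetical flattened column names; the real ones may differ.
columns = ["invstOrSec|name", "invstOrSec|cusip", "invstOrSec|identifiers|isin|@value"]
prefix = "invstOrSec|"

cleaned = [strip_prefix(col, prefix) for col in columns]
print(cleaned)

# With a DataFrame, the same function can be applied through rename:
# df = df.rename(columns=lambda col: strip_prefix(col, prefix))
```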

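Hope this helps!

Edit:

So, I have updated the package (now version 0.2.0). pandas_read_xml should now, by default, treat the root tag as the rows of the resulting pandas DataFrame, so there is no need to distinguish between XML files that sometimes have a single "row" and sometimes have several.

If this causes a problem in other cases, there is a new parameter root_is_rows, which defaults to True but can be set to False.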

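In fact, in the upcoming Pandas 1.3, read_xml will let you migrate parsed nodes into a DataFrame. However, because XML can have many dimensions beyond the two dimensions of rows and columns, as the documentation notes:

This method is best designed to import shallow XML documents

Therefore, nested elements are not picked up right away, as shown here with only about 20 columns. Note that the namespaces argument is required because of the default namespace in the document.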

Pandas 1.3+

import pandas as pd

url = "https://www.sec.gov/Archives/edgar/data/1279392/000114554921008161/primary_doc.xml"
df = pd.read_xml(url, xpath="//edgar:invstOrSec",
                 namespaces={"edgar": "http://www.sec.gov/edgar/nport"})

print(df)
#                                                   name  lei                                              title      cusip  ...  fairValLevel  securityLending  assetCat debtSec
# 0                                       Tastemade Inc.  NaN                                     Tastemade Inc.  999999999  ...           3.0              NaN      None     NaN
# 1    Regatta XV Funding Ltd., Subordinated Note, Pr...  NaN  Regatta XV Funding Ltd., Subordinated Note, Pr...  75888PAC7  ...           2.0              NaN  ABS-CBDO     NaN
# 2                Hired, Inc., Series C Preferred Stock  NaN              Hired, Inc., Series C Preferred Stock        NaN  ...           3.0              NaN        EP     NaN
# 3                      WESTVIEW CAPITAL PARTNERS II LP  NaN                    WESTVIEW CAPITAL PARTNERS II LP  999999999  ...           NaN              NaN      None     NaN
# 4                       VOYAGER CAPITAL FUND III, L.P.  NaN                     VOYAGER CAPITAL FUND III, L.P.  999999999  ...           NaN              NaN      None     NaN
# ..                                                 ...  ...                                                ...        ...  ...           ...              ...       ...     ...
# 158              ARCLIGHT ENERGY PARTNERS FUND V, L.P.  NaN              ARCLIGHT ENERGY PARTNERS FUND V, L.P.  999999999  ...           NaN              NaN      None     NaN
# 159                       ALLOY MERCHANT PARTNERS L.P.  NaN                       ALLOY MERCHANT PARTNERS L.P.  999999999  ...           NaN              NaN      None     NaN
# 160  ADVENT LATIN AMERICAN PRIVATE EQUITY FUND V-F ...  NaN  ADVENT LATIN AMERICAN PRIVATE EQUITY FUND V-F ...  999999999  ...           NaN              NaN      None     NaN
# 161                   ABRY ADVANCED SECURITIES FUND LP  NaN                   ABRY ADVANCED SECURITIES FUND LP  999999999  ...           NaN              NaN      None     NaN
# 162  ADVENT LATIN AMERICAN PRIVATE EQUITY FUND IV-F...  NaN  ADVENT LATIN AMERICAN PRIVATE EQUITY FUND IV-F...  999999999  ...           NaN              NaN      None     NaN

# [163 rows x 20 columns]


url = "https://www.sec.gov/Archives/edgar/data/1279394/000114554921008162/primary_doc.xml"
df = pd.read_xml(url, xpath="//edgar:invstOrSec", 
                 namespaces={"edgar": "http://www.sec.gov/edgar/nport"})

print(df)
#                                        name  lei                                     title      cusip  ...  invCountry  isRestrictedSec fairValLevel securityLending
# 0  Salient Private Access Master Fund, L.P.  NaN  Salient Private Access Master Fund, L.P.  999999999  ...          US                Y          NaN             NaN

# [1 rows x 18 columns]
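
To see why the namespaces argument matters, here is a small standard-library sketch. The inline XML is a made-up fragment that only borrows the namespace URI from the filings above; the `edgar` prefix is an arbitrary label chosen for the lookup and does not need to appear in the document itself.

```python
import xml.etree.ElementTree as ET

xml = """<edgarSubmission xmlns="http://www.sec.gov/edgar/nport">
  <formData>
    <invstOrSecs>
      <invstOrSec><name>Example Fund</name></invstOrSec>
    </invstOrSecs>
  </formData>
</edgarSubmission>"""

root = ET.fromstring(xml)
ns = {"edgar": "http://www.sec.gov/edgar/nport"}

# Without a prefix the path looks for elements in no namespace: no match.
unqualified = root.findall(".//invstOrSec")

# With the prefix mapped to the default namespace URI the element is found.
qualified = root.findall(".//edgar:invstOrSec", ns)

print(len(unqualified), len(qualified))
```

Every element in a document with a default namespace belongs to that namespace, so a path without a prefix matches nothing.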

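Fortunately, read_xml with the default lxml parser supports XSLT (a special-purpose language designed to transform XML documents). With XSLT, you can flatten the nodes needed for the migration and retrieve 32 columns.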

xsl = """<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                                       xmlns:edgar="http://www.sec.gov/edgar/nport">
    <xsl:output method="xml" indent="yes" />
    <xsl:strip-space elements="*"/>

    <xsl:template match="@*|node()">
        <xsl:copy>
            <xsl:apply-templates select="@*|node()"/>
        </xsl:copy>
    </xsl:template>

    <xsl:template match="edgar:invstOrSec">
        <xsl:copy>
            <xsl:apply-templates select="*|*/*"/>
        </xsl:copy>
    </xsl:template>

</xsl:stylesheet>
"""

url = "https://www.sec.gov/Archives/edgar/data/1279392/000114554921008161/primary_doc.xml"
df = pd.read_xml(url, xpath="//edgar:invstOrSec", namespaces={"edgar": "http://www.sec.gov/edgar/nport"},
                 stylesheet=xsl)
print(df)
#                                                   name  lei                                              title      cusip  ...  annualizedRt  isDefault  areIntrstPmntsInArrs  isPaidKind
# 0                                       Tastemade Inc.  NaN                                     Tastemade Inc.  999999999  ...           NaN       None                  None        None
# 1    Regatta XV Funding Ltd., Subordinated Note, Pr...  NaN  Regatta XV Funding Ltd., Subordinated Note, Pr...  75888PAC7  ...        0.0624          N                     N           N
# 2                Hired, Inc., Series C Preferred Stock  NaN              Hired, Inc., Series C Preferred Stock        NaN  ...           NaN       None                  None        None
# 3                      WESTVIEW CAPITAL PARTNERS II LP  NaN                    WESTVIEW CAPITAL PARTNERS II LP  999999999  ...           NaN       None                  None        None
# 4                       VOYAGER CAPITAL FUND III, L.P.  NaN                     VOYAGER CAPITAL FUND III, L.P.  999999999  ...           NaN       None                  None        None
# ..                                                 ...  ...                                                ...        ...  ...           ...        ...                   ...         ...
# 158              ARCLIGHT ENERGY PARTNERS FUND V, L.P.  NaN              ARCLIGHT ENERGY PARTNERS FUND V, L.P.  999999999  ...           NaN       None                  None        None
# 159                       ALLOY MERCHANT PARTNERS L.P.  NaN                       ALLOY MERCHANT PARTNERS L.P.  999999999  ...           NaN       None                  None        None
# 160  ADVENT LATIN AMERICAN PRIVATE EQUITY FUND V-F ...  NaN  ADVENT LATIN AMERICAN PRIVATE EQUITY FUND V-F ...  999999999  ...           NaN       None                  None        None
# 161                   ABRY ADVANCED SECURITIES FUND LP  NaN                   ABRY ADVANCED SECURITIES FUND LP  999999999  ...           NaN       None                  None        None
# 162  ADVENT LATIN AMERICAN PRIVATE EQUITY FUND IV-F...  NaN  ADVENT LATIN AMERICAN PRIVATE EQUITY FUND IV-F...  999999999  ...           NaN       None                  None        None

# [163 rows x 32 columns]

Pandas < 1.3

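Getting the same result with the XPath approach takes more steps, in which you must handle the URL request and the XML parsing yourself to build the DataFrame. Specifically, create a list of dictionaries from the transformed, parsed XML and pass it to the DataFrame constructor. The code below uses the same XSLT and namespaced XPath as above.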

import lxml.etree as lx
import pandas as pd
import urllib.request as rq

url = "https://www.sec.gov/Archives/edgar/data/1279392/000114554921008161/primary_doc.xml"

xsl = """<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                                       xmlns:edgar="http://www.sec.gov/edgar/nport">
    <xsl:output method="xml" indent="yes" />
    <xsl:strip-space elements="*"/>

    <xsl:template match="@*|node()">
        <xsl:copy>
            <xsl:apply-templates select="@*|node()"/>
        </xsl:copy>
    </xsl:template>

    <xsl:template match="edgar:invstOrSec">
        <xsl:copy>
            <xsl:apply-templates select="*|*/*"/>
        </xsl:copy>
    </xsl:template>

</xsl:stylesheet>
"""

content = rq.urlopen(url)

# LOAD XML AND XSL
doc = lx.fromstring(content.read())
style = lx.fromstring(xsl)

# INITIALIZE AND TRANSFORM ORIGINAL DOC
transformer = lx.XSLT(style)
result = transformer(doc)

# RUN XPATH PARSING ON FLATTER XML
data = [{node.tag.split('}')[1]:node.text for node in inv.xpath("*")
        } for inv in result.xpath("//edgar:invstOrSec", 
                                 namespaces={"edgar": "http://www.sec.gov/edgar/nport"})]

# BIND DATA FOR DATA FRAME
df = pd.DataFrame(data)

print(df)
#                                                   name  lei                                              title  ... isDefault areIntrstPmntsInArrs  isPaidKind
# 0                                       Tastemade Inc.  N/A                                     Tastemade Inc.  ...       NaN                  NaN         NaN
# 1    Regatta XV Funding Ltd., Subordinated Note, Pr...  N/A  Regatta XV Funding Ltd., Subordinated Note, Pr...  ...         N                    N           N
# 2                Hired, Inc., Series C Preferred Stock  N/A              Hired, Inc., Series C Preferred Stock  ...       NaN                  NaN         NaN
# 3                      WESTVIEW CAPITAL PARTNERS II LP  N/A                    WESTVIEW CAPITAL PARTNERS II LP  ...       NaN                  NaN         NaN
# 4                       VOYAGER CAPITAL FUND III, L.P.  N/A                     VOYAGER CAPITAL FUND III, L.P.  ...       NaN                  NaN         NaN
# ..                                                 ...  ...                                                ...  ...       ...                  ...         ...
# 158              ARCLIGHT ENERGY PARTNERS FUND V, L.P.  N/A              ARCLIGHT ENERGY PARTNERS FUND V, L.P.  ...       NaN                  NaN         NaN
# 159                       ALLOY MERCHANT PARTNERS L.P.  N/A                       ALLOY MERCHANT PARTNERS L.P.  ...       NaN                  NaN         NaN
# 160  ADVENT LATIN AMERICAN PRIVATE EQUITY FUND V-F ...  N/A  ADVENT LATIN AMERICAN PRIVATE EQUITY FUND V-F ...  ...       NaN                  NaN         NaN
# 161                   ABRY ADVANCED SECURITIES FUND LP  N/A                   ABRY ADVANCED SECURITIES FUND LP  ...       NaN                  NaN         NaN
# 162  ADVENT LATIN AMERICAN PRIVATE EQUITY FUND IV-F...  N/A  ADVENT LATIN AMERICAN PRIVATE EQUITY FUND IV-F...  ...       NaN                  NaN         NaN

# [163 rows x 32 columns]
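
The `node.tag.split('}')[1]` expression above may look cryptic, so here is a small standard-library sketch of what it does. The XML fragment is made up and only reuses the namespace URI from the filings; ElementTree is used here in place of lxml, which reports namespaced tags in the same notation.

```python
import xml.etree.ElementTree as ET

xml = """<invstOrSec xmlns="http://www.sec.gov/edgar/nport">
  <name>Example Fund</name>
  <cusip>999999999</cusip>
</invstOrSec>"""

node = ET.fromstring(xml)

# Namespaced tags are reported in Clark notation: {namespace-uri}localname.
print(node[0].tag)

# Splitting on "}" keeps only the local name, which becomes the dict key.
record = {child.tag.split("}")[1]: child.text for child in node}
print(record)
```

One such dictionary per 'invstOrSec' node gives the list of records that the DataFrame constructor turns into rows.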