Pandas read XML not working properly for single-tag XML
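I am using the pandas_read_xml package to read XML files and process them into pandas DataFrames. In the vast majority of cases the package works absolutely fine for my purposes. However, when reading a URL whose XML contains only a single tag, the DataFrame output is a bit off. Let me illustrate this with the following two examples.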
# Import package
import pandas_read_xml as pdx
from pandas_read_xml import fully_flatten
# Example 1
url_1 = 'https://www.sec.gov/Archives/edgar/data/1279392/000114554921008161/primary_doc.xml'
df_1 = pdx.read_xml(url_1,['edgarSubmission', 'formData','invstOrSecs', 'invstOrSec'])
df_1 = pdx.fully_flatten(df_1)
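The resulting df_1 contains 163 rows and 31 columns, where each row corresponds to a unique security. This is the result I want. However, the output is a bit odd when I read an XML in which the tag 'invstOrSec' occurs only once.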
# Example 2
url_2 = 'https://www.sec.gov/Archives/edgar/data/1279394/000114554921008162/primary_doc.xml'
df_2 = pdx.read_xml(url_2,['edgarSubmission', 'formData','invstOrSecs', 'invstOrSec'])
df_2 = pdx.fully_flatten(df_2)
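The resulting df_2 contains 6 rows and 19 columns. I really cannot understand why it contains 6 rows when it should in fact be 1 row. I have observed that this behaviour occurs only when the tag 'invstOrSec' appears exactly once. Any help with this would be much appreciated. Please let me know if my question is unclear.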
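First of all, thanks for the feedback! I wrote pandas-read-xml because pandas did not have a pd.read_xml() implementation. You (and the rest of us) will be glad to know that a development version of pandas read_xml is coming soon! (https://pandas.pydata.org/docs/dev/reference/api/pandas.read_xml.html)
As for your current problem, it is a consequence of how XML is structured (and one of the many things I dislike about it). Unlike JSON, which can return a single element inside a list, an XML structure with only one occurrence of a tag is interpreted as a single value rather than a list.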
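The list-versus-single-value behaviour can be reproduced with a small standard-library sketch. The function element_to_value below is a hypothetical, simplified stand-in for what dict-based XML converters do; it is not the code used by pandas_read_xml or xmltodict, and the inline XML is made up for illustration.

```python
import xml.etree.ElementTree as ET

def element_to_value(elem):
    """Convert an element to its text (leaf) or to a dict of its children."""
    children = list(elem)
    if not children:
        return elem.text
    result = {}
    for child in children:
        value = element_to_value(child)
        if child.tag in result:
            # A repeated tag is promoted to a list on its second occurrence.
            if not isinstance(result[child.tag], list):
                result[child.tag] = [result[child.tag]]
            result[child.tag].append(value)
        else:
            # A tag seen once stays a plain value: nothing marks it as a list.
            result[child.tag] = value
    return result

many = ET.fromstring(
    "<invstOrSecs>"
    "<invstOrSec><name>A</name></invstOrSec>"
    "<invstOrSec><name>B</name></invstOrSec>"
    "</invstOrSecs>"
)
single = ET.fromstring(
    "<invstOrSecs><invstOrSec><name>A</name></invstOrSec></invstOrSecs>"
)

many_value = element_to_value(many)["invstOrSec"]
single_value = element_to_value(single)["invstOrSec"]
print(type(many_value).__name__)    # list
print(type(single_value).__name__)  # dict
```

With two 'invstOrSec' tags the converter yields a list of records, with one it yields a bare dict, which is why downstream code sees two different shapes.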
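Basically, if there is only one "row" tag, then the "column" tags are now treated as rows... I'm not making much sense, am I? Let me explain using your example.
Here is how I suggest you use it: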
# Import package
import pandas_read_xml as pdx
from pandas_read_xml import fully_flatten
# Example 1
url_1 = 'https://www.sec.gov/Archives/edgar/data/1279392/000114554921008161/primary_doc.xml'
df_1 = pdx.read_xml(url_1,['edgarSubmission', 'formData','invstOrSecs', 'invstOrSec']).pipe(fully_flatten)
# Example 2
url_2 = "https://www.sec.gov/Archives/edgar/data/1279394/000114554921008162/primary_doc.xml"
df_2 = pdx.read_xml(url_2,['edgarSubmission', 'formData', 'invstOrSecs'], transpose=True).pipe(fully_flatten)
df_2
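What's the difference?
In Example 1, you already expect multiple "row" tags inside the parent tag.
So, behind the scenes, passing root_tag_list=['edgarSubmission', 'formData', 'invstOrSecs', 'invstOrSec'] returns a list. The fully_flatten step then first explodes the list into rows.
In Example 2, if you use the same root_tag_list, pandas does not read in a list. Instead, it reads a dictionary corresponding to a single row. In effect, it treats the tags that were intended as columns as rows. So instead, I would pass the tag one level above as the root tag, transpose the result, and then fully_flatten.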
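The effect of the transpose can be sketched with plain pandas on a hand-written dictionary. This is only an illustration of the shape problem under the assumption that the single record arrives as a nested dict; the field values are made up, and it does not claim to mirror the internals of pandas_read_xml.

```python
import pandas as pd

# Hypothetical parsed content of an <invstOrSecs> holding a single <invstOrSec>.
single = {"invstOrSec": {"name": "Fund A", "cusip": "999999999", "invCountry": "US"}}

# Read as-is, the intended column tags land in the index: 3 rows, 1 column.
as_read = pd.DataFrame(single)

# Transposing turns them back into columns: 1 row, 3 columns.
fixed = as_read.T

print(as_read.shape)  # (3, 1)
print(fixed.shape)    # (1, 3)
```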
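Yes... I know... it's a workaround. But then again, I did not create pandas-read-xml hoping to solve every problem. It was always meant as a stopgap until pandas natively supports reading XML (which looks to be coming soon).
Let me know how it goes!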
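Edit:
As for making the XML-to-DataFrame conversion work regardless of whether the XML has a single "row" tag or several, I have the following two options.
In the multi-row case, the resulting DataFrame has an integer index (row numbers), whereas in the single-row case the DataFrame index consists of strings, the names that were meant to be columns. So one strategy is to detect this and redo the read accordingly. (You can avoid the repeated download with a smarter approach.)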
# Import package
import pandas as pd
import pandas_read_xml as pdx
from pandas_read_xml import fully_flatten
# Example 3
dfs = []
url_components = ['1279392/000114554921008161', '1279394/000114554921008162']
for url_component in url_components:
    url = f'https://www.sec.gov/Archives/edgar/data/{url_component}/primary_doc.xml'
    temp = pdx.read_xml(url, ['edgarSubmission', 'formData', 'invstOrSecs'])
    if 0 not in temp.index:
        # Single "row": the index holds tag names, so re-read transposed.
        temp = pdx.read_xml(url, ['edgarSubmission', 'formData', 'invstOrSecs'], transpose=True)
    else:
        # Multiple "rows": read down to the repeated tag.
        temp = pdx.read_xml(url, ['edgarSubmission', 'formData', 'invstOrSecs', 'invstOrSec'])
    dfs.append(temp)
df = pd.concat(dfs, ignore_index=True).pipe(fully_flatten)
df
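One way to avoid the second download is to normalize the already parsed value instead of re-reading the URL. The helper as_records below is a hypothetical name, and the inputs are hand-written stand-ins for what a dict-based parser might return for the 'invstOrSec' key; treat it as a sketch rather than part of pandas_read_xml.

```python
def as_records(value):
    """Return a list of record dicts whether `value` is one record or many."""
    if value is None:
        return []
    if isinstance(value, dict):
        # A single "row" arrives as a bare dict: wrap it in a list.
        return [value]
    return list(value)

# Hypothetical parsed values for the 'invstOrSec' key in the two situations.
many = [{"name": "A"}, {"name": "B"}]
single = {"name": "C"}

rows = as_records(many) + as_records(single)
print(len(rows))  # 3
```

The combined rows can then be passed to pd.DataFrame(rows). If you parse with xmltodict directly, its parse function also accepts a force_list argument that is intended for this situation.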
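Another option is to use the underlying tools. There is no magic behind pandas_read_xml; it uses a package called xmltodict. Read the XML, convert it to a dictionary, then to pandas, then flatten. The only drawback is that, because the name of the tag 'invstOrSec' is retained, it becomes a prefix on the column names. You should be able to remove that easily.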
# Import package
import pandas as pd
import pandas_read_xml as pdx
import xmltodict
from pandas_read_xml import fully_flatten
# Example 4
url_components = ['1279392/000114554921008161', '1279394/000114554921008162']
xmldicts = []
for url_component in url_components:
    url = f'https://www.sec.gov/Archives/edgar/data/{url_component}/primary_doc.xml'
    xml = pdx.read_xml_from_url(url)
    # Keep the 'invstOrSecs' level so one or many 'invstOrSec' entries are handled alike.
    xmldicts.append(xmltodict.parse(xml)['edgarSubmission']['formData']['invstOrSecs'])
df = pd.DataFrame.from_dict(xmldicts).pipe(fully_flatten)
df
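Removing the retained prefix could look like the sketch below. The separator "|" and the sample column names are assumptions made for illustration (check df.columns for the actual form), and strip_prefix is a hypothetical helper.

```python
def strip_prefix(columns, prefix, sep="|"):
    """Map each column name to the same name without a leading `prefix + sep`."""
    lead = prefix + sep
    return {c: c[len(lead):] if c.startswith(lead) else c for c in columns}

# Hypothetical flattened column names; the real separator may differ.
columns = ["invstOrSec|name", "invstOrSec|cusip", "other"]
mapping = strip_prefix(columns, "invstOrSec")
print(mapping)
# Apply to a DataFrame with: df = df.rename(columns=mapping)
```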
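Hope this helps!
Edit:
So, I have updated the package (now version 0.2.0). pandas_read_xml should now, by default, treat the root tag as the rows of the resulting pandas DataFrame, so there is no need to distinguish between XMLs that sometimes have a single "row" and sometimes have multiple rows.
If this turns out to be a problem in other cases, there is a new parameter, root_is_rows, which defaults to True but can be set to False.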
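In fact, in the upcoming Pandas 1.3, read_xml will allow you to migrate parsed nodes into a DataFrame. However, because XML can have many dimensions beyond the two dimensions of rows and columns, as noted:
This method is best designed to import shallow XML documents
Therefore, nested elements are not picked up immediately, as shown here with about 20 columns. Note the need for the namespaces argument because of the default namespace in the document.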
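Why a prefix is needed for a default namespace can be seen on a small inline document using the standard library's ElementTree. The XML below is a made-up miniature that only borrows the namespace URI and tag names from the filings; it is a sketch of the general XPath behaviour, not of read_xml itself.

```python
import xml.etree.ElementTree as ET

xml = (
    '<edgarSubmission xmlns="http://www.sec.gov/edgar/nport">'
    "<formData><invstOrSecs>"
    "<invstOrSec><name>Fund A</name></invstOrSec>"
    "</invstOrSecs></formData>"
    "</edgarSubmission>"
)
root = ET.fromstring(xml)

# Any prefix works as long as it maps to the document's namespace URI.
ns = {"edgar": "http://www.sec.gov/edgar/nport"}

# Without a prefix the path looks for un-namespaced tags and finds nothing.
without_prefix = root.findall(".//invstOrSec")
with_prefix = root.findall(".//edgar:invstOrSec", ns)
print(len(without_prefix), len(with_prefix))  # 0 1
```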
Pandas 1.3+
url = "https://www.sec.gov/Archives/edgar/data/1279392/000114554921008161/primary_doc.xml"
df = pd.read_xml(url, xpath="//edgar:invstOrSec",
namespaces={"edgar": "http://www.sec.gov/edgar/nport"})
print(df)
# name lei title cusip ... fairValLevel securityLending assetCat debtSec
# 0 Tastemade Inc. NaN Tastemade Inc. 999999999 ... 3.0 NaN None NaN
# 1 Regatta XV Funding Ltd., Subordinated Note, Pr... NaN Regatta XV Funding Ltd., Subordinated Note, Pr... 75888PAC7 ... 2.0 NaN ABS-CBDO NaN
# 2 Hired, Inc., Series C Preferred Stock NaN Hired, Inc., Series C Preferred Stock NaN ... 3.0 NaN EP NaN
# 3 WESTVIEW CAPITAL PARTNERS II LP NaN WESTVIEW CAPITAL PARTNERS II LP 999999999 ... NaN NaN None NaN
# 4 VOYAGER CAPITAL FUND III, L.P. NaN VOYAGER CAPITAL FUND III, L.P. 999999999 ... NaN NaN None NaN
# .. ... ... ... ... ... ... ... ... ...
# 158 ARCLIGHT ENERGY PARTNERS FUND V, L.P. NaN ARCLIGHT ENERGY PARTNERS FUND V, L.P. 999999999 ... NaN NaN None NaN
# 159 ALLOY MERCHANT PARTNERS L.P. NaN ALLOY MERCHANT PARTNERS L.P. 999999999 ... NaN NaN None NaN
# 160 ADVENT LATIN AMERICAN PRIVATE EQUITY FUND V-F ... NaN ADVENT LATIN AMERICAN PRIVATE EQUITY FUND V-F ... 999999999 ... NaN NaN None NaN
# 161 ABRY ADVANCED SECURITIES FUND LP NaN ABRY ADVANCED SECURITIES FUND LP 999999999 ... NaN NaN None NaN
# 162 ADVENT LATIN AMERICAN PRIVATE EQUITY FUND IV-F... NaN ADVENT LATIN AMERICAN PRIVATE EQUITY FUND IV-F... 999999999 ... NaN NaN None NaN
# [163 rows x 20 columns]
url = "https://www.sec.gov/Archives/edgar/data/1279394/000114554921008162/primary_doc.xml"
df = pd.read_xml(url, xpath="//edgar:invstOrSec",
namespaces={"edgar": "http://www.sec.gov/edgar/nport"})
print(df)
# name lei title cusip ... invCountry isRestrictedSec fairValLevel securityLending
# 0 Salient Private Access Master Fund, L.P. NaN Salient Private Access Master Fund, L.P. 999999999 ... US Y NaN NaN
# [1 rows x 18 columns]
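Fortunately, read_xml with the default lxml parser supports XSLT (a special-purpose language designed to transform XML documents). With XSLT, you can flatten the nodes needed for the migration to retrieve 32 columns.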
xsl = """<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
xmlns:edgar="http://www.sec.gov/edgar/nport">
<xsl:output method="xml" indent="yes" />
<xsl:strip-space elements="*"/>
<xsl:template match="@*|node()">
<xsl:copy>
<xsl:apply-templates select="@*|node()"/>
</xsl:copy>
</xsl:template>
<xsl:template match="edgar:invstOrSec">
<xsl:copy>
<xsl:apply-templates select="*|*/*"/>
</xsl:copy>
</xsl:template>
</xsl:stylesheet>
"""
url = "https://www.sec.gov/Archives/edgar/data/1279392/000114554921008161/primary_doc.xml"
df = pd.read_xml(url, xpath="//edgar:invstOrSec", namespaces={"edgar": "http://www.sec.gov/edgar/nport"},
stylesheet=xsl)
print(df)
# name lei title cusip ... annualizedRt isDefault areIntrstPmntsInArrs isPaidKind
# 0 Tastemade Inc. NaN Tastemade Inc. 999999999 ... NaN None None None
# 1 Regatta XV Funding Ltd., Subordinated Note, Pr... NaN Regatta XV Funding Ltd., Subordinated Note, Pr... 75888PAC7 ... 0.0624 N N N
# 2 Hired, Inc., Series C Preferred Stock NaN Hired, Inc., Series C Preferred Stock NaN ... NaN None None None
# 3 WESTVIEW CAPITAL PARTNERS II LP NaN WESTVIEW CAPITAL PARTNERS II LP 999999999 ... NaN None None None
# 4 VOYAGER CAPITAL FUND III, L.P. NaN VOYAGER CAPITAL FUND III, L.P. 999999999 ... NaN None None None
# .. ... ... ... ... ... ... ... ... ...
# 158 ARCLIGHT ENERGY PARTNERS FUND V, L.P. NaN ARCLIGHT ENERGY PARTNERS FUND V, L.P. 999999999 ... NaN None None None
# 159 ALLOY MERCHANT PARTNERS L.P. NaN ALLOY MERCHANT PARTNERS L.P. 999999999 ... NaN None None None
# 160 ADVENT LATIN AMERICAN PRIVATE EQUITY FUND V-F ... NaN ADVENT LATIN AMERICAN PRIVATE EQUITY FUND V-F ... 999999999 ... NaN None None None
# 161 ABRY ADVANCED SECURITIES FUND LP NaN ABRY ADVANCED SECURITIES FUND LP 999999999 ... NaN None None None
# 162 ADVENT LATIN AMERICAN PRIVATE EQUITY FUND IV-F... NaN ADVENT LATIN AMERICAN PRIVATE EQUITY FUND IV-F... 999999999 ... NaN None None None
# [163 rows x 32 columns]
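For intuition about what the `*|*/*` selection in the stylesheet achieves, here is a rough standard-library approximation that lifts grandchildren up next to the children of each record. It is a swapped-in technique, not XSLT: flatten_two_levels is a hypothetical helper, the inline XML is made up (namespace omitted for brevity), and unlike the stylesheet it drops the parent container element instead of keeping it as a column.

```python
import xml.etree.ElementTree as ET

xml = (
    "<invstOrSecs>"
    "<invstOrSec>"
    "<name>Note A</name>"
    "<debtSec><annualizedRt>0.0624</annualizedRt><isDefault>N</isDefault></debtSec>"
    "</invstOrSec>"
    "</invstOrSecs>"
)

def flatten_two_levels(record):
    """Collect the text of leaf children and of grandchildren into one flat dict."""
    row = {}
    for child in record:
        grandchildren = list(child)
        if grandchildren:
            # Nested element: lift its children to the record level.
            for grandchild in grandchildren:
                row[grandchild.tag] = grandchild.text
        else:
            row[child.tag] = child.text
    return row

rows = [flatten_two_levels(r) for r in ET.fromstring(xml).findall("invstOrSec")]
print(rows)
```

Each resulting dict is one flat row, so the list can be handed to the DataFrame constructor in the same way as the list of dictionaries built for older pandas versions.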
Pandas < 1.3
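Achieving the same result through the XPath approach takes more steps, in which you must handle the URL request and the XML parsing yourself to build the DataFrame. Specifically, create a list of dictionaries from the transformed, parsed XML and pass it to the DataFrame constructor. The code below uses the same XSLT and namespaced XPath as above.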
import lxml.etree as lx
import pandas as pd
import urllib.request as rq
url = "https://www.sec.gov/Archives/edgar/data/1279392/000114554921008161/primary_doc.xml"
xsl = """<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
xmlns:edgar="http://www.sec.gov/edgar/nport">
<xsl:output method="xml" indent="yes" />
<xsl:strip-space elements="*"/>
<xsl:template match="@*|node()">
<xsl:copy>
<xsl:apply-templates select="@*|node()"/>
</xsl:copy>
</xsl:template>
<xsl:template match="edgar:invstOrSec">
<xsl:copy>
<xsl:apply-templates select="*|*/*"/>
</xsl:copy>
</xsl:template>
</xsl:stylesheet>
"""
content = rq.urlopen(url)
# LOAD XML AND XSL
doc = lx.fromstring(content.read())
style = lx.fromstring(xsl)
# INITIALIZE AND TRANSFORM ORIGINAL DOC
transformer = lx.XSLT(style)
result = transformer(doc)
# RUN XPATH PARSING ON FLATTER XML
data = [{node.tag.split('}')[1]:node.text for node in inv.xpath("*")
} for inv in result.xpath("//edgar:invstOrSec",
namespaces={"edgar": "http://www.sec.gov/edgar/nport"})]
# BIND DATA FOR DATA FRAME
df = pd.DataFrame(data)
print(df)
# name lei title ... isDefault areIntrstPmntsInArrs isPaidKind
# 0 Tastemade Inc. N/A Tastemade Inc. ... NaN NaN NaN
# 1 Regatta XV Funding Ltd., Subordinated Note, Pr... N/A Regatta XV Funding Ltd., Subordinated Note, Pr... ... N N N
# 2 Hired, Inc., Series C Preferred Stock N/A Hired, Inc., Series C Preferred Stock ... NaN NaN NaN
# 3 WESTVIEW CAPITAL PARTNERS II LP N/A WESTVIEW CAPITAL PARTNERS II LP ... NaN NaN NaN
# 4 VOYAGER CAPITAL FUND III, L.P. N/A VOYAGER CAPITAL FUND III, L.P. ... NaN NaN NaN
# .. ... ... ... ... ... ... ...
# 158 ARCLIGHT ENERGY PARTNERS FUND V, L.P. N/A ARCLIGHT ENERGY PARTNERS FUND V, L.P. ... NaN NaN NaN
# 159 ALLOY MERCHANT PARTNERS L.P. N/A ALLOY MERCHANT PARTNERS L.P. ... NaN NaN NaN
# 160 ADVENT LATIN AMERICAN PRIVATE EQUITY FUND V-F ... N/A ADVENT LATIN AMERICAN PRIVATE EQUITY FUND V-F ... ... NaN NaN NaN
# 161 ABRY ADVANCED SECURITIES FUND LP N/A ABRY ADVANCED SECURITIES FUND LP ... NaN NaN NaN
# 162 ADVENT LATIN AMERICAN PRIVATE EQUITY FUND IV-F... N/A ADVENT LATIN AMERICAN PRIVATE EQUITY FUND IV-F... ... NaN NaN NaN
# [163 rows x 32 columns]