Parse XML to a data frame, including children and attributes, in R
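I am trying to create a data frame from the attached XML file:
https://1drv.ms/u/s!Am7buNMZi-gwgeBmbk6A-NRIRarjYw?e=Pcgm7c
For every player, I need columns with the player's own information plus the information about the (parent) team.
XML sample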
<SoccerFeed timestamp="20190519T183022+0000">
<SoccerDocument Type="SQUADS Latest" competition_code="ES_PL" competition_id="23" competition_name="Spanish La Liga" season_id="2018" season_name="Season 2018/2019">
<Team country="Spain" country_id="4" country_iso="ES" official_club_name="Deportivo Alavés S.A.D." region_id="17" region_name="Europe" short_club_name="Alavés" uID="t173">
<Founded>1921</Founded>
<Name>Alavés</Name>
<Player uID="p91406">
<Name>Fernando Pacheco</Name>
<Position>Goalkeeper</Position>
<Stat Type="first_name">Fernando</Stat>
<Stat Type="last_name">Pacheco</Stat>
<Stat Type="birth_date">1992-05-18</Stat>
<Stat Type="birth_place">Badajoz</Stat>
<Stat Type="first_nationality">Spain</Stat>
<Stat Type="preferred_foot">Left</Stat>
<Stat Type="weight">81</Stat>
<Stat Type="height">186</Stat>
<Stat Type="jersey_num">1</Stat>
<Stat Type="real_position">Goalkeeper</Stat>
<Stat Type="real_position_side">Unknown</Stat>
<Stat Type="join_date">2015-08-07</Stat>
<Stat Type="country">Spain</Stat>
</Player>
<Player uID="p176245">
<Name>Antonio Sivera</Name>
<Position>Goalkeeper</Position>
<Stat Type="first_name">Antonio</Stat>
<Stat Type="last_name">Sivera</Stat>
<Stat Type="birth_date">1996-08-11</Stat>
<Stat Type="birth_place">Jávea</Stat>
<Stat Type="first_nationality">Spain</Stat>
<Stat Type="preferred_foot">Right</Stat>
<Stat Type="weight">75</Stat>
<Stat Type="height">184</Stat>
<Stat Type="jersey_num">13</Stat>
<Stat Type="real_position">Goalkeeper</Stat>
<Stat Type="real_position_side">Unknown</Stat>
<Stat Type="join_date">2017-07-19</Stat>
<Stat Type="country">Spain</Stat>
</Player>
</Team>
<Team city="Madrid" country="Spain" country_id="4" country_iso="ES" official_club_name="Club Atlético de Madrid S.A.D" postal_code="28005" region_id="17" region_name="Europe" short_club_name="Atlético" street="Paseo Virgen del Puerto, 67" uID="t175" web_address="www.clubatleticodemadrid.com/">
<Founded>1903</Founded>
<Name>Atlético de Madrid</Name>
<Player uID="p59981">
<Name>Antonio Adán</Name>
<Position>Goalkeeper</Position>
<Stat Type="first_name">Antonio</Stat>
<Stat Type="last_name">Adán</Stat>
<Stat Type="birth_date">1987-05-13</Stat>
<Stat Type="birth_place">Madrid</Stat>
<Stat Type="first_nationality">Spain</Stat>
<Stat Type="preferred_foot">Left</Stat>
<Stat Type="weight">92</Stat>
<Stat Type="height">190</Stat>
<Stat Type="jersey_num">1</Stat>
<Stat Type="real_position">Goalkeeper</Stat>
<Stat Type="real_position_side">Unknown</Stat>
<Stat Type="join_date">2018-07-10</Stat>
<Stat Type="country">Spain</Stat>
</Player>
<Player uID="p81352">
<Name>Jan Oblak</Name>
<Position>Goalkeeper</Position>
<Stat Type="first_name">Jan</Stat>
<Stat Type="last_name">Oblak</Stat>
<Stat Type="birth_date">1993-01-07</Stat>
<Stat Type="birth_place">Skojfa Loka</Stat>
<Stat Type="first_nationality">Slovenia</Stat>
<Stat Type="preferred_foot">Right</Stat>
<Stat Type="weight">87</Stat>
<Stat Type="height">188</Stat>
<Stat Type="jersey_num">13</Stat>
<Stat Type="real_position">Goalkeeper</Stat>
<Stat Type="real_position_side">Unknown</Stat>
<Stat Type="join_date">2014-07-16</Stat>
<Stat Type="country">Slovenia</Stat>
</Player>
</Team>
</SoccerDocument>
</SoccerFeed>
Columns I want
Team columns
- country (SoccerFeed/SoccerDocument/Team attribute)
- country_id (SoccerFeed/SoccerDocument/Team attribute)
- country_iso (SoccerFeed/SoccerDocument/Team attribute)
- official_club_name (SoccerFeed/SoccerDocument/Team attribute)
- region_id (SoccerFeed/SoccerDocument/Team attribute)
- region_name (SoccerFeed/SoccerDocument/Team attribute)
- short_club_name (SoccerFeed/SoccerDocument/Team attribute)
- team_uID (SoccerFeed/SoccerDocument/Team attribute uID)
- team_name (SoccerFeed/SoccerDocument/Team/Name)
- team_founded (SoccerFeed/SoccerDocument/Team/Founded)
Player columns
- player_uID (/SoccerFeed/SoccerDocument/Team/Player attribute uID)
- player_name (/SoccerFeed/SoccerDocument/Team/Player/Name)
- player_position (/SoccerFeed/SoccerDocument/Team/Player/Position)
- player_first_name (/SoccerFeed/SoccerDocument/Team/Player/Stat Type = first_name)
- player_last_name (/SoccerFeed/SoccerDocument/Team/Player/Stat Type = last_name)
- player_birth_place (/SoccerFeed/SoccerDocument/Team/Player/Stat Type = birth_place)
- player_preferred_foot (/SoccerFeed/SoccerDocument/Team/Player/Stat Type = preferred_foot)
- ... the other player stats (weight, height, jersey_num, ... country)
I am not interested in the Team and Player nodes under the /SoccerFeed/SoccerDocument/PlayerChanges section.
I started with xml2 combined with tidyverse and can collect the players' information, but I cannot get the parent team's information or the players' individual stats.
library(xml2)
library(tidyverse)
library(plyr)

x <- read_xml("squads.xml")

players <- x %>%
  xml_find_all('/SoccerFeed/SoccerDocument/Team/Player') %>%
  map_df(~flatten(c(xml_attrs(.x),
                    map(xml_children(.x),
                        ~set_names(as.list(xml_text(.x)), xml_name(.x)))))) %>%
  type_convert()
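Since you use xml2 and need various data nodes at different nesting levels, consider XSLT, the special-purpose language (like SQL) designed to transform XML files. In R, the xslt package, a sister module to xml2, can run XSLT 1.0 scripts. XSLT's recursive, template-based nature helps you avoid complex nested loops or mapping at the application layer, here being R. In addition, XSLT is portable (like SQL) and can be run outside of R.
While this may be a brand-new concept with a learning curve, it can flatten your XML completely into the two-dimensional structure a data set needs. You also separate XML handling (XSLT) from data handling (R). Specifically, only the Player level is retained, and the corresponding Team data is migrated down to it.
XSLT (save as .xsl, a special .xml file)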
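<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output indent="yes"/>
  <xsl:strip-space elements="*"/>

  <!-- Walk down from the root to the Team level -->
  <xsl:template match="/SoccerFeed|SoccerDocument">
    <xsl:apply-templates select="*"/>
  </xsl:template>

  <!-- Skip the PlayerChanges section; without this, the built-in rules
       would descend into it and emit its Team/Player nodes as well -->
  <xsl:template match="PlayerChanges"/>

  <!-- Only the Player level is kept -->
  <xsl:template match="Team">
    <xsl:apply-templates select="Player"/>
  </xsl:template>

  <!-- Each Team attribute becomes a team_* element -->
  <xsl:template match="Team/@*">
    <xsl:element name="{concat('team_', name(.))}">
      <xsl:value-of select="."/>
    </xsl:element>
  </xsl:template>

  <!-- One flat Player record: team attributes, Name, Position, uID, stats -->
  <xsl:template match="Player">
    <xsl:copy>
      <xsl:apply-templates select="ancestor::Team/@*"/>
      <xsl:copy-of select="Name|Position"/>
      <xsl:apply-templates select="@*|Stat"/>
    </xsl:copy>
  </xsl:template>

  <!-- Each Player attribute becomes an element of the same name -->
  <xsl:template match="Player/@*">
    <xsl:element name="{name(.)}">
      <xsl:value-of select="."/>
    </xsl:element>
  </xsl:template>

  <!-- Each Stat becomes an element named after its Type attribute -->
  <xsl:template match="Stat">
    <xsl:element name="{@Type}">
      <xsl:value-of select="text()"/>
    </xsl:element>
  </xsl:template>
</xsl:stylesheet>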
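To see what a flattened Player record looks like before wiring the stylesheet into a full script, the transformation can be tried on a small inline document. This is only a sketch: the input is a hypothetical, trimmed-down version of the sample (one team, one player, two stats), the stylesheet is inlined as a raw string (which assumes R 4.0 or later), and the object names are illustrative.

```r
library(xml2)
library(xslt)

# Hypothetical, trimmed-down input: one team, one player, two stats
doc <- read_xml('<SoccerFeed>
  <SoccerDocument>
    <Team country="Spain" uID="t173">
      <Player uID="p91406">
        <Name>Fernando Pacheco</Name>
        <Position>Goalkeeper</Position>
        <Stat Type="first_name">Fernando</Stat>
        <Stat Type="weight">81</Stat>
      </Player>
    </Team>
  </SoccerDocument>
</SoccerFeed>')

# Stylesheet mirroring the templates above, inlined as a raw string (R >= 4.0)
style <- read_xml(r"[<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output indent="yes"/>
  <xsl:strip-space elements="*"/>
  <xsl:template match="/SoccerFeed|SoccerDocument">
    <xsl:apply-templates select="*"/>
  </xsl:template>
  <xsl:template match="Team">
    <xsl:apply-templates select="Player"/>
  </xsl:template>
  <xsl:template match="Team/@*">
    <xsl:element name="{concat('team_', name(.))}">
      <xsl:value-of select="."/>
    </xsl:element>
  </xsl:template>
  <xsl:template match="Player">
    <xsl:copy>
      <xsl:apply-templates select="ancestor::Team/@*"/>
      <xsl:copy-of select="Name|Position"/>
      <xsl:apply-templates select="@*|Stat"/>
    </xsl:copy>
  </xsl:template>
  <xsl:template match="Player/@*">
    <xsl:element name="{name(.)}">
      <xsl:value-of select="."/>
    </xsl:element>
  </xsl:template>
  <xsl:template match="Stat">
    <xsl:element name="{@Type}">
      <xsl:value-of select="text()"/>
    </xsl:element>
  </xsl:template>
</xsl:stylesheet>]")

# Run the transformation and inspect the single flattened Player record
flat <- xml_xslt(doc, style)
player <- xml_find_first(flat, "//Player")
child_names <- xml_name(xml_children(player))

print(child_names)        # one element per future data frame column
cat(as.character(flat))   # the transformed XML itself
```

Every child element of the resulting Player node corresponds to one column of the final data frame, which is why the R step only has to read child names and text.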
R (yields a data frame in which every column is of character type)
library(xml2)
library(xslt)
library(dplyr)

# INPUT SOURCE
doc <- read_xml("/path/to/Input.xml")
style <- read_xml("/path/to/Style.xsl")

# TRANSFORM
new_xml <- xml_xslt(doc, style)

# RETRIEVE Player NODES
recs <- xml_find_all(new_xml, "//Player")

# BIND EACH CHILD TEXT AND NAME TO Player DFs
df_list <- lapply(recs, function(r)
  data.frame(rbind(setNames(xml_text(xml_children(r)),
                            xml_name(xml_children(r)))),
             stringsAsFactors = FALSE)
)

# BIND ALL DFs TO SINGLE MASTER DF
final_df <- bind_rows(df_list)
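As a cross-check that does not need XSLT, the same flattening can be sketched with xml2 alone: start from each Player node, look up its parent Team with xml_parent(), and build one named record per player. The sketch below rests on a few assumptions: the inline document is a hypothetical, trimmed-down version of the sample, the team_ and player_ column prefixes follow the question's wish list rather than the stylesheet's output, and the final type.convert() call is just one way to turn the all-character columns into numbers where possible.

```r
library(xml2)
library(dplyr)

# Hypothetical, trimmed-down input: two teams with different attribute sets.
# &#233; is the XML character reference for e-acute, used to keep this ASCII.
doc <- read_xml('<SoccerFeed>
  <SoccerDocument>
    <Team country="Spain" short_club_name="Alav&#233;s" uID="t173">
      <Founded>1921</Founded>
      <Name>Alav&#233;s</Name>
      <Player uID="p91406">
        <Name>Fernando Pacheco</Name>
        <Position>Goalkeeper</Position>
        <Stat Type="first_name">Fernando</Stat>
        <Stat Type="weight">81</Stat>
      </Player>
    </Team>
    <Team city="Madrid" country="Spain" uID="t175">
      <Founded>1903</Founded>
      <Name>Atl&#233;tico de Madrid</Name>
      <Player uID="p81352">
        <Name>Jan Oblak</Name>
        <Position>Goalkeeper</Position>
        <Stat Type="first_name">Jan</Stat>
        <Stat Type="weight">87</Stat>
      </Player>
    </Team>
  </SoccerDocument>
</SoccerFeed>')

# The absolute path selects only players directly under SoccerDocument/Team,
# so anything under PlayerChanges is never visited
players <- xml_find_all(doc, "/SoccerFeed/SoccerDocument/Team/Player")

rows <- lapply(players, function(p) {
  team  <- xml_parent(p)
  stats <- xml_find_all(p, "./Stat")

  # Team attributes, prefixed so they cannot clash with player columns
  team_attrs <- xml_attrs(team)
  if (length(team_attrs) > 0) {
    names(team_attrs) <- paste0("team_", names(team_attrs))
  }

  rec <- c(
    team_attrs,
    team_name       = xml_text(xml_find_first(team, "./Name")),
    team_founded    = xml_text(xml_find_first(team, "./Founded")),
    player_uID      = xml_attr(p, "uID"),
    player_name     = xml_text(xml_find_first(p, "./Name")),
    player_position = xml_text(xml_find_first(p, "./Position")),
    # One column per Stat, named after its Type attribute
    setNames(xml_text(stats), paste0("player_", xml_attr(stats, "Type")))
  )
  as.data.frame(as.list(rec), stringsAsFactors = FALSE)
})

# bind_rows() fills columns that a team lacks (e.g. team_city) with NA
players_df <- bind_rows(rows)

# Convert numeric-looking columns; as.is = TRUE keeps text as character
players_df <- type.convert(players_df, as.is = TRUE)
str(players_df)
```

The same type.convert(final_df, as.is = TRUE) call could be applied to the data frame produced by the XSLT route, since that one also starts out with character columns only.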