来自多个 XML 个文件的数据框
Data frame from multiple XML files
我有一个包含多个 XML 文件的文件夹。所有文件都具有相同的基本结构。然而,每个文件实际上包含与分布在 16 个父节点中的单个实体相关的数据。有的节点有子节点,有的甚至有g运行dchildren,有的g运行dchildren.
我想从多个文件创建一个数据框,只选择 nodes/children/grandchildren。
作为第一步,我读取了一个 XML 文件作为列表。然后运行几行代码将需要的数据放到一个vector中。最终,将矢量转换为数据帧,就像我想要的那样。
此代码:
library(xml2)
library(tidyverse)
x = as_list(read_xml("ACTRN12605000003673.xml"))
tmp = c(ACTR_Number = as.character(x$ANZCTR_Trial$actrnumber),
primary_sponsor_type = as.character(x$ANZCTR_Trial$sponsorship$primarysponsortype),
primary_sponsor_name = as.character(x$ANZCTR_Trial$sponsorship$primarysponsorname),
primary_sponsor_address = as.character(x$ANZCTR_Trial$sponsorship$primarysponsoraddress),
primary_sponsor_country = as.character(x$ANZCTR_Trial$sponsorship$primarysponsorcountry),
funding_source_type = as.character(x$ANZCTR_Trial$sponsorship$fundingsource$fundingtype),
funding_source_name = as.character(x$ANZCTR_Trial$sponsorship$fundingsource$fundingname),
funding_source_address = as.character(x$ANZCTR_Trial$sponsorship$fundingsource$fundingaddress),
funding_source_country = as.character(x$ANZCTR_Trial$sponsorship$fundingsource$fundingcountry),
secondary_sponsor_type = as.character(x$ANZCTR_Trial$sponsorship$secondarysponsor$sponsortype),
secondary_sponsor_name = as.character(x$ANZCTR_Trial$sponsorship$secondarysponsor$sponsorname),
secondary_sponsor_address = as.character(x$ANZCTR_Trial$sponsorship$secondarysponsor$sponsoraddress),
secondary_sponsor_country = as.character(x$ANZCTR_Trial$sponsorship$secondarysponsor$sponsorcountry))
tmp = as.list(tmp)
tmp = as.data.frame(tmp)
对于下一步,我尝试同时处理 2 个 XML 文件。我尝试了以下代码同时读取两个文件。但是,除此之外,我不知道如何继续。
all_files = list.files(pattern=".xml", path = getwd(), full.names = TRUE)
x = lapply(all_files, read_xml)
class(x)
样本文件here
R 支持列表中的列表。当您定义一个列表时,它将包含您所有的 tmp 列表。然后将所有文件放入此列表后,您可以制作数据框或小标题。然后你的 tibble 包含带有列表(tmp)的列。然后您可以使用库中的地图函数解决它们(purrr)
我们可以将您的数据收集过程封装成一个函数。由于每个 XML 文件似乎代表所需输出数据框中的一行,因此我们可以使用 purrr::map_dfr
逐行构建数据。我稍微修改了你的代码。见下文:
get_row_data <- function(file_name) {
xml <- xml2::read_xml(file_name)
read_text <- \(x, xpath) rvest::html_elements(x, xpath = xpath) |> rvest::html_text(TRUE)
xpaths <- c(
ACTR_Number = "/ANZCTR_Trial/actrnumber",
primary_sponsor_type = "/ANZCTR_Trial/sponsorship/primarysponsortype",
primary_sponsor_name = "/ANZCTR_Trial/sponsorship/primarysponsorname",
primary_sponsor_address = "/ANZCTR_Trial/sponsorship/primarysponsoraddress",
primary_sponsor_country = "/ANZCTR_Trial/sponsorship/primarysponsorcountry",
funding_source_type = "/ANZCTR_Trial/sponsorship/fundingsource/fundingtype",
funding_source_name = "/ANZCTR_Trial/sponsorship/fundingsource/fundingname",
funding_source_address = "/ANZCTR_Trial/sponsorship/fundingsource/fundingaddress",
funding_source_country = "/ANZCTR_Trial/sponsorship/fundingsource/fundingcountry",
secondary_sponsor_type = "/ANZCTR_Trial/sponsorship/secondarysponsor/sponsortype",
secondary_sponsor_name = "/ANZCTR_Trial/sponsorship/secondarysponsor/sponsorname",
secondary_sponsor_address = "/ANZCTR_Trial/sponsorship/secondarysponsor/sponsoraddress",
secondary_sponsor_country = "/ANZCTR_Trial/sponsorship/secondarysponsor/sponsorcountry"
)
x <- read_text(xml, paste0(xpaths, collapse = " | "))
names(x) <- names(xpaths)
as_tibble(as.list(x))
}
all_files <- list.files(pattern = ".xml", path = getwd(), full.names = TRUE)
purrr::map_dfr(all_files, get_row_data)
输出
# A tibble: 2 x 13
ACTR_Number primary_sponsor_~ primary_sponsor_n~ primary_sponsor_~ primary_sponsor~ funding_source_~ funding_source_~
<chr> <chr> <chr> <chr> <chr> <chr> <chr>
1 ACTRN12605000003673 Hospital Barwon Health "272-322 Ryrie S~ Australia Commercial sect~ Astra Zeneca
2 ACTRN12605000025639 Other Collaborat~ Australian Gastro~ "88 Mallett St\n~ Australia Commercial sect~ Roche Products ~
# ... with 6 more variables: funding_source_address <chr>, funding_source_country <chr>, secondary_sponsor_type <chr>,
# secondary_sponsor_name <chr>, secondary_sponsor_address <chr>, secondary_sponsor_country <chr>
我有一个包含多个 XML 文件的文件夹。所有文件都具有相同的基本结构。然而,每个文件实际上包含与分布在 16 个父节点中的单个实体相关的数据。有的节点有子节点,有的甚至有g运行dchildren,有的g运行dchildren.
我想从多个文件创建一个数据框,只选择 nodes/children/grandchildren。
作为第一步,我读取了一个 XML 文件作为列表。然后运行几行代码将需要的数据放到一个vector中。最终,将矢量转换为数据帧,就像我想要的那样。
此代码:
library(xml2)
library(tidyverse)
x = as_list(read_xml("ACTRN12605000003673.xml"))
tmp = c(ACTR_Number = as.character(x$ANZCTR_Trial$actrnumber),
primary_sponsor_type = as.character(x$ANZCTR_Trial$sponsorship$primarysponsortype),
primary_sponsor_name = as.character(x$ANZCTR_Trial$sponsorship$primarysponsorname),
primary_sponsor_address = as.character(x$ANZCTR_Trial$sponsorship$primarysponsoraddress),
primary_sponsor_country = as.character(x$ANZCTR_Trial$sponsorship$primarysponsorcountry),
funding_source_type = as.character(x$ANZCTR_Trial$sponsorship$fundingsource$fundingtype),
funding_source_name = as.character(x$ANZCTR_Trial$sponsorship$fundingsource$fundingname),
funding_source_address = as.character(x$ANZCTR_Trial$sponsorship$fundingsource$fundingaddress),
funding_source_country = as.character(x$ANZCTR_Trial$sponsorship$fundingsource$fundingcountry),
secondary_sponsor_type = as.character(x$ANZCTR_Trial$sponsorship$secondarysponsor$sponsortype),
secondary_sponsor_name = as.character(x$ANZCTR_Trial$sponsorship$secondarysponsor$sponsorname),
secondary_sponsor_address = as.character(x$ANZCTR_Trial$sponsorship$secondarysponsor$sponsoraddress),
secondary_sponsor_country = as.character(x$ANZCTR_Trial$sponsorship$secondarysponsor$sponsorcountry))
tmp = as.list(tmp)
tmp = as.data.frame(tmp)
对于下一步,我尝试同时处理 2 个 XML 文件。我尝试了以下代码同时读取两个文件。但是,除此之外,我不知道如何继续。
all_files = list.files(pattern=".xml", path = getwd(), full.names = TRUE)
x = lapply(all_files, read_xml)
class(x)
样本文件here
R 支持列表中的列表。当您定义一个列表时,它将包含您所有的 tmp 列表。然后将所有文件放入此列表后,您可以制作数据框或小标题。然后你的 tibble 包含带有列表(tmp)的列。然后您可以使用库中的地图函数解决它们(purrr)
我们可以将您的数据收集过程封装成一个函数。由于每个 XML 文件似乎代表所需输出数据框中的一行,因此我们可以使用 purrr::map_dfr
逐行构建数据。我稍微修改了你的代码。见下文:
get_row_data <- function(file_name) {
xml <- xml2::read_xml(file_name)
read_text <- \(x, xpath) rvest::html_elements(x, xpath = xpath) |> rvest::html_text(TRUE)
xpaths <- c(
ACTR_Number = "/ANZCTR_Trial/actrnumber",
primary_sponsor_type = "/ANZCTR_Trial/sponsorship/primarysponsortype",
primary_sponsor_name = "/ANZCTR_Trial/sponsorship/primarysponsorname",
primary_sponsor_address = "/ANZCTR_Trial/sponsorship/primarysponsoraddress",
primary_sponsor_country = "/ANZCTR_Trial/sponsorship/primarysponsorcountry",
funding_source_type = "/ANZCTR_Trial/sponsorship/fundingsource/fundingtype",
funding_source_name = "/ANZCTR_Trial/sponsorship/fundingsource/fundingname",
funding_source_address = "/ANZCTR_Trial/sponsorship/fundingsource/fundingaddress",
funding_source_country = "/ANZCTR_Trial/sponsorship/fundingsource/fundingcountry",
secondary_sponsor_type = "/ANZCTR_Trial/sponsorship/secondarysponsor/sponsortype",
secondary_sponsor_name = "/ANZCTR_Trial/sponsorship/secondarysponsor/sponsorname",
secondary_sponsor_address = "/ANZCTR_Trial/sponsorship/secondarysponsor/sponsoraddress",
secondary_sponsor_country = "/ANZCTR_Trial/sponsorship/secondarysponsor/sponsorcountry"
)
x <- read_text(xml, paste0(xpaths, collapse = " | "))
names(x) <- names(xpaths)
as_tibble(as.list(x))
}
all_files <- list.files(pattern = ".xml", path = getwd(), full.names = TRUE)
purrr::map_dfr(all_files, get_row_data)
输出
# A tibble: 2 x 13
ACTR_Number primary_sponsor_~ primary_sponsor_n~ primary_sponsor_~ primary_sponsor~ funding_source_~ funding_source_~
<chr> <chr> <chr> <chr> <chr> <chr> <chr>
1 ACTRN12605000003673 Hospital Barwon Health "272-322 Ryrie S~ Australia Commercial sect~ Astra Zeneca
2 ACTRN12605000025639 Other Collaborat~ Australian Gastro~ "88 Mallett St\n~ Australia Commercial sect~ Roche Products ~
# ... with 6 more variables: funding_source_address <chr>, funding_source_country <chr>, secondary_sponsor_type <chr>,
# secondary_sponsor_name <chr>, secondary_sponsor_address <chr>, secondary_sponsor_country <chr>