How to extract xml_attr and xml_text on different levels with xml2 and purrr?
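I want to extract information from an XML file and convert it into a data frame.
The information is stored in nested nodes, both as XML text and as XML attributes.
Example structure: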
<xmlnode node-id = "Text about xmlnode">
  <xmlsubnode subnode-id = "123">
    <xmlsubsubnode>
      I want to extract this text
    </xmlsubsubnode>
    <xmlsubsubnode>
      I want to extract this text
    </xmlsubsubnode>
    <xmlsubsubnode>
      I want to extract this text
    </xmlsubsubnode>
    <xmlsubsubnode>
      I want to extract this text
    </xmlsubsubnode>
  </xmlsubnode>
  <xmlsubnode subnode-id = "456">
    <xmlsubsubnode>
      I want to extract this text
    </xmlsubsubnode>
    <xmlsubsubnode>
      I want to extract this text
    </xmlsubsubnode>
    <xmlsubsubnode>
      I want to extract this text
    </xmlsubsubnode>
    <xmlsubsubnode>
      I want to extract this text
    </xmlsubsubnode>
  </xmlsubnode>
</xmlnode>
<xmlnode node-id = "Text about xmlnode">
  <xmlsubnode subnode-id = "123">
    <xmlsubsubnode>
      I want to extract this text
    </xmlsubsubnode>
    <xmlsubsubnode>
      I want to extract this text
    </xmlsubsubnode>
    <xmlsubsubnode>
      I want to extract this text
    </xmlsubsubnode>
    <xmlsubsubnode>
      I want to extract this text
    </xmlsubsubnode>
  </xmlsubnode>
  <xmlsubnode subnode-id = "456">
    <xmlsubsubnode>
      I want to extract this text
    </xmlsubsubnode>
    <xmlsubsubnode>
      I want to extract this text
    </xmlsubsubnode>
    <xmlsubsubnode>
      I want to extract this text
    </xmlsubsubnode>
    <xmlsubsubnode>
      I want to extract this text
    </xmlsubsubnode>
  </xmlsubnode>
</xmlnode>
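To reproduce the problem without the file, the sample can be parsed from a string. A well-formed XML document has exactly one root element, and the sample as shown has two top-level `xmlnode` elements, so the sketch below wraps a shortened copy of it in a `<root>` element. That wrapper name is an assumption, because the real file's root element is not shown.

```r
library(xml2)

# Shortened copy of the sample: two <xmlnode> elements, one leaf per <xmlsubnode>.
fragment <- '
<xmlnode node-id = "Text about xmlnode">
  <xmlsubnode subnode-id = "123">
    <xmlsubsubnode>I want to extract this text</xmlsubsubnode>
  </xmlsubnode>
  <xmlsubnode subnode-id = "456">
    <xmlsubsubnode>I want to extract this text</xmlsubsubnode>
  </xmlsubnode>
</xmlnode>
<xmlnode node-id = "Text about xmlnode">
  <xmlsubnode subnode-id = "123">
    <xmlsubsubnode>I want to extract this text</xmlsubsubnode>
  </xmlsubnode>
  <xmlsubnode subnode-id = "456">
    <xmlsubsubnode>I want to extract this text</xmlsubsubnode>
  </xmlsubnode>
</xmlnode>'

# <root> is a hypothetical wrapper so the fragment parses as one document.
doc <- read_xml(paste0("<root>", fragment, "</root>"))

# Count the elements found on each of the three levels.
n_nodes    <- length(xml_find_all(doc, "//xmlnode"))
n_subnodes <- length(xml_find_all(doc, "//xmlsubnode"))
n_leaves   <- length(xml_find_all(doc, "//xmlsubsubnode"))
```

The three counts are a quick check that every level of the hierarchy was parsed before any extraction is attempted.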
I want to get this information:
* node-id (attribute)
* subnode-id (attribute)
* text in `xmlsubsubnode` (text)
I need a data frame in long format like this:
node-id               subnode-id  text
Text about xmlnode 1  123         I want to extract this text
Text about xmlnode 1  123         I want to extract this text
Text about xmlnode 1  123         I want to extract this text
Text about xmlnode 1  123         I want to extract this text
Text about xmlnode 1  456         I want to extract this text
Text about xmlnode 1  456         I want to extract this text
Text about xmlnode 1  456         I want to extract this text
Text about xmlnode 1  456         I want to extract this text
Text about xmlnode 2  123         I want to extract this text
Text about xmlnode 2  123         I want to extract this text
Text about xmlnode 2  123         I want to extract this text
Text about xmlnode 2  123         I want to extract this text
Text about xmlnode 2  456         I want to extract this text
Text about xmlnode 2  456         I want to extract this text
Text about xmlnode 2  456         I want to extract this text
Text about xmlnode 2  456         I want to extract this text
I tried to follow Jenny Bryan's approach, "How to tame XML with nested data frames and purrr", but it only works for the first level.
xml <- xml2::read_xml("input/example.xml")

rows <-
  xml %>%
  xml_find_all("//xmlnode")

rows_df <- data_frame(row = seq_along(rows), nodeset = rows)

rows_df %>%
  mutate(node_id = nodeset %>% map(~ xml_attr(., "node-id"))) %>%
  select(row, node_id) %>%
  unnest()
Do you have any ideas on how to get this information with purrr?
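Here is an approach that avoids unnesting or appending rows to another data frame: create a data frame with one row per `xmlsubsubnode`, then use purrr together with xml2 to select the `xmlsubnode` parent and the `xmlnode` ancestor of each row and extract their values.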
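The two lookups can be tried on a single leaf first. This is a minimal sketch on a hypothetical one-leaf document, where the `<root>` wrapper is an assumption. `xml_parent()` steps up one level, and the XPath `ancestor::xmlnode` finds the enclosing `xmlnode`.

```r
library(xml2)

# Hypothetical miniature input; <root> is an assumed wrapper element.
doc <- read_xml(paste0(
  '<root><xmlnode node-id="Text about xmlnode">',
  '<xmlsubnode subnode-id="123">',
  '<xmlsubsubnode>I want to extract this text</xmlsubsubnode>',
  '</xmlsubnode></xmlnode></root>'
))

leaf <- xml_find_first(doc, "//xmlsubsubnode")

# Direct parent: the <xmlsubnode> that carries subnode-id.
parent_node <- xml_parent(leaf)
# Enclosing <xmlnode>, found via the XPath ancestor axis.
ancestor_node <- xml_find_first(leaf, "ancestor::xmlnode")

subnode_id <- xml_attr(parent_node, "subnode-id")
node_id    <- xml_attr(ancestor_node, "node-id")
leaf_text  <- xml_text(leaf)
```

Starting from the deepest level means every row already knows its own context, so nothing has to be unnested afterwards.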
Working example:
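library(dplyr)
library(xml2)
library(purrr)
library(tidyr)

xml <- xml2::read_xml("input/example.xml")

# One row per <xmlsubsubnode>; everything else is looked up from each node.
rows <- xml %>% xml_find_all("//xmlsubsubnode")

rows_df <- tibble(node = rows) %>%
  # node-id lives on the enclosing <xmlnode> ancestor
  mutate(node_id = node %>% map(~ xml_find_first(., "ancestor::xmlnode")) %>% map_chr(~ xml_attr(., "node-id"))) %>%
  # subnode-id lives on the direct parent <xmlsubnode>
  mutate(subnode_id = node %>% map(~ xml_parent(.)) %>% map_chr(~ xml_attr(., "subnode-id"))) %>%
  # map_chr() gives plain character columns instead of list-columns
  mutate(text = node %>% map_chr(~ xml_text(., trim = TRUE))) %>%
  select(-node)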
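As an alternative to the purrr pipeline above, the same lookups can be written with xml2's vectorised functions alone, which skips the list-column of nodes. This is a sketch rather than the answer's method. The inline document, the `<root>` wrapper and the `node_index` column are assumptions. `node_index` reflects one reading of the desired table, where the trailing 1 and 2 look like the position of each `xmlnode` rather than part of the attribute.

```r
library(xml2)

# Hypothetical, shortened inline copy of the sample; <root> is an assumed wrapper.
doc <- read_xml('<root>
  <xmlnode node-id="Text about xmlnode">
    <xmlsubnode subnode-id="123">
      <xmlsubsubnode>
        I want to extract this text
      </xmlsubsubnode>
    </xmlsubnode>
    <xmlsubnode subnode-id="456">
      <xmlsubsubnode>
        I want to extract this text
      </xmlsubsubnode>
    </xmlsubnode>
  </xmlnode>
  <xmlnode node-id="Text about xmlnode">
    <xmlsubnode subnode-id="123">
      <xmlsubsubnode>
        I want to extract this text
      </xmlsubsubnode>
    </xmlsubnode>
  </xmlnode>
</root>')

leaves <- xml_find_all(doc, "//xmlsubsubnode")

# xml_find_first() returns one result per input node, so each vector
# lines up with `leaves` row for row.
node_id    <- xml_attr(xml_find_first(leaves, "ancestor::xmlnode"), "node-id")
subnode_id <- xml_attr(xml_find_first(leaves, "parent::xmlsubnode"), "subnode-id")
text       <- xml_text(leaves, trim = TRUE)

# Position of the enclosing <xmlnode> among its siblings (1, 2, ...).
node_index <- vapply(
  leaves,
  function(leaf) {
    xml_find_num(leaf, "count(ancestor::xmlnode/preceding-sibling::xmlnode) + 1")
  },
  numeric(1)
)

result <- data.frame(node_index, node_id, subnode_id, text,
                     stringsAsFactors = FALSE)
```

`parent::xmlsubnode` is used instead of `xml_parent()` because `xml_find_first()` always returns output the same size as its input, which keeps the columns aligned when several leaves share a parent.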