How to flatten a hierarchical data structure given parent child relationships in R
I have data describing parent-child relationships:
df <- tibble::tribble(
~Child, ~Parent,
"Fruit", "Food",
"Vegetable", "Food",
"Apple", "Fruit",
"Banana", "Fruit",
"Pear", "Fruit",
"Carrot", "Vegetable",
"Celery", "Vegetable",
"Bike", "Not Food",
"Car", "Not Food"
)
df
#> # A tibble: 9 x 2
#> Child Parent
#> <chr> <chr>
#> 1 Fruit Food
#> 2 Vegetable Food
#> 3 Apple Fruit
#> 4 Banana Fruit
#> 5 Pear Fruit
#> 6 Carrot Vegetable
#> 7 Celery Vegetable
#> 8 Bike Not Food
#> 9 Car Not Food
Visually, this is a tree with Food and Not Food at the top.
Ultimately, I want to "flatten" it into a structure that looks more like this:
results <- tibble::tribble(
~Level.03, ~Level.02, ~Level.01,
"Apple", "Fruit", "Food",
"Banana", "Fruit", "Food",
"Pear", "Fruit", "Food",
NA, "Bike", "Not Food",
NA, "Car", "Not Food"
)
results
#> # A tibble: 5 x 3
#> Level.03 Level.02 Level.01
#> <chr> <chr> <chr>
#> 1 Apple Fruit Food
#> 2 Banana Fruit Food
#> 3 Pear Fruit Food
#> 4 <NA> Bike Not Food
#> 5 <NA> Car Not Food
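Note: not all elements have all levels. For example, Bike and Car have no Level.03 element.
I feel there should be a way to do this elegantly with tidyr, or with some kind of nest/unnest function from jsonlite. I started with recursive joins, but I feel like I'm reinventing the wheel and that there is probably a more direct way.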
In this particular case, you can get the desired result with a few joins and binds:
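library(dplyr)

# Self-join: attach each row's grandparent as Parent_top
df2 <- df %>%
  inner_join(df,
             by = c("Parent" = "Child"),
             suffix = c("", "_top"))

df %>%
  # Keep the rows whose parent has no parent of its own
  anti_join(df2, by = c("Child", "Parent")) %>%
  select(Parent_top = Parent, Parent = Child) %>%
  bind_rows(df2) %>%
  group_by(Parent_top, Parent) %>%
  # Drop the NA placeholder row when the group also has real children
  filter(!is.na(Child) | n() == 1) %>%
  ungroup() %>%
  select(Level_01 = Parent_top, Level_02 = Parent, Level_03 = Child)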
But I don't think this approach is very robust for larger or differently shaped datasets. A loop over the data may give you a better answer.
Here is a function with a while loop:
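fun <- function(s) {
  i <- 1
  while (i <= length(s)) {
    if (any(s[[i]] %in% names(s))) {
      # These children are parents themselves: pull in their own children
      nms <- s[[i]]
      s[[i]] <- stack(s[nms])
      s[nms] <- NULL
    } else {
      # Leaf-only group: there is no third level, so fill it with NA
      s[[i]] <- data.frame(values = NA_character_, ind = s[[i]])
    }
    i <- i + 1
  }
  s
}

# unstack(df) splits Child by Parent into a named list
dplyr::bind_rows(fun(unstack(df)), .id = 'Level.01')[c(2:3, 1)]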
values ind Level.01
1 Apple Fruit Food
2 Banana Fruit Food
3 Pear Fruit Food
4 Carrot Vegetable Food
5 Celery Vegetable Food
6 <NA> Bike Not Food
7 <NA> Car Not Food
If you have more levels, you can generalize this.
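One way to generalize to any number of levels is to walk from each leaf up to its root. The following is only a minimal sketch in base R, not the answerer's code: `flatten_hierarchy()` is a hypothetical helper name, and it assumes every child has exactly one parent and that the data contains no cycles.

```r
# Parent-child pairs from the question, as a plain data.frame
df <- data.frame(
  Child  = c("Fruit", "Vegetable", "Apple", "Banana", "Pear",
             "Carrot", "Celery", "Bike", "Car"),
  Parent = c("Food", "Food", "Fruit", "Fruit", "Fruit",
             "Vegetable", "Vegetable", "Not Food", "Not Food"),
  stringsAsFactors = FALSE
)

# Hypothetical helper (not from any package).
# Assumes one parent per child and no cycles.
flatten_hierarchy <- function(edges) {
  parent_of <- setNames(edges$Parent, edges$Child)
  leaves <- setdiff(edges$Child, edges$Parent)  # nodes that are never a parent

  # Prepend parents until the first element has no parent (the root),
  # which leaves each path ordered root -> leaf
  paths <- lapply(leaves, function(node) {
    path <- node
    while (path[1] %in% names(parent_of)) {
      path <- c(parent_of[[path[1]]], path)
    }
    path
  })

  depth <- max(lengths(paths))
  # Pad shorter paths with NA so every row has the same number of levels
  padded <- lapply(paths, function(p) {
    c(p, rep(NA_character_, depth - length(p)))
  })
  out <- as.data.frame(do.call(rbind, padded), stringsAsFactors = FALSE)
  names(out) <- sprintf("Level.%02d", seq_len(depth))
  out
}

flat <- flatten_hierarchy(df)
flat
```

Here Level.01 holds the root, as in the question's desired output, and shallower branches such as Bike and Car get NA in the deeper levels.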
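I would think of this as a graph problem. Make two changes to the original data to fit this approach: switch the column order to show the direction of the hierarchy (parent to child), and add a top-level node (I call it "Items") that links to the main groups (Food and Not Food). You could probably do the second part programmatically, but that seems like more trouble than it's worth.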
library(dplyr)

df <- tibble::tribble(
  ~Child, ~Parent,
  "Fruit", "Food",
  "Vegetable", "Food",
  "Apple", "Fruit",
  "Banana", "Fruit",
  "Pear", "Fruit",
  "Carrot", "Vegetable",
  "Celery", "Vegetable",
  "Bike", "Not Food",
  "Car", "Not Food"
) %>%
  select(Parent, Child) %>%
  add_row(Parent = "Items", Child = c("Food", "Not Food"))
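If you would rather not hard-code the top-level groups, one possible way to find them is to take the parents that never appear as a child. This is a sketch in base R under that assumption; the names `edges`, `roots` and `edges_rooted` are made up for illustration.

```r
# Parent -> child pairs from the question, before adding "Items"
edges <- data.frame(
  Parent = c("Food", "Food", "Fruit", "Fruit", "Fruit",
             "Vegetable", "Vegetable", "Not Food", "Not Food"),
  Child  = c("Fruit", "Vegetable", "Apple", "Banana", "Pear",
             "Carrot", "Celery", "Bike", "Car"),
  stringsAsFactors = FALSE
)

# Root nodes are parents that are nobody's child
roots <- setdiff(edges$Parent, edges$Child)

# Link every root to a single artificial top-level node
edges_rooted <- rbind(
  edges,
  data.frame(Parent = "Items", Child = roots, stringsAsFactors = FALSE)
)
edges_rooted
```

The `roots` vector could equally be passed to the `add_row()` call above in place of the literal `c("Food", "Not Food")`.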
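The first approach uses data.tree, which is designed for this kind of data. It builds a tree representation, which you can then convert back into a data frame in one of several shapes.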
library(data.tree)
g1 <- FromDataFrameNetwork(df)
g1
#> levelName
#> 1 Items
#> 2 ¦--Food
#> 3 ¦ ¦--Fruit
#> 4 ¦ ¦ ¦--Apple
#> 5 ¦ ¦ ¦--Banana
#> 6 ¦ ¦ °--Pear
#> 7 ¦ °--Vegetable
#> 8 ¦ ¦--Carrot
#> 9 ¦ °--Celery
#> 10 °--Not Food
#> 11 ¦--Bike
#> 12 °--Car
ToDataFrameTypeCol(g1)
#> level_1 level_2 level_3 level_4
#> 1 Items Food Fruit Apple
#> 2 Items Food Fruit Banana
#> 3 Items Food Fruit Pear
#> 4 Items Food Vegetable Carrot
#> 5 Items Food Vegetable Celery
#> 6 Items Not Food Bike <NA>
#> 7 Items Not Food Car <NA>
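The second approach is more involved and probably only makes sense if you need to do other graph operations. Build a graph with igraph, then get all paths in the graph that start from the top node, Items. This gives you a list of vertex objects; for each of them, extract the IDs. One of them is shown below as an example.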
library(igraph)
g2 <- graph_from_data_frame(df)
all_simple_paths(g2, from = "Items") %>%
  purrr::map(as_ids) %>%
  `[[`(4)
#> [1] "Items" "Food" "Fruit" "Banana"
Create a data frame from each of these vectors, bind them, and reshape to get one column per level.
all_simple_paths(g2, from = "Items") %>%
  purrr::map(as_ids) %>%
  purrr::map_dfr(tibble::enframe, .id = "row") %>%
  tidyr::pivot_wider(id_cols = row, names_prefix = "level_")
#> # A tibble: 11 × 5
#> row level_1 level_2 level_3 level_4
#> <chr> <chr> <chr> <chr> <chr>
#> 1 1 Items Food <NA> <NA>
#> 2 2 Items Food Fruit <NA>
#> 3 3 Items Food Fruit Apple
#> 4 4 Items Food Fruit Banana
#> 5 5 Items Food Fruit Pear
#> 6 6 Items Food Vegetable <NA>
#> 7 7 Items Food Vegetable Carrot
#> 8 8 Items Food Vegetable Celery
#> 9 9 Items Not Food <NA> <NA>
#> 10 10 Items Not Food Bike <NA>
#> 11 11 Items Not Food Car <NA>
In either case, drop the level 1 column if you don't actually need it.
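The igraph result above also contains partial paths that stop at an intermediate node such as Food or Fruit. A possible cleanup, sketched in base R, keeps only the rows that end in a leaf and then drops the level 1 column. To keep the example self-contained, the `paths` data frame below is typed in by hand to mirror the output shown above; `leaf_paths` and the other variable names are illustrative.

```r
# Hand-built copy of the igraph/pivot_wider output shown above
paths <- data.frame(
  row     = as.character(1:11),
  level_1 = rep("Items", 11),
  level_2 = c(rep("Food", 8), rep("Not Food", 3)),
  level_3 = c(NA, "Fruit", "Fruit", "Fruit", "Fruit",
              "Vegetable", "Vegetable", "Vegetable", NA, "Bike", "Car"),
  level_4 = c(NA, NA, "Apple", "Banana", "Pear",
              NA, "Carrot", "Celery", NA, NA, NA),
  stringsAsFactors = FALSE
)

m <- as.matrix(paths[, c("level_1", "level_2", "level_3", "level_4")])

# Non-NA nodes of each path, in root -> end order
node_lists <- lapply(seq_len(nrow(m)), function(i) unname(m[i, !is.na(m[i, ])]))

# Last node of every path
last_node <- vapply(node_lists, function(p) p[length(p)], character(1))

# A node is internal if it appears anywhere before the end of some path
internal <- unique(unlist(lapply(node_lists, function(p) p[-length(p)])))

# Keep only paths that end in a leaf, then drop the level 1 column
leaf_paths <- paths[!(last_node %in% internal), ]
leaf_paths$level_1 <- NULL
leaf_paths
```

The data.tree output from `ToDataFrameTypeCol()` already has one row per leaf, so only the column drop would apply there.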