How to flatten a hierarchical data structure given parent-child relationships in R

I have data that describes parent-child relationships:

df <- tibble::tribble(
       ~Child,     ~Parent,
      "Fruit",      "Food",
  "Vegetable",      "Food",
      "Apple",     "Fruit",
     "Banana",     "Fruit",
       "Pear",     "Fruit",
     "Carrot", "Vegetable",
     "Celery", "Vegetable",
       "Bike",  "Not Food",
        "Car",  "Not Food"
  )
df
#> # A tibble: 9 x 2
#>   Child     Parent   
#>   <chr>     <chr>    
#> 1 Fruit     Food     
#> 2 Vegetable Food     
#> 3 Apple     Fruit    
#> 4 Banana    Fruit    
#> 5 Pear      Fruit    
#> 6 Carrot    Vegetable
#> 7 Celery    Vegetable
#> 8 Bike      Not Food 
#> 9 Car       Not Food

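Visually, this looks like:

[figure: tree diagram of the hierarchy above]

Ultimately, my desired result is to "flatten" this into a structure that looks more like this: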

results <- tibble::tribble(
             ~Level.03, ~Level.02,  ~Level.01,
               "Apple",   "Fruit",     "Food",
              "Banana",   "Fruit",     "Food",
                "Pear",   "Fruit",     "Food",
                    NA,    "Bike", "Not Food",
                    NA,     "Car", "Not Food"
             )
results
#> # A tibble: 5 x 3
#>   Level.03 Level.02 Level.01
#>   <chr>    <chr>    <chr>   
#> 1 Apple    Fruit    Food    
#> 2 Banana   Fruit    Food    
#> 3 Pear     Fruit    Food    
#> 4 <NA>     Bike     Not Food
#> 5 <NA>     Car      Not Food

Note: not all elements have all levels. For example, Bike and Car have no Level.03 element.

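I feel like there should be a way to do this elegantly with tidyr, or with some kind of nest/unnest function from jsonlite. I started with recursive joins, but I feel like I am reinventing the wheel and there is probably a straightforward approach.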

In this particular case, you can get the desired result with a few joins and binds:

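library(dplyr)

# Self-join: attach each row's grandparent as Parent_top
# (only rows whose Parent also appears as a Child survive)
df2 <- df %>% 
  inner_join(df, 
             by = c("Parent" = "Child"),
             suffix = c("", "_top")) 

df %>% 
  # rows with no grandparent, i.e. direct children of a top-level node
  anti_join(df2, by = c("Child", "Parent")) %>%
  select(Parent_top = Parent, Parent = Child) %>% 
  bind_rows(df2) %>%
  group_by(Parent_top, Parent) %>% 
  # drop the NA placeholder row when the group also has real children
  filter(!is.na(Child) | n() == 1) %>% 
  ungroup() %>%
  select(Level_01 = Parent_top, Level_02 = Parent, Level_03 = Child)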

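But I don't think this approach is very robust for larger or differently shaped datasets. A loop over the dataset may give you a better answer.

Here is a function with a while loop: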

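# s is a named list mapping each parent to its children, as made by unstack(df)
fun <- function(s){
  i <- 1
  while(i<=length(s)){
    if(any(s[[i]] %in% names(s)))
    {
      # Some children are parents themselves: stack their entries into a
      # data frame (values = grandchildren, ind = children), then remove them
      nms <- s[[i]]
      s[[i]] <- stack(s[nms])
      s[nms] <- NULL
    }
    else
      # Children are leaves: keep them in ind, with NA for the missing level
      s[[i]] <- data.frame(values = NA, ind = s[[i]])
    i <- i+1
  }
  s
}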

dplyr::bind_rows(fun(unstack(df)), .id = 'Level.01')[c(2:3,1)]
  values       ind Level.01
1  Apple     Fruit     Food
2 Banana     Fruit     Food
3   Pear     Fruit     Food
4 Carrot Vegetable     Food
5 Celery Vegetable     Food
6   <NA>      Bike Not Food
7   <NA>       Car Not Food

If you have more levels, you can generalize this.
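
One possible way to generalize to any depth, shown here only as a base-R sketch and not as the method used in the answer above: start at each leaf and follow parent links upward until a node has no parent. The helper name `flatten_hierarchy` and the data frame `edges` are hypothetical, and the sketch assumes every child has exactly one parent and the data contains no cycles.

```r
# Same Child/Parent pairs as in the question, as a plain data frame.
edges <- data.frame(
  Child  = c("Fruit", "Vegetable", "Apple", "Banana", "Pear",
             "Carrot", "Celery", "Bike", "Car"),
  Parent = c("Food", "Food", "Fruit", "Fruit", "Fruit",
             "Vegetable", "Vegetable", "Not Food", "Not Food"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: one row per leaf, columns ordered from root to leaf.
# Assumes a single parent per child and no cycles.
flatten_hierarchy <- function(edges) {
  # Named lookup vector: parent_of["Apple"] is "Fruit".
  parent_of <- setNames(edges$Parent, edges$Child)
  # Leaves are nodes that never appear as a parent.
  leaves <- setdiff(edges$Child, edges$Parent)

  # Prepend parents until the first element has no parent (it is a root).
  paths <- lapply(leaves, function(node) {
    path <- node
    while (path[1] %in% names(parent_of)) {
      path <- c(unname(parent_of[path[1]]), path)
    }
    path
  })

  # Pad shorter paths with NA so every row has the same number of levels.
  depth <- max(lengths(paths))
  padded <- lapply(paths, function(p) c(p, rep(NA_character_, depth - length(p))))
  out <- as.data.frame(do.call(rbind, padded), stringsAsFactors = FALSE)
  names(out) <- sprintf("Level.%02d", seq_len(depth))
  out
}

res <- flatten_hierarchy(edges)
res
```

Here Level.01 is the root, so shallow branches such as Not Food / Bike get their NA in the last column, the same layout as the output of the while-loop answer.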

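I would think of this as a graph problem. I made 2 changes to the original data to fit this approach: I switched the order of the columns to show the direction of the hierarchy (parent to child), and I added a top-level node (which I called "Items") that links the main groups (Food and Not Food). You could probably do the second part programmatically, but that seemed like more trouble than it was worth.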

library(dplyr)

df <- tibble::tribble(
  ~Child,     ~Parent,
  "Fruit",      "Food",
  "Vegetable",      "Food",
  "Apple",     "Fruit",
  "Banana",     "Fruit",
  "Pear",     "Fruit",
  "Carrot", "Vegetable",
  "Celery", "Vegetable",
  "Bike",  "Not Food",
  "Car",  "Not Food"
) %>%
  select(Parent, Child) %>%
  add_row(Parent = "Items", Child = c("Food", "Not Food"))
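
If adding the top-level node by hand is inconvenient, one way to do it programmatically is to treat every parent that never appears as a child as a root. This is a base-R sketch rather than part of the answer's approach; `edges`, `roots`, and `with_top` are illustrative names, and "Items" is just the arbitrary label used above.

```r
# Same Child/Parent pairs as in the question, as a plain data frame.
edges <- data.frame(
  Child  = c("Fruit", "Vegetable", "Apple", "Banana", "Pear",
             "Carrot", "Celery", "Bike", "Car"),
  Parent = c("Food", "Food", "Fruit", "Fruit", "Fruit",
             "Vegetable", "Vegetable", "Not Food", "Not Food"),
  stringsAsFactors = FALSE
)

# Roots are parents that never appear as a child.
roots <- setdiff(edges$Parent, edges$Child)

# Reorder to Parent, Child and link every root to one artificial top node.
with_top <- rbind(
  edges[, c("Parent", "Child")],
  data.frame(Parent = "Items", Child = roots, stringsAsFactors = FALSE)
)
with_top
```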

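The first approach uses data.tree, which is designed for working with this type of data. It creates a tree representation that you can then convert back into a data frame in one of several shapes.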

library(data.tree)

g1 <- FromDataFrameNetwork(df)
g1
#>             levelName
#> 1  Items             
#> 2   ¦--Food          
#> 3   ¦   ¦--Fruit     
#> 4   ¦   ¦   ¦--Apple 
#> 5   ¦   ¦   ¦--Banana
#> 6   ¦   ¦   °--Pear  
#> 7   ¦   °--Vegetable 
#> 8   ¦       ¦--Carrot
#> 9   ¦       °--Celery
#> 10  °--Not Food      
#> 11      ¦--Bike      
#> 12      °--Car
ToDataFrameTypeCol(g1)
#>   level_1  level_2   level_3 level_4
#> 1   Items     Food     Fruit   Apple
#> 2   Items     Food     Fruit  Banana
#> 3   Items     Food     Fruit    Pear
#> 4   Items     Food Vegetable  Carrot
#> 5   Items     Food Vegetable  Celery
#> 6   Items Not Food      Bike    <NA>
#> 7   Items Not Food       Car    <NA>

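The second approach is more complicated, and probably only makes sense if you need to do other graph operations as well. Build a graph with igraph, then get all paths in the graph that start from the top node, Items. This gives you a list of vertex sequence objects; for each of them, extract the IDs. Below is one of them as an example.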

library(igraph)
g2 <- graph_from_data_frame(df)
all_simple_paths(g2, from = "Items") %>%
  purrr::map(as_ids) %>%
  `[[`(4)
#> [1] "Items"  "Food"   "Fruit"  "Banana"

Create a data frame from each of these vectors, bind them together, and reshape to get one column per level.

all_simple_paths(g2, from = "Items") %>%
  purrr::map(as_ids) %>%
  purrr::map_dfr(tibble::enframe, .id = "row") %>%
  tidyr::pivot_wider(id_cols = row, names_prefix = "level_")
#> # A tibble: 11 × 5
#>    row   level_1 level_2  level_3   level_4
#>    <chr> <chr>   <chr>    <chr>     <chr>  
#>  1 1     Items   Food     <NA>      <NA>   
#>  2 2     Items   Food     Fruit     <NA>   
#>  3 3     Items   Food     Fruit     Apple  
#>  4 4     Items   Food     Fruit     Banana 
#>  5 5     Items   Food     Fruit     Pear   
#>  6 6     Items   Food     Vegetable <NA>   
#>  7 7     Items   Food     Vegetable Carrot 
#>  8 8     Items   Food     Vegetable Celery 
#>  9 9     Items   Not Food <NA>      <NA>   
#> 10 10    Items   Not Food Bike      <NA>   
#> 11 11    Items   Not Food Car       <NA>
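
The result above also contains the paths that stop at intermediate nodes (rows 1, 2, 6 and 9). If only complete branches are wanted, one option is to limit the destinations to leaf vertices, meaning vertices with no outgoing edges. The following is a sketch under that assumption; `edges`, `leaves`, and `leaf_paths` are illustrative names.

```r
library(igraph)

# Parent -> Child edges, including the artificial "Items" top node.
edges <- data.frame(
  Parent = c("Food", "Food", "Fruit", "Fruit", "Fruit",
             "Vegetable", "Vegetable", "Not Food", "Not Food",
             "Items", "Items"),
  Child  = c("Fruit", "Vegetable", "Apple", "Banana", "Pear",
             "Carrot", "Celery", "Bike", "Car",
             "Food", "Not Food"),
  stringsAsFactors = FALSE
)
g <- graph_from_data_frame(edges)

# Leaves are vertices with out-degree 0.
leaves <- V(g)[degree(g, mode = "out") == 0]

# Only keep simple paths that run from the top node to a leaf.
paths <- all_simple_paths(g, from = "Items", to = leaves)
leaf_paths <- lapply(paths, as_ids)
length(leaf_paths)
```

The resulting list can be passed through the same enframe and pivot_wider steps as above.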

In either case, drop the level 1 column if you don't actually need it.
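
As a small illustration of that last step, the sketch below drops the artificial top level and renames the remaining columns to the Level.0x names used in the question. The data frame `flat` is typed in by hand and copies three rows of the ToDataFrameTypeCol() output above; it stands in for the real result.

```r
# Three rows copied from the ToDataFrameTypeCol() output shown earlier.
flat <- data.frame(
  level_1 = c("Items", "Items", "Items"),
  level_2 = c("Food", "Food", "Not Food"),
  level_3 = c("Fruit", "Vegetable", "Bike"),
  level_4 = c("Apple", "Carrot", NA),
  stringsAsFactors = FALSE
)

# Drop the artificial top level, then renumber the remaining columns.
res <- flat[, -1]
names(res) <- sprintf("Level.%02d", seq_along(res))

# Put the deepest level first, as in the layout requested in the question.
res <- res[, rev(names(res))]
res
```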