Loop over a df and retrieve the unique values linked to each unique value in another column

I sub-categorized and categorized my labels in Excel, but I want the process to be reproducible, so I'd like to convert it to R code.

I have a df with 631 rows; the first 15 rows look like this.

   IV_label               Subcategory            Category                         
   <chr>                  <chr>                  <chr>                            
 1 light conditions       time of day            exogenous                        
 2 vital status           victim characteristics human involvement 
 3 road type              road type              exogenous                        
 4 reserve density        workload               police discretion                
 5 road type              road type              exogenous                        
 6 surface type           road type              exogenous                        
 7 surface characteristic road type              exogenous                        
 8 light conditions       time of day            exogenous                        
 9 light conditions       time of day            exogenous                        
10 weather                weather type           exogenous                        
11 weather                weather type           exogenous                        
12 weather                weather type           exogenous                        
13 day of the week        day of the week        exogenous                        
14 amount of lanes        road type              exogenous                        
15 amount of lanes        road type              exogenous 

I'd like to be able to add the following to my R code without having to build the lists myself:

time of day                 <- list(light conditions, ...)
victim characteristics      <- list(vital status, ...)
road type                   <- list(road type, surface type, surface characteristic, amount of lanes, ...) (# notice road type is included only once!)
workload                    <- list(reserve density, ...)
weather type                <- list(weather, ...)
day of the week             <- list(day of the week, ...)
exogenous                   <- list(time of day, road type, weather type, day of the week)
human involvement           <- list(victim characteristics)
police discretion           <- list(workload)

I know I'll have to write this part of the boilerplate myself:

time of day                 <- list(
victim characteristics      <- list(
road type                   <- list(
workload                    <- list(
weather type                <- list(
day of the week             <- list(
exogenous                   <- list(
human involvement           <- list(
police discretion           <- list(

But I'd like to be able to copy the unique values from the console and paste them into the boilerplate above.

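Here I treat any pair of terms that appear on the same row, in two consecutive columns, as an edge. I use an adjacency matrix adj to keep track of the edges, then rebuild the graph as a named list: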

library(purrr)

df <- data.frame(IV_label = c(
                   "light conditions","vital status","road type",
                   "reserve density","road type","surface type",
                   "surface characteristic","light conditions","light conditions",
                   "weather","weather","weather",
                   "day of the week","amount of lanes","amount of lanes"),
                 Subcategory = c(
                   "time of day","victim characteristics","road type",
                   "workload","road type","road type",
                   "road type","time of day","time of day",
                   "weather type","weather type","weather type",
                   "day of the week","road type","road type"),
                 Category = c(
                   "exogenous","human involvement","exogenous",
                   "police discretion","exogenous","exogenous",
                   "exogenous","exogenous","exogenous",
                   "exogenous","exogenous","exogenous",
                   "exogenous","exogenous","exogenous"))



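## all distinct terms across the three columns
## (df[[.x]] avoids needing dplyr::pull, which is not loaded here)
names <- c("IV_label", "Subcategory", "Category") |>
  purrr::map(~df[[.x]]) |>
  purrr::reduce(union)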

## adjacency matrix
adj <- matrix(0,
              nrow = length(names),
              ncol = length(names),
              dimnames = list(names, names))

adj[cbind(df[,2], df[,1])] <- 1
adj[cbind(df[,3], df[,2])] <- 1

setNames(asplit(adj, 1),names) |>
  purrr::map(~names[which(.x == 1)]) |>
  purrr::keep(~length(.x) > 0)

Output:

$`road type`
[1] "road type"              "surface type"           "surface characteristic"
[4] "amount of lanes"       

$`day of the week`
[1] "day of the week"

$`time of day`
[1] "light conditions"

$`victim characteristics`
[1] "vital status"

$workload
[1] "reserve density"

$`weather type`
[1] "weather"

$exogenous
[1] "road type"       "day of the week" "time of day"     "weather type"   

$`human involvement`
[1] "victim characteristics"

$`police discretion`
[1] "workload"
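
As a cross-check on the output above, the same grouping can be sketched in base R with split() and unique(), which is a different technique from the adjacency matrix. This is only a minimal sketch: toy_df is a hypothetical cut-down copy of a few rows from the question, and the result names by_subcategory and by_category are made up for illustration.

```r
# toy_df: hypothetical subset of the rows shown in the question
toy_df <- data.frame(
  IV_label    = c("light conditions", "road type", "surface type",
                  "road type", "weather"),
  Subcategory = c("time of day", "road type", "road type",
                  "road type", "weather type"),
  Category    = c("exogenous", "exogenous", "exogenous",
                  "exogenous", "exogenous")
)

# group the labels by subcategory, then drop duplicates within each group
by_subcategory <- lapply(split(toy_df$IV_label, toy_df$Subcategory), unique)

# same idea one level up: subcategories grouped by category
by_category <- lapply(split(toy_df$Subcategory, toy_df$Category), unique)

# one named list covering both levels
c(by_subcategory, by_category)
```

Like the adjacency matrix version, this keeps self-references such as road type listed under road type. Note that split() orders the groups by sorted group name rather than by first appearance.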

You may want to zero out the diagonal of adj to avoid self-referencing edges:

adj[row(adj) == col(adj)] <- 0

setNames(asplit(adj, 1),names) |>
  purrr::map(~names[which(.x == 1)]) |>
  purrr::keep(~length(.x) > 0)

Output:

$`road type`
[1] "surface type"           "surface characteristic" "amount of lanes"       

$`time of day`
[1] "light conditions"

$`victim characteristics`
[1] "vital status"

$workload
[1] "reserve density"

$`weather type`
[1] "weather"

$exogenous
[1] "road type"       "day of the week" "time of day"     "weather type"   

$`human involvement`
[1] "victim characteristics"

$`police discretion`
[1] "workload"
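
Since the question asks for something that can be copied from the console into the boilerplate, here is a rough sketch of turning such a named list into assignment lines. The groups object is a hypothetical stand-in holding a few entries copied from the output above, and to_assignment is a made-up helper name, not part of any package.

```r
# groups: hypothetical stand-in for the named list printed above
groups <- list(
  `road type`   = c("surface type", "surface characteristic", "amount of lanes"),
  `time of day` = "light conditions",
  exogenous     = c("road type", "day of the week", "time of day", "weather type")
)

# build one line of R source per group;
# names containing spaces are non-syntactic, so they are wrapped in backticks
to_assignment <- function(name, members) {
  sprintf("`%s` <- list(%s)", name, paste0('"', members, '"', collapse = ", "))
}

code_lines <- unlist(Map(to_assignment, names(groups), groups), use.names = FALSE)

# print the lines so they can be copied from the console
cat(code_lines, sep = "\n")
```

The printed lines are valid R, so they can be pasted straight into a script. If a single list() call is acceptable instead of one assignment per group, dput(groups) prints the whole structure in a pasteable form.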