How can I compute how many subfolders each folder has in a complex folder structure?

Consider the following tree:

library(data.tree)

acme <- Node$new("Acme Inc.")
    accounting <- acme$AddChild("Accounting")
        software <- accounting$AddChild("New Software")
        standards <- accounting$AddChild("New Accounting Standards")
    research <- acme$AddChild("Research")
        newProductLine <- research$AddChild("New Product Line")
        newLabs <- research$AddChild("New Labs")
    it <- acme$AddChild("IT")
        outsource <- it$AddChild("Outsource")
        agile <- it$AddChild("Go agile")
        goToR <- it$AddChild("Switch to R")

I then want to compute the averageBranchingFactor:

averageBranchingFactor(acme)

This yields 2.5.

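However, for various reasons, I would like to obtain all the individual branching factors, not just their average. For example, I need them to test statistically whether two file structures differ significantly in their mean branching factor.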

According to the manual for data.tree, the AverageBranchingFactor() function does the following: "calculate the average number of branches each non-leaf has." So I first tried the following:

acme.df <- ToDataFrameTree(acme, "averageBranchingFactor")
mean(acme.df$averageBranchingFactor[acme.df$averageBranchingFactor>0])

This yields 2.375. I then tried a simpler version:

mean(acme.df$averageBranchingFactor)

This yields 0.8636364.

How can I obtain all the individual branching factors, whose mean is 2.5?

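Ideally, I would like to create a data.frame that lists every folder, with a variable giving each folder's branching factor. For example, say I have this very simple folder structure: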

top_level_folder
    sub_folder_1
    sub_folder_2
        sub_folder_3

Answering this question would involve producing output like the following:

Folders             Subfolders (BranchingFactor)
top_level_folder    2
sub_folder_1        0
sub_folder_2        1
sub_folder_3        0

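The first column can be generated simply by calling list.dirs("/Users/username/Downloads/top_level/"), but I don't know how to generate the second column. Note that the second column is non-recursive, meaning that folders inside subfolders are not counted (i.e. top_level_folder contains only 2 subfolders, even though sub_folder_2 contains another folder, sub_folder_3).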

If you want to check whether your solution scales, download the Rails codebase from https://github.com/rails/rails/archive/master.zip and try it on Rails' more complex file structure.

You can simply loop over the folder structure and count the folders directly inside each one (without recursion):

dir.create("top_level_folder/sub_folder_2/sub_folder_3", recursive = TRUE)
dir.create("top_level_folder/sub_folder_1")


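# All folders below the working directory; the first entry is "." itself
dirs <- list.dirs()
branching_factor <- integer(length(dirs))
# seq_along() is safe even if dirs were empty, unlike 1:length(dirs)
for (i in seq_along(dirs)) {
    # Count direct subfolders only
    branching_factor[i] <- length(list.dirs(path = dirs[i], 
                                            full.names = FALSE, recursive = FALSE))
}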

result <- data.frame(Folders = basename(dirs), BranchingFactor = branching_factor)
result[-1,]

You can also use a shorter, more concise version of this code based on sapply:

dirs <- list.dirs()
branching_factor <- sapply(dirs, function(x) length(list.dirs(x, FALSE, FALSE)))
result2 <- data.frame(Folders = basename(dirs), BranchingFactor = branching_factor, 
                      row.names = NULL)[-1,]

The result looks like this:

> head(result2[rev(order(result2[,2])),])
          Folders BranchingFactor
208      fixtures              24
122      fixtures              23
42       fixtures              18
440      core_ext              17
340 active_record              17
562         rails              16
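For a quick sanity check against the small example from the question, the same idea can be run in a throwaway directory. This is only a sketch: the temporary root and the name `example` are assumptions made for illustration, and vapply is used in place of sapply so that the result is guaranteed to be an integer vector.

```r
# Build the question's small example structure under a fresh temporary folder.
root <- file.path(tempfile("branching_demo_"), "top_level_folder")
dir.create(file.path(root, "sub_folder_2", "sub_folder_3"), recursive = TRUE)
dir.create(file.path(root, "sub_folder_1"))

# Every folder, including the root itself.
dirs <- list.dirs(root, full.names = TRUE, recursive = TRUE)

# Count direct subfolders only (recursive = FALSE).
branching_factor <- vapply(
  dirs,
  function(d) length(list.dirs(d, full.names = FALSE, recursive = FALSE)),
  integer(1),
  USE.NAMES = FALSE
)

example <- data.frame(Folders = basename(dirs),
                      BranchingFactor = branching_factor,
                      stringsAsFactors = FALSE)
example
```

If the approach is right, top_level_folder should get 2, sub_folder_2 should get 1, and the other two folders 0, matching the table in the question.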

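I get the list of all folders recursively, then build a table of parent folder/subfolder pairs, from which I can count the number of subfolders per folder.

This misses the folders that have no subfolders, though, so I merge the counts back onto the initial folder list with a left join and fill the NAs with zeros.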

library(magrittr)  # provides the %>% pipe

path <- getwd()
# All folders below path, one per row
all_folders <- path %>% list.dirs(full.names=TRUE,recursive=TRUE) %>% 
  data.frame(stringsAsFactors=FALSE) %>% setNames("Folders")
# Split each path and keep its last two components: (parent, folder)
all_sub_folders <- all_folders$Folders %>%
  strsplit("/") %>%
  lapply(function(x){c(x[length(x)-1],x[length(x)])}) %>%
  do.call(rbind,.) %>%
  as.data.frame(stringsAsFactors=FALSE) %>%
  setNames(c("ParentFolders","Folders"))
# Count how often each folder name occurs as a parent
output <- all_sub_folders$ParentFolders %>% table %>% as.data.frame(stringsAsFactors=FALSE) %>% setNames(c("Folders","SubFolders"))
# Left join so that folders without subfolders are kept, then replace NA with 0
output <- merge(all_sub_folders,output,all.x = TRUE)[,c("Folders","SubFolders")]
output$SubFolders[is.na(output$SubFolders)] <- 0
# Restore the original folder order
output <- output[match(all_sub_folders$Folders,output$Folders),]

head(output)
#      Folders SubFolders
# 2160   Rhome        126
# 17   acepack          5
# 856     help          1
# 992     html          9
# 1486    libs        124
# 1130    i386          0
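One thing to keep in mind with this approach is that it keys on folder names rather than full paths, so folders that share a name in different places (several "help" folders, say) are counted together. A possible variant, sketched here under the assumption that full paths are acceptable as identifiers, tabulates dirname() of each path instead. The demo folder layout and the names `parent_counts` and `by_path` are hypothetical.

```r
# Hypothetical demo structure containing two different folders named "help".
root <- file.path(tempfile("parent_table_demo_"), "library")
dir.create(file.path(root, "pkg_a", "help"), recursive = TRUE)
dir.create(file.path(root, "pkg_b", "help", "html"), recursive = TRUE)
root <- normalizePath(root, winslash = "/")

dirs <- list.dirs(root, full.names = TRUE, recursive = TRUE)

# Every folder contributes one entry under its parent's full path.
parent_counts <- table(dirname(dirs))

# Look each folder up among the parents; folders that never appear as a
# parent have no subfolders, so their NA becomes 0.
sub_folders <- as.integer(parent_counts)[match(dirs, names(parent_counts))]
sub_folders[is.na(sub_folders)] <- 0L

by_path <- data.frame(Folders = dirs, SubFolders = sub_folders,
                      stringsAsFactors = FALSE)
by_path
```

Here the two "help" folders get separate rows and separate counts (0 and 1), at the cost of showing full paths instead of bare folder names.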

Just a correction to @Gilles's solution:

path <- "SO/rails-master/"
dirs <- list.dirs(path)
branching_factor <- vector(length = length(dirs))
for (i in 1:length(dirs)) {
   branching_factor[i] <- length(list.dirs(path = dirs[i], recursive = FALSE))
}

result <- data.frame(Folders = basename(dirs), BranchingFactor = branching_factor)

> head(result)
       Folders BranchingFactor
1 rails-master              14
2      .github               0
3  actioncable               4
4          app               1
5       assets               1
6  javascripts               1

Hope this helps.

You can adapt the same idea, calling list.dirs with recursive = FALSE instead of list.files:

library(purrr)

files <- .libPaths()[1] %>%    # omit for current directory or supply alternate path
    list.dirs() %>% 
    map_df(~list(path = .x, 
                 dirs = length(list.dirs(.x, recursive = FALSE))))

files
#> # A tibble: 4,457 x 2
#>                                                                           path  dirs
#>                                                                          <chr> <int>
#>  1              /Library/Frameworks/R.framework/Versions/3.4/Resources/library   314
#>  2        /Library/Frameworks/R.framework/Versions/3.4/Resources/library/abind     4
#>  3   /Library/Frameworks/R.framework/Versions/3.4/Resources/library/abind/help     0
#>  4   /Library/Frameworks/R.framework/Versions/3.4/Resources/library/abind/html     0
#>  5   /Library/Frameworks/R.framework/Versions/3.4/Resources/library/abind/Meta     0
#>  6      /Library/Frameworks/R.framework/Versions/3.4/Resources/library/abind/R     0
#>  7      /Library/Frameworks/R.framework/Versions/3.4/Resources/library/acepack     5
#>  8 /Library/Frameworks/R.framework/Versions/3.4/Resources/library/acepack/help     0
#>  9 /Library/Frameworks/R.framework/Versions/3.4/Resources/library/acepack/html     0
#> 10 /Library/Frameworks/R.framework/Versions/3.4/Resources/library/acepack/libs     1
#> # ... with 4,447 more rows

mean(files$dirs[files$dirs != 0])
#> [1] 2.952949

Or in base R:

files <- do.call(rbind, lapply(list.dirs(.libPaths()[1]), function(path){
    data.frame(path = path, 
               dirs = length(list.dirs(path, recursive = FALSE)), 
               stringsAsFactors = FALSE)
}))

head(files)
#>                                                                        path dirs
#> 1            /Library/Frameworks/R.framework/Versions/3.4/Resources/library  314
#> 2      /Library/Frameworks/R.framework/Versions/3.4/Resources/library/abind    4
#> 3 /Library/Frameworks/R.framework/Versions/3.4/Resources/library/abind/help    0
#> 4 /Library/Frameworks/R.framework/Versions/3.4/Resources/library/abind/html    0
#> 5 /Library/Frameworks/R.framework/Versions/3.4/Resources/library/abind/Meta    0
#> 6    /Library/Frameworks/R.framework/Versions/3.4/Resources/library/abind/R    0

mean(files$dirs[files$dirs != 0])
#> [1] 2.952949

averageBranchingFactor does not include leaves. Side note: you can get acme directly by using data(acme):

library(data.tree)
data(acme)
acme$averageBranchingFactor
acme$count
print(acme, abf = "averageBranchingFactor", "count")

This prints:

                          levelName abf count
1  Acme Inc.                        2.5     3
2   ¦--Accounting                   2.0     2
3   ¦   ¦--New Software             0.0     0
4   ¦   °--New Accounting Standards 0.0     0
5   ¦--Research                     2.0     2
6   ¦   ¦--New Product Line         0.0     0
7   ¦   °--New Labs                 0.0     0
8   °--IT                           3.0     3
9       ¦--Outsource                0.0     0
10      ¦--Go agile                 0.0     0
11      °--Switch to R              0.0     0
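The table above also accounts for the three numbers in the question. As a small check, the sketch below copies the abf and count columns by hand (the vector names are just for illustration) and reproduces each mean. Averaging the abf column mixes the root's already averaged 2.5 with the values of its children, whereas averaging count over the non-leaf nodes gives the 2.5 reported for the root.

```r
# Columns copied by hand from the printed tree, in row order 1 to 11.
abf   <- c(2.5, 2, 0, 0, 2, 0, 0, 3, 0, 0, 0)
count <- c(3,   2, 0, 0, 2, 0, 0, 3, 0, 0, 0)

mean_abf_nonzero <- mean(abf[abf > 0])      # (2.5 + 2 + 2 + 3) / 4 = 2.375
mean_abf_all     <- mean(abf)               # 9.5 / 11, about 0.8636
mean_count       <- mean(count[count > 0])  # (3 + 2 + 2 + 3) / 4 = 2.5

c(mean_abf_nonzero, mean_abf_all, mean_count)
```

So the individual branching factors asked for are the count values of the non-leaf nodes, not the per-node averageBranchingFactor values.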

There is nothing secret about how averageBranchingFactor is implemented, so you can adapt it to your needs. Just type averageBranchingFactor (without parentheses) in your console:

function (node) 
{
    t <- Traverse(node, filterFun = isNotLeaf)
    if (length(t) == 0) 
        return(0)
    cnt <- Get(t, "count")
    if (!is.numeric(cnt)) 
        browser()
    return(mean(cnt))
}

In short, we traverse the tree (excluding the leaves) and get the count value of each node. Finally, we compute the mean.
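To make those steps concrete without relying on data.tree, here is a minimal base R sketch. The nested list `acme_list` and the helper `child_counts` are hypothetical stand-ins written for this illustration and are not part of the package; each list element represents a child node, and an empty list is a leaf.

```r
# Hypothetical nested-list stand-in for the acme tree.
acme_list <- list(
  "Accounting" = list("New Software" = list(),
                      "New Accounting Standards" = list()),
  "Research"   = list("New Product Line" = list(),
                      "New Labs" = list()),
  "IT"         = list("Outsource" = list(),
                      "Go agile" = list(),
                      "Switch to R" = list())
)

# Depth-first traversal returning the number of direct children of every node.
child_counts <- function(node, name) {
  counts <- length(node)
  names(counts) <- name
  for (child in names(node)) {
    counts <- c(counts, child_counts(node[[child]], child))
  }
  counts
}

counts   <- child_counts(acme_list, "Acme Inc.")
non_leaf <- counts[counts > 0]   # drop the leaves, as isNotLeaf does
non_leaf                         # the individual branching factors
mean(non_leaf)                   # their average
```

The vector `non_leaf` holds one branching factor per non-leaf node, which is the kind of per-node data that could then be compared between two structures.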

Hope this helps.