Determine the number of processes running each day and the mean number of days since those processes commenced, in R

I have a large dataset of processes (their IDs), start dates, and the corresponding end dates.

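What I want has two parts. First, how many processes are running on each day. Second, the mean number of days since commencement for the processes running on each day.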

A sample dataset looks like this:

> dput(df)
structure(list(Process = c("P001", "P002", "P003", "P004", "P005"
), Start = c("01-01-2020", "02-01-2020", "03-01-2020", "08-01-2020", 
"13-01-2020"), End = c("10-01-2020", "09-01-2020", "04-01-2020", 
"17-01-2020", "19-01-2020")), class = "data.frame", row.names = c(NA, 
-5L))

> df
  Process      Start        End
1    P001 01-01-2020 10-01-2020
2    P002 02-01-2020 09-01-2020
3    P003 03-01-2020 04-01-2020
4    P004 08-01-2020 17-01-2020
5    P005 13-01-2020 19-01-2020

For the first part, I proceeded like this:

library(tidyverse)

df %>% pivot_longer(cols = c(Start, End), names_to = 'event', values_to = 'dates') %>%
  mutate(dates = as.Date(dates, format = "%d-%m-%Y")) %>%
  mutate(dates = if_else(event == 'End', dates+1, dates)) %>%
  arrange(dates, event) %>%
  mutate(processes = ifelse(event == 'Start', 1, -1),
         processes = cumsum(processes)) %>%
  select(-Process, -event) %>%
  complete(dates = seq.Date(min(dates), max(dates), by = '1 day')) %>%
  fill(processes)

# A tibble: 20 x 2
   dates      processes
   <date>         <dbl>
 1 2020-01-01         1
 2 2020-01-02         2
 3 2020-01-03         3
 4 2020-01-04         3
 5 2020-01-05         2
 6 2020-01-06         2
 7 2020-01-07         2
 8 2020-01-08         3
 9 2020-01-09         3
10 2020-01-10         2
11 2020-01-11         1
12 2020-01-12         1
13 2020-01-13         2
14 2020-01-14         2
15 2020-01-15         2
16 2020-01-16         2
17 2020-01-17         2
18 2020-01-18         1
19 2020-01-19         1
20 2020-01-20         0
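
As a cross-check of the counts above, here is a minimal base R sketch that counts, for each calendar day, the processes whose Start and End dates bracket that day. It assumes End dates are inclusive, which matches the `dates + 1` shift in the pipeline. The names `start`, `end`, `days`, and `running` are introduced here for illustration only.

```r
# Sample data from the question
df <- data.frame(
  Process = c("P001", "P002", "P003", "P004", "P005"),
  Start = c("01-01-2020", "02-01-2020", "03-01-2020", "08-01-2020", "13-01-2020"),
  End = c("10-01-2020", "09-01-2020", "04-01-2020", "17-01-2020", "19-01-2020")
)

# Parse the day-month-year strings into Date objects
start <- as.Date(df$Start, format = "%d-%m-%Y")
end <- as.Date(df$End, format = "%d-%m-%Y")

# Every calendar day from the earliest Start to the latest End
days <- seq(min(start), max(end), by = "day")

# Assumption: a process counts as running on both its Start and its End date
running <- vapply(days, function(d) sum(start <= d & d <= end), integer(1))

data.frame(dates = days, processes = running)
```

This sketch stops at the last End date (2020-01-19), so it has 19 rows, whereas the pipeline above also emits a trailing 2020-01-20 row with 0 processes.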

For the second part, the desired output is something like the column mean days in the following screenshot, with an explanation.
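
The screenshot is not reproduced here, so the following is a minimal sketch of one reading of "mean days". It assumes the term means the average number of days elapsed since Start among the processes running on a given day, with End dates inclusive. It works through a single day, 2020-01-08, in base R. The names `day`, `is_running`, and `elapsed` are illustrative, not part of the original post.

```r
# Sample data from the question, with dates parsed as day-month-year
start <- as.Date(c("01-01-2020", "02-01-2020", "03-01-2020", "08-01-2020", "13-01-2020"),
                 format = "%d-%m-%Y")
end <- as.Date(c("10-01-2020", "09-01-2020", "04-01-2020", "17-01-2020", "19-01-2020"),
               format = "%d-%m-%Y")

day <- as.Date("2020-01-08")

# Processes running on `day` (assumption: the End date is inclusive)
is_running <- start <= day & day <= end

# Days elapsed since each running process started; 0 on the Start date itself
elapsed <- as.numeric(day - start[is_running])
elapsed        # 7 6 0  (P001, P002, P004)
mean(elapsed)  # 4.333333
```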

A tidyverse approach would be preferred, please.

Here is one approach:

library(tidyverse)

df %>%
  #Convert to date
  mutate(across(c(Start, End), lubridate::dmy),
  #Create a sequence of dates from start to end
        Dates = map2(Start, End, seq, by = 'day')) %>%
  #Get data in long format
  unnest(Dates) %>%
  #Remove columns
  select(-Start, -End) %>%
  #For each process
  group_by(Process) %>%
  #Count number of days spent on it
  mutate(days_spent = row_number() - 1) %>%
  #For each date
  group_by(Dates) %>%
  #Count number of process running and average days
  summarise(process = n(), 
            mean_days = mean(days_spent))

This returns:

#   Dates      process mean_days
#   <date>       <int>     <dbl>
# 1 2020-01-01       1      0   
# 2 2020-01-02       2      0.5 
# 3 2020-01-03       3      1   
# 4 2020-01-04       3      2   
# 5 2020-01-05       2      3.5 
# 6 2020-01-06       2      4.5 
# 7 2020-01-07       2      5.5 
# 8 2020-01-08       3      4.33
# 9 2020-01-09       3      5.33
#10 2020-01-10       2      5.5 
#11 2020-01-11       1      3   
#12 2020-01-12       1      4   
#13 2020-01-13       2      2.5 
#14 2020-01-14       2      3.5 
#15 2020-01-15       2      4.5 
#16 2020-01-16       2      5.5 
#17 2020-01-17       2      6.5 
#18 2020-01-18       1      5   
#19 2020-01-19       1      6
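
A possible variant, sketched under the same assumptions (End dates inclusive, a process has 0 days on its Start date), computes the elapsed days directly from the dates instead of from `row_number()`. This avoids relying on the rows of each process being in date order. The object name `result` is introduced here for illustration.

```r
library(dplyr)
library(tidyr)
library(purrr)
library(lubridate)

# Sample data from the question
df <- data.frame(
  Process = c("P001", "P002", "P003", "P004", "P005"),
  Start = c("01-01-2020", "02-01-2020", "03-01-2020", "08-01-2020", "13-01-2020"),
  End = c("10-01-2020", "09-01-2020", "04-01-2020", "17-01-2020", "19-01-2020")
)

result <- df %>%
  # Parse day-month-year strings and build one date sequence per process
  mutate(across(c(Start, End), dmy),
         Dates = map2(Start, End, seq, by = "day")) %>%
  # One row per process per running day
  unnest(Dates) %>%
  # Days elapsed since the process started, taken from the dates themselves
  mutate(days_spent = as.numeric(Dates - Start)) %>%
  # For each date: number of running processes and their mean elapsed days
  group_by(Dates) %>%
  summarise(process = n(),
            mean_days = mean(days_spent))

result
```

On the sample data this should give the same `process` and `mean_days` columns as the output above.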