How to find a percentile and then group in R

I have a data frame (df) like the one shown below.

day  area   hour  time  count
___  ____  _____  ___   ____
 1    1      0     1     10
 1    1      0     2     12
 1    1      0     3     8
 1    1      0     4     12    
 1    1      0     5     15  
 1    1      0     6     18 
 1    1      1     1     10
 1    1      1     2     12
 1    1      1     3     8
 1    1      1     4     12    
 1    1      1     5     15  
 1    1      1     6     18
 1    1      1     7     12    
 1    1      1     8     15  
 1    1      1     9     18
 1    1      2     1     10    
 1    1      2     2     18  
 1    1      2     3     19
 .....
 2    1      0     1     18
 2    1      0     2     12
 2    1      0     3     18
 2    1      0     4     12    
 2    1      1     1     8
 2    1      1     2     12
 2    1      1     3     18
 2    1      1     4     10    
 2    1      1     5     15  
 2    1      1     6     18
 2    1      1     7     12    
 2    1      1     8     15  
 2    1      1     9     18
 2    1      2     1     10    
 2    1      2     2     18  
 2    1      2     3     19
 2    1      2     4     9    
 2    1      2     5     18  
 2    1      2     6     9


..... 
 30    99      23     1     9    
 30    99      23     2     8  
 30    99      23     3     9
 30    99      23     4     19    
 30    99      23     5     18  
 30    99      23     6     9
 30    99      23     7     19    
 30    99      23     8     8  
 30    99      23     9     19

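Here I have 87 areas (1 to 82, then 90, 93, 95, 97, 99) and 24 hours (0 to 23) for each day. The data describes how long vehicles took to cross an area and how many vehicles crossed it in that time.

For example: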

day  area   hour  time  count
___  ____  _____  ___   ____
 1    1      0     1     10
 1    1      0     2     12
 1    1      0     3     8
 1    1      0     4     12    
 1    1      0     5     15  
 1    1      0     6     18 

This gives me the following for vehicles crossing area 1 during hour 0 of day 1:

time  count   cumulative_count
___    ___    ________________
 1     10           10
 2     12           22
 3     8            30
 4     12           42    
 5     15           57
 6     18           75 
10 vehicles crossed the area in 1 minute.
12 vehicles crossed the area in 2 minutes.
8 vehicles crossed the area in 3 minutes.
12 vehicles crossed the area in 4 minutes.
15 vehicles crossed the area in 5 minutes.
18 vehicles crossed the area in 6 minutes.
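
The cumulative column above can be reproduced with cumsum(). This is only a minimal sketch for the single hour shown: the six counts are typed in by hand, and the names one_hour and target are made up for illustration.

```r
# Counts for day 1, area 1, hour 0, copied from the table above
one_hour <- data.frame(time = 1:6, count = c(10, 12, 8, 12, 15, 18))

# Running total of vehicles that needed at most `time` minutes
one_hour$cumulative_count <- cumsum(one_hour$count)

# Number of vehicles that corresponds to 80% of this hour's total
target <- 0.8 * sum(one_hour$count)

print(one_hour)
print(target)
```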

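From this I want to calculate how long it took 80% of the vehicles to cross area 1 during hour 0 of day 1. The total number of vehicles is 10+12+8+12+15+18 = 75, and 80% of 75 is 60. The cumulative count reaches 57 at 5 minutes and 75 at 6 minutes, so the time for 80% of the vehicles (60) to pass lies between 5 and 6 (closer to 5). The result would look like this: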

     day  area   hour    time_taken_for_80%vehicles_to_pass
     ___  ____   ____    ___________________________________
     1    1      0                5.33(approximately)
     1    1      1                7.30
     1    1      2                2.16
    ....
     30   1      23               3.13
     1    2      0                ---
     1    2      1                ---
     1    2      2                ---
     1    2      3                ---

 .......

     30    99     21              ---
     30    99     22              ---
     30    99     23              ---

I know I have to take a quantile and then group by area, day, and hour, so I tried:

library(dplyr)
grp <- group_by(df, day,area,hour,quantile(df$count,0.8))

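But it does not work. Any help is appreciated.

My solution computes, for each time, the cumulative share of vehicles that have crossed the area, and then takes the first time at which that share is above 80%: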

str <- 'day  area   hour  time  count
1    1      0     1     10
1    1      0     2     12
1    1      0     3     8
1    1      0     4     12    
1    1      0     5     15  
1    1      0     6     18
1    1      1     1     10
1    1      1     2     12
1    1      1     3     8
1    1      1     4     12    
1    1      1     5     15  
1    1      1     6     18
1    1      1     7     12    
1    1      1     8     15  
1    1      1     9     18
1    1      2     1     10    
1    1      2     2     18  
1    1      2     3     19'



file <- textConnection(str)
df <- read.table(file, header = TRUE)

df

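library(dplyr)
df %>% group_by(day, area, hour) %>%
  mutate(cumcount = cumsum(count),         # running total within each day/area/hour
         p = cumcount/max(cumcount)) %>%   # cumulative share of that hour's vehicles
  filter(p > 0.8) %>%                      # keep only times past the 80% mark
  summarise(time = min(time))              # earliest such time per group

Result: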

    day  area  hour  time
  <int> <int> <int> <int>
1     1     1     0     6
2     1     1     1     8
3     1     1     2     3
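
One detail worth checking is the boundary: with `p > 0.8`, a time at which the cumulative share is exactly 80% is skipped. The sketch below uses a made-up single group (toy) and a hypothetical helper (first_time) to compare the strict and inclusive comparisons. It assumes dplyr 1.0 or later for the `.groups` argument.

```r
library(dplyr)

# Hypothetical single group: cumulative shares are 0.4, 0.8, 1.0
toy <- data.frame(day = 1, area = 1, hour = 0,
                  time = 1:3, count = c(4, 4, 2))

# Hypothetical helper: first time whose cumulative share satisfies `reached`
first_time <- function(data, reached) {
  data %>%
    group_by(day, area, hour) %>%
    mutate(p = cumsum(count) / sum(count)) %>%
    filter(reached(p)) %>%
    summarise(time = min(time), .groups = "drop")
}

strict    <- first_time(toy, function(p) p > 0.8)   # skips the exact 80% row
inclusive <- first_time(toy, function(p) p >= 0.8)  # keeps the exact 80% row

print(strict)
print(inclusive)
```

On the sample data in this answer no group hits 80% exactly, so both comparisons give the same table.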

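Or linearly interpolate the time at which 80% is reached:

df %>% group_by(day, area, hour) %>%
  mutate(cumcount = cumsum(count),
         p = cumcount/max(cumcount),
         g = +(p > 0.8)) %>%                 # 0 = at or below 80%, 1 = above 80%
  group_by(day, area, hour, g) %>%
  # keep the last row at or below 80% (g = 0) and the first row above it (g = 1)
  filter(row_number((g*2-1)*time) == 1) %>%
  group_by(day, area, hour) %>%
  # interpolate between the two bracketing rows (assumes a time step of 1)
  summarise(time = min(time) + (0.8-min(p))/(max(p)-min(p)))

Result: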

    day  area  hour     time
  <int> <int> <int>    <dbl>
1     1     1     0 5.166667
2     1     1     1 7.600000
3     1     1     2 2.505263
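
The same interpolation can also be sketched with base R's approx(), which interpolates linearly between known points. This is an alternative, not the method used above. It assumes dplyr 1.0 or later (for `.groups`), at least two rows per group, and strictly positive counts so that the cumulative shares are strictly increasing. With the default settings approx() returns NA when 80% is already reached at the first time of a group. The name res is only illustrative.

```r
library(dplyr)

# Same sample rows as in the answer above
df <- read.table(header = TRUE, text = "
day area hour time count
1 1 0 1 10
1 1 0 2 12
1 1 0 3 8
1 1 0 4 12
1 1 0 5 15
1 1 0 6 18
1 1 1 1 10
1 1 1 2 12
1 1 1 3 8
1 1 1 4 12
1 1 1 5 15
1 1 1 6 18
1 1 1 7 12
1 1 1 8 15
1 1 1 9 18
1 1 2 1 10
1 1 2 2 18
1 1 2 3 19
")

res <- df %>%
  group_by(day, area, hour) %>%
  arrange(time, .by_group = TRUE) %>%   # cumulative sums need rows ordered by time
  summarise(
    # x = cumulative share, y = time; read off the time where the share is 0.8
    time = approx(x = cumsum(count) / sum(count), y = time, xout = 0.8)$y,
    .groups = "drop"
  )

print(res)
```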

Or use lag() and lead(), which gives the same result:

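df %>% group_by(day, area, hour) %>%
  arrange(time, .by_group = TRUE) %>%   # lag()/lead() need rows ordered by time within each group
  mutate(cumcount = cumsum(count),
         p = cumcount/max(cumcount)) %>%
  # keep the two rows that bracket the 80% mark
  filter((p >= 0.8 & lag(p) < 0.8) | (p < 0.8 & lead(p) >= 0.8)) %>%
  summarise(time = min(time) + (0.8-min(p))/(max(p)-min(p)))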