How to find a percentile and then group in R
I have a data frame (df) like the one shown below.
day area hour time count
___ ____ _____ ___ ____
1 1 0 1 10
1 1 0 2 12
1 1 0 3 8
1 1 0 4 12
1 1 0 5 15
1 1 0 6 18
1 1 1 1 10
1 1 1 2 12
1 1 1 3 8
1 1 1 4 12
1 1 1 5 15
1 1 1 6 18
1 1 1 7 12
1 1 1 8 15
1 1 1 9 18
1 1 2 1 10
1 1 2 2 18
1 1 2 3 19
.....
2 1 0 1 18
2 1 0 2 12
2 1 0 3 18
2 1 0 4 12
2 1 1 1 8
2 1 1 2 12
2 1 1 3 18
2 1 1 4 10
2 1 1 5 15
2 1 1 6 18
2 1 1 7 12
2 1 1 8 15
2 1 1 9 18
2 1 2 1 10
2 1 2 2 18
2 1 2 3 19
2 1 2 4 9
2 1 2 5 18
2 1 2 6 9
.....
30 99 23 1 9
30 99 23 2 8
30 99 23 3 9
30 99 23 4 19
30 99 23 5 18
30 99 23 6 9
30 99 23 7 19
30 99 23 8 8
30 99 23 9 19
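Here I have 87 areas (1 to 82, and then 90, 93, 95, 97, 99) and 24 hours (0 to 23) for each day. The data records the time taken to cross the area and how many vehicles crossed in that time.
For example: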
day area hour time count
___ ____ _____ ___ ____
1 1 0 1 10
1 1 0 2 12
1 1 0 3 8
1 1 0 4 12
1 1 0 5 15
1 1 0 6 18
This gives me, for crossing area 1 at hour 0 on day 1:
time count cumulative_count
___ ___ ________________
1 10 10
2 12 22
3 8 30
4 12 42
5 15 57
6 18 75
10 vehicles crossed the area in 1 minute.
12 vehicles crossed the area in 2 minutes.
8 vehicles crossed the area in 3 minutes.
12 vehicles crossed the area in 4 minutes.
15 vehicles crossed the area in 5 minutes.
18 vehicles crossed the area in 6 minutes.
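From this I want to calculate how much time 80% of the vehicles took to cross area 1 at hour 0 on day 1. The total number of vehicles is (10+12+8+12+15+18) = 75, and 80% of 75 is 60. So the time taken by 80% of the vehicles (60 of 75) to cross area 1 at hour 0 on day 1 lies between 5 and 6 (closer to 5). So the result would look like this: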
day area hour time_taken_for_80%vehicles_to_pass
___ ____ ____ ___________________________________
1 1 0 5.33(approximately)
1 1 1 7.30
1 1 2 2.16
....
30 1 23 3.13
1 2 0 ---
1 2 1 ---
1 2 2 ---
1 2 3 ---
.......
30 99 21 ---
30 99 22 ---
30 99 23 ---
I know I have to take a quantile and then group by area, day, and hour. So I tried:
library(dplyr)
grp <- group_by(df, day,area,hour,quantile(df$count,0.8))
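But it doesn't work. Any help is appreciated.
My solution computes, for each time, the percentage of vehicles that have crossed the area by then, and then takes the first time at which that percentage is above 80%: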
str <- 'day area hour time count
1 1 0 1 10
1 1 0 2 12
1 1 0 3 8
1 1 0 4 12
1 1 0 5 15
1 1 0 6 18
1 1 1 1 10
1 1 1 2 12
1 1 1 3 8
1 1 1 4 12
1 1 1 5 15
1 1 1 6 18
1 1 1 7 12
1 1 1 8 15
1 1 1 9 18
1 1 2 1 10
1 1 2 2 18
1 1 2 3 19'
file <- textConnection(str)
df <- read.table(file, header = TRUE)
df
library(dplyr)
# Cumulative share of vehicles per (day, area, hour), then the first time above 80%
df %>% group_by(day, area, hour) %>%
  mutate(cumcount = cumsum(count),
         p = cumcount/max(cumcount)) %>%
  filter(p > 0.8) %>%
  summarise(time = min(time))
Result:
day area hour time
<int> <int> <int> <int>
1 1 1 0 6
2 1 1 1 8
3 1 1 2 3
Or make a linear estimate of the time at which 80% is reached:
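df %>% group_by(day, area, hour) %>%
  mutate(cumcount = cumsum(count),
         p = cumcount/max(cumcount),
         g = +(p > 0.8),              # 1 once the cumulative share exceeds 80%, else 0
         order = (g*2-1)*time) %>%    # -time at or below the threshold, +time above it
  group_by(day, area, hour, g) %>%
  filter(row_number(order)==1) %>%    # last row at or below 80% and first row above it
  group_by(day, area, hour) %>%
  # interpolate between the two bracketing rows (assumes time steps of 1)
  summarise(time = min(time)+(0.8-min(p))/(max(p)-min(p)))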
Result:
day area hour time
<int> <int> <int> <dbl>
1 1 1 0 5.166667
2 1 1 1 7.600000
3 1 1 2 2.505263
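To make the interpolation concrete, here is a minimal base R sketch that reproduces the first row of the result (day 1, area 1, hour 0) by hand. The names `lo`, `hi` and `t80` are only illustrative. Multiplying by the gap between the two bracketing times is an addition here; it makes no difference for this data, where the time steps are 1.

```r
# Counts for day 1, area 1, hour 0, taken from the sample data
count <- c(10, 12, 8, 12, 15, 18)
time  <- 1:6

# Cumulative share of vehicles that have crossed by each time
p <- cumsum(count) / sum(count)   # 0.133 0.293 0.400 0.560 0.760 1.000

lo <- max(which(p <= 0.8))        # last row at or below 80%  (time 5, p = 0.76)
hi <- min(which(p > 0.8))         # first row above 80%       (time 6, p = 1.00)

# Linear interpolation between the two bracketing rows
t80 <- time[lo] + (0.8 - p[lo]) / (p[hi] - p[lo]) * (time[hi] - time[lo])
t80
```

This gives 5 + 0.04/0.24, roughly 5.17, which agrees with the 5.166667 in the result table above.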
Or use lag and lead to get the same result:
df %>% group_by(day, area, hour) %>%
  arrange(time, .by_group = TRUE) %>%   # order by time within each group
  mutate(cumcount = cumsum(count),
         p = cumcount/max(cumcount)) %>%
  filter((p >= 0.8 & lag(p) < 0.8) | (p < 0.8 & lead(p) >= 0.8)) %>%
  summarise(time = min(time)+(0.8-min(p))/(max(p)-min(p)))
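As a further option, and only as a sketch rather than part of the solutions above, the same linear estimate can probably be had from base R's `approx()`, which interpolates `time` as a function of the cumulative share. It assumes every count is positive, so that the share is strictly increasing within a group. It returns NA for a group whose first time already exceeds 80%. The column name `time80` and the rebuilt sample data frame are illustrative.

```r
library(dplyr)

# Same sample as above, rebuilt so the block runs on its own
df <- data.frame(
  day   = 1,
  area  = 1,
  hour  = rep(0:2, times = c(6, 9, 3)),
  time  = c(1:6, 1:9, 1:3),
  count = c(10, 12, 8, 12, 15, 18,
            10, 12, 8, 12, 15, 18, 12, 15, 18,
            10, 18, 19)
)

res <- df %>%
  group_by(day, area, hour) %>%
  arrange(time, .by_group = TRUE) %>%
  # interpolate the time at which the cumulative share reaches 0.8
  summarise(time80 = approx(x = cumsum(count) / sum(count),
                            y = time, xout = 0.8)$y) %>%
  ungroup()

res
```

On the sample data this should reproduce the interpolated times shown earlier (about 5.17, 7.60 and 2.51), and it does not depend on the time steps being exactly 1.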