How to remove rows by date based on quantile?
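My problem is as follows: I want to remove the rows of a data frame whose value is below the 50th percentile, where the percentile is computed separately for each date. The example below illustrates the problem.
I have the following data frame: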
date <- c("01.02.2011","01.02.2011","01.02.2011","01.02.2011","01.02.2011","01.02.2011",
"01.02.2011","01.02.2011","01.02.2011","01.02.2011",
"02.02.2011","02.02.2011","02.02.2011","02.02.2011","02.02.2011","02.02.2011",
"02.02.2011","02.02.2011","02.02.2011","02.02.2011")
date <- as.Date(date, format="%d.%m.%Y")
ID <- c("A","B","C","D","E","F","G","H","I","J",
"A","B","C","D","E","F","G","H","I","J")
values <- as.numeric(c("1","8","2","3","5","13","2","4","1","16",
"4","2","12","16","8","1","7","11","2","10"))
df <- data.frame(ID, date, values)
It looks like this:
ID date values
1 A 2011-02-01 1
2 B 2011-02-01 8
3 C 2011-02-01 2
4 D 2011-02-01 3
5 E 2011-02-01 5
6 F 2011-02-01 13
7 G 2011-02-01 2
8 H 2011-02-01 4
9 I 2011-02-01 1
10 J 2011-02-01 16
11 A 2011-02-02 4
12 B 2011-02-02 2
13 C 2011-02-02 12
14 D 2011-02-02 16
15 E 2011-02-02 8
16 F 2011-02-02 1
17 G 2011-02-02 7
18 H 2011-02-02 11
19 I 2011-02-02 2
20 J 2011-02-02 10
For each date, I want to remove all rows whose value is below that date's 50th percentile, to obtain:
ID date values
2 B 2011-02-01 8
5 E 2011-02-01 5
6 F 2011-02-01 13
8 H 2011-02-01 4
10 J 2011-02-01 16
13 C 2011-02-02 12
14 D 2011-02-02 16
15 E 2011-02-02 8
18 H 2011-02-02 11
20 J 2011-02-02 10
Please let me know if my question needs any edits.
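There are several ways to do this. Here are some solutions, though more exist. They all follow the same idea: first compute the median (the 50th percentile) for each date, then filter the data against it.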
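As a minimal sketch of that idea without any extra packages, the per-date median can be computed in base R with ave(), which returns one value per row. The names cutoff and res are only illustrative, and the data is rebuilt here so the block runs on its own.

```r
# Rebuild the example data from the question
date <- as.Date(rep(c("01.02.2011", "02.02.2011"), each = 10),
                format = "%d.%m.%Y")
ID <- rep(LETTERS[1:10], times = 2)
values <- c(1, 8, 2, 3, 5, 13, 2, 4, 1, 16,
            4, 2, 12, 16, 8, 1, 7, 11, 2, 10)
df <- data.frame(ID, date, values)

# Step 1: for every row, look up the median of that row's date group
cutoff <- ave(df$values, df$date,
              FUN = function(x) quantile(x, probs = 0.5))

# Step 2: keep rows strictly above the per-date median
# (use >= instead of > to also keep rows equal to the median)
res <- df[df$values > cutoff, ]
res
```

With this data the per-date medians are 3.5 and 7.5, so no value ties with the cutoff and `>` and `>=` keep the same 10 rows. On other data the two differ, and "below the 50th percentile" taken literally would call for `>=`.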
data.table
If you want to use data.table, first use := to update your data by reference, then filter. data.table is a very efficient option if your data set is large.
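library(data.table)
setDT(df)  # convert df to a data.table by reference
# add the per-date median as a helper column
df[, quant := quantile(values, probs = .5), by = "date"]
# keep rows strictly above the per-date median
df2 <- df[values > quant]
# drop the helper column from the result
df2[, 'quant' := NULL]
df2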
ID date values
1: B 2011-02-01 8
2: E 2011-02-01 5
3: F 2011-02-01 13
4: H 2011-02-01 4
5: J 2011-02-01 16
6: C 2011-02-02 12
7: D 2011-02-02 16
8: E 2011-02-02 8
9: H 2011-02-02 11
10: J 2011-02-02 10
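If you would rather not add a helper column to df, one possible variant filters inside each group through .SD. This is a sketch under the assumption that the data.table package is installed; dt and res are illustrative names, and the data is rebuilt so the block is self-contained.

```r
library(data.table)

# Rebuild the example data directly as a data.table
dt <- data.table(
  ID = rep(LETTERS[1:10], times = 2),
  date = as.Date(rep(c("01.02.2011", "02.02.2011"), each = 10),
                 format = "%d.%m.%Y"),
  values = c(1, 8, 2, 3, 5, 13, 2, 4, 1, 16,
             4, 2, 12, 16, 8, 1, 7, 11, 2, 10)
)

# .SD holds the rows of the current date group; subset it against
# that group's median. No temporary column is created.
res <- dt[, .SD[values > quantile(values, probs = 0.5)], by = date]
res
```

Note that the grouping column date comes first in the result. Subsetting .SD per group may be slower than the := approach above when there are many groups, so the two-step version is probably the safer choice for large data.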
dplyr
With dplyr, you can pipe your operations: compute the quantile by group, then filter.
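library(dplyr)
df %>%
  group_by(date) %>%                        # one group per date
  mutate(quant = quantile(values, .5)) %>%  # per-date median as a helper column
  filter(values > quant) %>%                # keep rows strictly above it
  select(-quant)                            # drop the helper column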
# Groups: date [2]
ID date values
<fct> <date> <dbl>
1 B 2011-02-01 8
2 E 2011-02-01 5
3 F 2011-02-01 13
4 H 2011-02-01 4
5 J 2011-02-01 16
6 C 2011-02-02 12
7 D 2011-02-02 16
8 E 2011-02-02 8
9 H 2011-02-02 11
10 J 2011-02-02 10
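The helper column is optional here as well: filter() on a grouped data frame evaluates its condition within each group, so the quantile can be computed inside the filter. A minimal sketch, assuming dplyr is installed (res is an illustrative name, and the data is rebuilt so the block runs on its own):

```r
library(dplyr)

# Rebuild the example data from the question
df <- data.frame(
  ID = rep(LETTERS[1:10], times = 2),
  date = as.Date(rep(c("01.02.2011", "02.02.2011"), each = 10),
                 format = "%d.%m.%Y"),
  values = c(1, 8, 2, 3, 5, 13, 2, 4, 1, 16,
             4, 2, 12, 16, 8, 1, 7, 11, 2, 10)
)

res <- df %>%
  group_by(date) %>%
  # the quantile is evaluated per date because the data is grouped
  filter(values > quantile(values, probs = 0.5)) %>%
  # drop the grouping so later operations are not applied per date
  ungroup()
res
```

The ungroup() call is a design choice rather than a requirement: without it the result stays grouped by date, which is what the "Groups: date [2]" line in the output above indicates.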