How to remove rows by date based on quantile?

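My problem is as follows: I want to remove the rows of a data frame whose value falls below the 50th percentile, where the percentile is computed separately for each date. The example below illustrates the problem.

I have the following data frame: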

date <- c("01.02.2011","01.02.2011","01.02.2011","01.02.2011","01.02.2011","01.02.2011",
          "01.02.2011","01.02.2011","01.02.2011","01.02.2011",
          "02.02.2011","02.02.2011","02.02.2011","02.02.2011","02.02.2011","02.02.2011",
          "02.02.2011","02.02.2011","02.02.2011","02.02.2011")
date <- as.Date(date, format="%d.%m.%Y")
ID <- c("A","B","C","D","E","F","G","H","I","J",
        "A","B","C","D","E","F","G","H","I","J")
values <- as.numeric(c("1","8","2","3","5","13","2","4","1","16",
                       "4","2","12","16","8","1","7","11","2","10"))

df <- data.frame(ID, date, values)

It looks like this:

   ID       date values
1   A 2011-02-01      1
2   B 2011-02-01      8
3   C 2011-02-01      2
4   D 2011-02-01      3
5   E 2011-02-01      5
6   F 2011-02-01     13
7   G 2011-02-01      2
8   H 2011-02-01      4
9   I 2011-02-01      1
10  J 2011-02-01     16
11  A 2011-02-02      4
12  B 2011-02-02      2
13  C 2011-02-02     12
14  D 2011-02-02     16
15  E 2011-02-02      8
16  F 2011-02-02      1
17  G 2011-02-02      7
18  H 2011-02-02     11
19  I 2011-02-02      2
20  J 2011-02-02     10

For each date, I want to remove every row whose value is below that date's 50th percentile, to obtain:

   ID       date values
2   B 2011-02-01      8
5   E 2011-02-01      5
6   F 2011-02-01     13
8   H 2011-02-01      4
10  J 2011-02-01     16
13  C 2011-02-02     12
14  D 2011-02-02     16
15  E 2011-02-02      8
18  H 2011-02-02     11
20  J 2011-02-02     10

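If my question needs any edits, please let me know.

There are several ways to do this. Here are a few solutions, though there are more. They all follow the same idea: first compute the median (the 50th percentile) per date, then filter the data against it.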
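The same idea can also be sketched in base R without extra packages. This is only a minimal illustration, assuming the data frame from the question; the names `med` and `result` are made up for the example. Note that the strict `>` also drops rows exactly equal to the median, while `>=` would keep them. In this example no value equals either median (3.5 and 7.5), so both give the same result.

```r
# Rebuild the example data from the question.
df <- data.frame(
  ID     = rep(LETTERS[1:10], times = 2),
  date   = as.Date(rep(c("2011-02-01", "2011-02-02"), each = 10)),
  values = c(1, 8, 2, 3, 5, 13, 2, 4, 1, 16,
             4, 2, 12, 16, 8, 1, 7, 11, 2, 10)
)

# ave() applies median() within each date group and returns a vector
# aligned with the rows of df (hypothetical helper name: med).
med <- ave(df$values, df$date, FUN = median)

# Keep rows strictly above their date's median; use >= to keep ties as well.
result <- df[df$values > med, ]
result
```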

data.table

If you want to use data.table, first use := to update your data by reference, then filter. data.table is a very efficient option if your dataset is large.

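library(data.table)
setDT(df)  # convert df to a data.table by reference

# add the per-date median as a temporary column
df[, quant := quantile(values, probs = 0.5), by = "date"]
# keep rows strictly above their date's median
df2 <- df[values > quant]
# drop the helper column from the result
df2[, quant := NULL]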

df2
    ID       date values
 1:  B 2011-02-01      8
 2:  E 2011-02-01      5
 3:  F 2011-02-01     13
 4:  H 2011-02-01      4
 5:  J 2011-02-01     16
 6:  C 2011-02-02     12
 7:  D 2011-02-02     16
 8:  E 2011-02-02      8
 9:  H 2011-02-02     11
10:  J 2011-02-02     10
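If you would rather not add a temporary column, one possible variant collects the row numbers to keep with `.I` and subsets once. This is a sketch under the assumption that the data matches the question's example; `dt`, `keep`, and `result` are illustrative names.

```r
library(data.table)

# Rebuild the example data from the question as a data.table.
dt <- data.table(
  ID     = rep(LETTERS[1:10], times = 2),
  date   = as.Date(rep(c("2011-02-01", "2011-02-02"), each = 10)),
  values = c(1, 8, 2, 3, 5, 13, 2, 4, 1, 16,
             4, 2, 12, 16, 8, 1, 7, 11, 2, 10)
)

# .I holds the row numbers of dt for the current group, so this returns
# the rows whose value is strictly above their date's median.
keep <- dt[, .I[values > quantile(values, probs = 0.5)], by = date]$V1

# Subset once; dt itself is left unchanged.
result <- dt[keep]
result
```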

dplyr

With dplyr, you can chain the operations in a pipe: compute the quantile by group, then filter.

library(dplyr)
df %>%
   group_by(date) %>%                        # one group per date
   mutate(quant = quantile(values, .5)) %>%  # per-date median
   filter(values > quant) %>%                # keep rows strictly above it
   select(-quant)                            # drop the helper column

# Groups:   date [2]
   ID    date       values
   <fct> <date>      <dbl>
 1 B     2011-02-01      8
 2 E     2011-02-01      5
 3 F     2011-02-01     13
 4 H     2011-02-01      4
 5 J     2011-02-01     16
 6 C     2011-02-02     12
 7 D     2011-02-02     16
 8 E     2011-02-02      8
 9 H     2011-02-02     11
10 J     2011-02-02     10
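Because `filter()` is evaluated within each group of a grouped data frame, the helper column can be skipped. The following is a hedged sketch of that shorter form, assuming the question's data; `result` is an illustrative name, and the final `ungroup()` is an optional addition so later steps are not applied per date by accident.

```r
library(dplyr)

# Rebuild the example data from the question.
df <- data.frame(
  ID     = rep(LETTERS[1:10], times = 2),
  date   = as.Date(rep(c("2011-02-01", "2011-02-02"), each = 10)),
  values = c(1, 8, 2, 3, 5, 13, 2, 4, 1, 16,
             4, 2, 12, 16, 8, 1, 7, 11, 2, 10)
)

result <- df %>%
  group_by(date) %>%
  # quantile() is computed per date because the data is grouped
  filter(values > quantile(values, 0.5)) %>%
  # drop the grouping so the result behaves like a plain tibble
  ungroup()
result
```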