Filter on the first (min) date
My data looks roughly like this:
Snap Date ID Stage
1 2014-01-01 A1 One
2 2014-01-02 A1 One
3 2014-01-03 A1 One
4 2014-01-04 A1 Two
5 2014-01-05 A1 Two
6 2014-01-01 B9 One
7 2014-01-02 B9 One
8 2014-01-03 B9 Two
9 2014-01-04 B9 Three
How can I keep only the entries where Stage actually changed and drop everything in between?
Desired output:
Snap Date ID Stage
1 2014-01-01 A1 One
4 2014-01-04 A1 Two
6 2014-01-01 B9 One
8 2014-01-03 B9 Two
9 2014-01-04 B9 Three
Also, how would the solution change if there were several columns to filter on?
Snap Date ID Stage Colour
1 2014-01-01 A1 One Red
2 2014-01-02 A1 One Red
3 2014-01-03 A1 One Green
4 2014-01-04 A1 One Green
5 2014-01-05 A1 Two Green
6 2014-01-06 A1 Two Green
7 2014-01-07 A1 Two Blue
8 2014-01-08 A1 Two Blue
9 2014-01-09 A1 Three Blue
10 2014-01-10 A1 Three Blue
11 2014-01-11 A1 Four Blue
12 2014-01-12 A1 Four Blue
13 2014-01-13 A1 Four Blue
14 2014-01-14 A1 Four Blue
15 2014-01-15 A1 Four Blue
16 2014-01-04 B9 One Green
17 2014-01-05 B9 One Green
18 2014-01-06 B9 Two Green
19 2014-01-07 B9 Three Green
You can use data.table's unique function with its by argument, which you can adjust as needed.
For the original question:
library(data.table)
unique(setDT(df), by = c("ID", "Stage"))
# Snap Date ID Stage
# 1: 1 2014-01-01 A1 One
# 2: 4 2014-01-04 A1 Two
# 3: 6 2014-01-01 B9 One
# 4: 8 2014-01-03 B9 Two
# 5: 9 2014-01-04 B9 Three
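For anyone who wants to reproduce this without data.table, here is a minimal base R sketch of the same idea. The answer never shows how df is built, so the data frame below is reconstructed from the question, and the column names Snap and Date are an assumption. It swaps unique(..., by = ) for duplicated(), which likewise keeps the first row seen for each (ID, Stage) pair, so the data is sorted by date first.

```r
# Sample data reconstructed from the question.
# Assumption: the columns are named Snap and Date.
df <- data.frame(
  Snap  = 1:9,
  Date  = as.Date(c("2014-01-01", "2014-01-02", "2014-01-03",
                    "2014-01-04", "2014-01-05", "2014-01-01",
                    "2014-01-02", "2014-01-03", "2014-01-04")),
  ID    = c(rep("A1", 5), rep("B9", 4)),
  Stage = c("One", "One", "One", "Two", "Two",
            "One", "One", "Two", "Three"),
  stringsAsFactors = FALSE
)

# Sort by ID and Date so the first row of each group is the earliest one
df <- df[order(df$ID, df$Date), ]

# duplicated() flags every repeat of an (ID, Stage) pair after the first,
# so negating it keeps only the first row per pair
first_rows <- df[!duplicated(df[, c("ID", "Stage")]), ]
first_rows
```

Adding "Colour" to the column vector inside duplicated() would extend this to the multi-column case in the same way as the by argument.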
For Edit3: just add Colour to the by argument
unique(df, by = c("ID", "Stage", "Colour"))
# Snap Date ID Stage Colour
# 1: 1 2014-01-01 A1 One Red
# 2: 3 2014-01-03 A1 One Green
# 3: 5 2014-01-05 A1 Two Green
# 4: 7 2014-01-07 A1 Two Blue
# 5: 9 2014-01-09 A1 Three Blue
# 6: 11 2014-01-11 A1 Four Blue
# 7: 16 2014-01-04 B9 One Green
# 8: 18 2014-01-06 B9 Two Green
# 9: 19 2014-01-07 B9 Three Green
Another option is to use which.min (as you mentioned)
df[, .SD[which.min(Date)], .(ID, Stage, Colour)]
Or using dplyr
library(dplyr)
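distinct(df, ID, Stage, Colour, .keep_all = TRUE) # .keep_all = TRUE keeps the other columns; current dplyr versions drop them otherwise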
Another option with dplyr is:
DF %>%
mutate(Snap.Date = as.Date(Snap.Date)) %>% # make sure the dates are stored as Date values
group_by(ID, Stage, Colour) %>% # group the data
slice(which.min(Snap.Date)) # slice off only those rows with the (first) minimum date per group
#Source: local data frame [9 x 4]
#Groups: ID, Stage, Colour
#
# Snap.Date ID Stage Colour
#1 2014-01-11 A1 Four Blue
#2 2014-01-03 A1 One Green
#3 2014-01-01 A1 One Red
#4 2014-01-09 A1 Three Blue
#5 2014-01-07 A1 Two Blue
#6 2014-01-05 A1 Two Green
#7 2014-01-04 B9 One Green
#8 2014-01-07 B9 Three Green
#9 2014-01-06 B9 Two Green
This approach does not require the data to be sorted beforehand.
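One caveat, which rests on an assumption about the data rather than anything stated in the question: all of the grouping approaches above keep one row per distinct (ID, Stage) combination. If a stage can come back later (for example One, Two, then One again), the second run of One would be dropped. A rough base R sketch that instead keeps every row whose Stage differs from the previous row of the same ID could look like this. The data frame df2 and its values are made up for illustration.

```r
# Hypothetical data in which ID A1 returns to Stage "One" after "Two"
df2 <- data.frame(
  Date  = as.Date("2014-01-01") + 0:4,
  ID    = "A1",
  Stage = c("One", "One", "Two", "One", "One"),
  stringsAsFactors = FALSE
)

# Sort so that consecutive rows are consecutive dates within each ID
df2 <- df2[order(df2$ID, df2$Date), ]
n <- nrow(df2)

# A row is a change if it is the very first row, or if its ID or Stage
# differs from the row immediately before it
changed <- c(TRUE,
             df2$ID[-1] != df2$ID[-n] | df2$Stage[-1] != df2$Stage[-n])

changes <- df2[changed, ]
changes
```

If stages never repeat within an ID, as in the sample data, this gives the same rows as the grouped versions.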