Remove observations according to a pattern in R
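I have a data frame with observations about injuries in football. Unfortunately, for some injuries there are several teams to choose from. This is part of the data frame: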
df_x = data.frame(injury_id=c(250, 250, 100, 328, 328, 329, 329, 330, 330, 15, 5106, 5106, 5106),
player_id=c(109, 109, 39728, 2374, 2374, 2374, 2374, 2374, 2374, 26, 59016, 59016, 59016),
season=c(2011, 2011, 2010, 2010, 2010, 2010, 2010, 2010, 2010, 2010, 2012, 2012, 2012),
inury_from=c("2011-09-13", "2011-09-13", "2011-03-03", "2011-04-21", "2011-04-21", "2010-11-23", "2010-11-23", "2010-10-01", "2010-10-01", "2011-02-24", "2012-09-16", "2012-09-16", "2012-09-16"),
injury_until=c("2011-09-27", "2011-09-27", "2011-03-17", "2011-08-31", "2011-08-31", "2011-03-14", "2011-03-14", "2010-11-22", "2010-11-22", "2011-02-28", "2012-10-28", "2012-10-28", "2012-10-28"),
team_id=c(1, 2, 3, 4, 5, 4, 5, 4, 5, 6, 7, 8, 9),
member_since=c("1998-07-01", NA, "2009-07-01", "2008-07-01", NA, "2008-07-01", NA, "2008-07-01", NA, "2002-07-01", "2012-07-01", "2013-01-01", "2011-07-01"))
My goal is to have only one row per injury_id. The result should be the following data frame:
df_result_x = data.frame(injury_id=c(250, 100, 328, 329, 330, 15, 5106),
player_id=c(109, 39728, 2374, 2374, 2374, 26, 59016),
season=c(2011, 2010, 2010, 2010, 2010, 2010, 2012),
inury_from=c("2011-09-13", "2011-03-03", "2011-04-21", "2010-11-23", "2010-10-01", "2011-02-24", "2012-09-16"),
injury_until=c("2011-09-27", "2011-03-17", "2011-08-31", "2011-03-14", "2010-11-22", "2011-02-28", "2012-10-28"),
team_id=c(1, 3, 4, 4, 4, 6, 7),
member_since=c("1998-07-01", "2009-07-01", "2008-07-01", "2008-07-01", "2008-07-01", "2002-07-01", "2012-07-01"))
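The algorithm for selecting among observations that share an injury_id:
- Remove rows that have NA in member_since.
- Remove all rows where member_since is later than injury_until.
- If duplicate observations still remain, select the observation with the later date in member_since.
Can I do this with a pipe, or do I have to use a loop?
Thanks.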
Update, 10 November 2020:
df_x2 = data.frame(injury_id=c(250, 250, 100, 328, 328, 329, 329, 330, 330, 15, 5106, 5106, 5106),
player_id=c(109, 109, 39728, 2374, 2374, 2374, 2374, 2374, 2374, 26, 59016, 59016, 59016),
season=c(2011, 2011, 2010, 2010, 2010, 2010, 2010, 2010, 2010, 2010, 2012, 2012, 2012),
inury_from=c("2011-09-13", "2011-09-13", "2011-03-03", "2011-04-21", "2011-04-21", "2010-11-23", "2010-11-23", "2010-10-01", "2010-10-01", "2011-02-24", "2012-09-16", "2012-09-16", "2012-09-16"),
injury_until=c("2011-09-27", "2011-09-27", "2011-03-17", "2011-08-31", "2011-08-31", "2011-03-14", "2011-03-14", "2010-11-22", "2010-11-22", "2011-02-28", "2012-10-28", "2012-10-28", "2012-10-28"),
team_id=c(1, 2, 3, 4, 5, 4, 5, 4, 5, 6, 8, 9, 7),
member_since=c("1998-07-01", NA, "2009-07-01", "2008-07-01", NA, "2008-07-01", NA, "2008-07-01", NA, "2002-07-01", "2013-01-01", "2011-07-01", "2012-12-31"))
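After grouping by 'injury_id', we can use slice to keep the first row of each group. This relies on the wanted row already being the first one for each 'injury_id'.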
library(dplyr)
df_x %>%
  group_by(injury_id) %>%
  slice(1) %>%
  ungroup()
Or with distinct
df_x %>%
distinct(injury_id, .keep_all = TRUE)
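Or, if the NA elements are not in the right order, do an arrange on 'injury_id', then on a logical vector based on the NA elements in 'member_since' (so that the NA rows come last), and then on the Date-converted 'member_since'. After that, use distinct to select the first unique row for each value of the 'injury_id' column. Because the dates are sorted in ascending order, this keeps the earliest non-NA 'member_since' per 'injury_id'.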
df_x %>%
arrange(injury_id, is.na(member_since), as.Date(member_since)) %>%
distinct(injury_id, .keep_all = TRUE)
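To make the effect of the sort order concrete, here is a minimal sketch on a small made-up data frame (`toy` and its values are hypothetical, not taken from the question). It compares the ascending sort used above with a descending sort via `desc()`, which would keep the latest 'member_since' instead.

```r
library(dplyr)

# Hypothetical toy data: three candidate teams for injury 1, one with NA
toy <- data.frame(
  injury_id    = c(1, 1, 1, 2),
  team_id      = c(10, 11, 12, 20),
  member_since = c(NA, "2009-07-01", "2008-07-01", "2002-07-01")
)

# Ascending sort: NA rows go last, the earliest member_since comes first
earliest <- toy %>%
  arrange(injury_id, is.na(member_since), as.Date(member_since)) %>%
  distinct(injury_id, .keep_all = TRUE)

# Descending sort on the date: the latest member_since comes first
latest <- toy %>%
  arrange(injury_id, is.na(member_since), desc(as.Date(member_since))) %>%
  distinct(injury_id, .keep_all = TRUE)

earliest$team_id  # 12 20
latest$team_id    # 11 20
```

Neither variant checks 'member_since' against 'injury_until', so on its own this only covers the NA rule and the tie-break.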
Update
Based on the comments
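df_x %>%
  # rule 1: drop rows without a membership date
  filter(!is.na(member_since)) %>%
  mutate(injury_until = as.Date(injury_until),
         member_since = as.Date(member_since)) %>%
  # days between joining the team and the end of the injury
  mutate(ind = injury_until - member_since) %>%
  group_by(injury_id) %>%
  # rules 2 and 3: among the non-negative gaps keep the smallest one,
  # i.e. the latest member_since that is not after injury_until
  filter(ind == min(ind[ind >= 0])) %>%
  select(-ind)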
Output
# A tibble: 7 x 7
# Groups:   injury_id [7]
#   injury_id player_id season inury_from injury_until team_id member_since
#       <dbl>     <dbl>  <dbl> <chr>      <date>         <dbl> <date>
#1        250       109   2011 2011-09-13 2011-09-27         1 1998-07-01
#2        100     39728   2010 2011-03-03 2011-03-17         3 2009-07-01
#3        328      2374   2010 2011-04-21 2011-08-31         4 2008-07-01
#4        329      2374   2010 2010-11-23 2011-03-14         4 2008-07-01
#5        330      2374   2010 2010-10-01 2010-11-22         4 2008-07-01
#6         15        26   2010 2011-02-24 2011-02-28         6 2002-07-01
#7       5106     59016   2012 2012-09-16 2012-10-28         7 2012-07-01
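As a cross-check, the three rules from the question can also be written out one step per rule. The sketch below is an alternative formulation, not the code from the answer: it assumes dplyr 1.0.0 or later for `slice_max()`, and `df_sub` is a hand-picked subset of `df_x2` (only injuries 250 and 5106, only the columns the rules need).

```r
library(dplyr)

# Subset of df_x2 from the question: injuries 250 and 5106 only
df_sub <- data.frame(
  injury_id    = c(250, 250, 5106, 5106, 5106),
  injury_until = c("2011-09-27", "2011-09-27",
                   "2012-10-28", "2012-10-28", "2012-10-28"),
  team_id      = c(1, 2, 8, 9, 7),
  member_since = c("1998-07-01", NA,
                   "2013-01-01", "2011-07-01", "2012-12-31")
)

res <- df_sub %>%
  mutate(injury_until = as.Date(injury_until),
         member_since = as.Date(member_since)) %>%
  filter(!is.na(member_since)) %>%              # rule 1: no NA
  filter(member_since <= injury_until) %>%      # rule 2: joined before the injury ended
  group_by(injury_id) %>%
  slice_max(member_since, n = 1, with_ties = FALSE) %>%  # rule 3: latest date
  ungroup()

res$team_id  # 1 9
```

For injury 5106 in `df_x2`, teams 8 and 7 are dropped by rule 2 because their 'member_since' dates fall after 'injury_until', which leaves team 9. Note that `slice_max()` returns the groups sorted by 'injury_id' rather than in the original row order.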