Equivalent of max and over(partition by) in R for flattening
My data looks like this:
ID time0 obs_num recorded_dt day0 day1 day2 day3 day4 day5 ... day31
1 2009-01-01 A 2009-01-01 A NULL NULL NULL NULL NULL ... NULL
1 2009-01-01 D 2009-01-31 NULL NULL NULL NULL NULL NULL ... D
1 2009-01-01 B 2009-01-05 NULL NULL NULL NULL NULL B ... NULL
2 2005-02-02 B 2005-02-03 NULL B NULL NULL NULL NULL ... NULL
The data can be reproduced with:
example = data.frame(
  ID = c(1,1,1,2),
  time0 = c('2009-01-01','2009-01-01','2009-01-01','2005-02-02'),
  obs_num = c('A','D','B','B'),
  recorded_dt = c('2009-01-01','2009-01-31','2009-01-05','2005-02-03')
)
library(tidyverse)
df <- example %>%
  mutate(difs_days = floor(difftime(recorded_dt, time0, units="days"))) %>%
  arrange(difs_days) %>%
  pivot_wider(names_from = difs_days, values_from = obs_num, names_prefix = 'day') %>%
  arrange(ID, recorded_dt)
df
# # A tibble: 4 × 7
# ID time0 recorded_dt day0 day1 day4 day30
# <dbl> <chr> <chr> <chr> <chr> <chr> <chr>
# 1 1 2009-01-01 2009-01-01 A NA NA NA
# 2 1 2009-01-01 2009-01-05 NA NA B NA
# 3 1 2009-01-01 2009-01-31 NA NA NA D
# 4 2 2005-02-02 2005-02-03 NA B NA NA
I would like to flatten the data to:
ID time0 day0 day1 day2 day3 day4 day5 ... day31
1 2009-01-01 A NULL NULL NULL NULL B ... D
2 2005-02-02 NULL B NULL NULL NULL NULL ... NULL
In SQL, I would use max(dayX) over(partition by ID) as XYZ and then keep only the distinct rows.
I assume there must be an efficient way to do this in R. Can you help?
You can use across() to summarise multiple columns:
df %>%
  group_by(ID, time0) %>%
  summarise(across(day0:day30, ~ if(all(is.na(.x))) NA else max(.x, na.rm = TRUE))) %>%
  ungroup()
# # A tibble: 2 × 6
# ID time0 day0 day1 day4 day30
# <dbl> <chr> <chr> <chr> <chr> <chr>
# 1 1 2009-01-01 A NA B D
# 2 2 2005-02-02 NA B NA NA
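If you want something closer to the SQL you describe, a window-style version is also possible. The following is only a sketch under some assumptions: `wide` is a hand-built stand-in for `df` without `recorded_dt`, and `max_or_na` is a hypothetical helper, not part of dplyr. `mutate()` plays the role of `max(dayX) over(partition by ID)` and `distinct()` then drops the repeated rows.

```r
library(dplyr)

# Stand-in for the wide `df` from the question, without recorded_dt
# (assumption: that column is dropped before deduplicating)
wide <- data.frame(
  ID    = c(1, 1, 1, 2),
  time0 = c("2009-01-01", "2009-01-01", "2009-01-01", "2005-02-02"),
  day0  = c("A", NA, NA, NA),
  day1  = c(NA, NA, NA, "B"),
  day4  = c(NA, "B", NA, NA),
  day30 = c(NA, NA, "D", NA)
)

# Hypothetical helper: max() that returns NA for an all-NA group
# instead of warning
max_or_na <- function(x) {
  if (all(is.na(x))) NA_character_ else max(x, na.rm = TRUE)
}

windowed <- wide %>%
  group_by(ID) %>%
  # like max(dayX) over(partition by ID): one value repeated on every row of the group
  mutate(across(starts_with("day"), max_or_na)) %>%
  ungroup() %>%
  # like SELECT DISTINCT: collapse the now-identical rows
  distinct()

windowed
```

The `summarise()` version above does the same thing in one step, so this form is mainly useful when the other columns have to be kept on every row before deduplicating.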
Update:
For the given example, we can use: summarise(across(everything(), ~trimws(paste(., collapse = ''))))
To replace "" with NA, add an na_if() step at the end of the pipeline:
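library(dplyr)
# Start from the wide df built in the question (example has no day columns)
df %>%
  select(-recorded_dt) %>%
  mutate(across(everything(), ~ifelse(is.na(.), "", .))) %>%
  group_by(ID, time0) %>%
  summarise(across(everything(), ~trimws(paste(., collapse = ''))), .groups = "drop") %>%
  # na_if() is applied per column; calling it on a whole data frame fails in dplyr >= 1.1.0
  mutate(across(where(is.character), ~na_if(., "")))
Output before the final na_if() step:
ID time0 day0 day1 day4 day30
<dbl> <chr> <chr> <chr> <chr> <chr>
1 1 2009-01-01 "A" "" "B" "D"
2 2 2005-02-02 "" "B" "" ""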
Using data.table:
library(data.table)
setDT(df)[
  , recorded_dt:=NULL][
  , lapply(.SD, \(x) sort(x, na.last = TRUE, decreasing = TRUE)[1])
  , by=.(ID, time0)]
## ID time0 day0 day1 day4 day30
## 1: 1 2009-01-01 A <NA> B D
## 2: 2 2005-02-02 <NA> B <NA> <NA>
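The special symbol .SD stands for the subset of the data.table that contains all columns except those listed in the by=... clause. That is why we have to delete the recorded_dt column first.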
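If you would rather not delete recorded_dt by reference, one alternative is to restrict .SD with .SDcols. This is a sketch rather than part of the answer above: `dt` is a hand-built copy of the wide data, and it assumes a data.table version that accepts patterns() in .SDcols.

```r
library(data.table)

# Hand-built copy of the wide data from the question, recorded_dt included
dt <- data.table(
  ID          = c(1, 1, 1, 2),
  time0       = c("2009-01-01", "2009-01-01", "2009-01-01", "2005-02-02"),
  recorded_dt = c("2009-01-01", "2009-01-05", "2009-01-31", "2005-02-03"),
  day0        = c("A", NA, NA, NA),
  day1        = c(NA, NA, NA, "B"),
  day4        = c(NA, "B", NA, NA),
  day30       = c(NA, NA, "D", NA)
)

# .SDcols limits .SD to the day* columns, so recorded_dt is ignored
# without being removed from dt
flat <- dt[
  , lapply(.SD, function(x) sort(x, na.last = TRUE, decreasing = TRUE)[1])
  , by = .(ID, time0)
  , .SDcols = patterns("^day")
]

flat
```

Here dt keeps all of its columns, which matters if the original table is still needed afterwards.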