Using pivot_wider or a similar function in R with repeated-measurement data
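I have a data frame of patient data with one row per chest X-ray. My columns include a specific measurement from the chest X-ray, the date of the chest X-ray, and then several other columns that are the same for a given patient (such as the final outcome).
For example: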
+--------+------------+----------+------------+-------------+-----+-------+---------+
| pat_id | index_date | cxr_date | delta_date | cxr_measure | age | admit | outcome |
+--------+------------+----------+------------+-------------+-----+-------+---------+
| 1      | 1/2/2020   | 1/2/2020 | 0          | 0.1         | 55  | 1     | 0       |
| 1      | 1/2/2020   | 1/3/2020 | 1          | 0.3         | 55  | 1     | 0       |
| 1      | 1/2/2020   | 1/3/2020 | 1          | 0.5         | 55  | 1     | 0       |
| 2      | 2/1/2020   | 2/2/2020 | 1          | 0.2         | 59  | 0     | 0       |
| 2      | 2/1/2020   | 2/3/2020 | 2          | 0.9         | 59  | 0     | 0       |
| 3      | 1/6/2020   | 1/6/2020 | 0          | 0.7         | 66  | 1     | 1       |
+--------+------------+----------+------------+-------------+-----+-------+---------+
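I would like to reshape the table so that there is one row per patient. I think my final table should look something like the one below, where each measurement becomes a column named cxr_measure_#, with # being the delta_date. In the real dataset I will have many of these columns (# ranges from -5 to +30). If there are two rows/values on the same delta_date, ideally I would take the mean.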
+--------+------------+----------------+---------------+---------------+---------------+-----+-------+---------+
| pat_id | index_date | first_cxr_date | cxr_measure_0 | cxr_measure_1 | cxr_measure_2 | age | admit | outcome |
+--------+------------+----------------+---------------+---------------+---------------+-----+-------+---------+
| 1      | 1/2/2020   | 1/2/2020       | 0.1           | 0.4           | NA            | 55  | 1     | 0       |
| 2      | 2/1/2020   | 2/2/2020       | NA            | 0.2           | 0.9           | 59  | 0     | 0       |
| 3      | 1/6/2020   | 1/6/2020       | 0.7           | NA            | NA            | 66  | 1     | 1       |
+--------+------------+----------------+---------------+---------------+---------------+-----+-------+---------+
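Is there an easy way to do this basic reshaping between the two tables? I have played around with pivot_longer and pivot_wider, but I am not sure how to (1) get delta_date into the variable names and (2) take the mean when there are two overlapping dates. I am also curious whether this would be easier to do in Python (I do most of my data management with pandas, but then do some additional data cleaning and analysis in R).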
Here is a hybrid approach: it uses pivot_wider to compute the mean of the cxr_measure values, and dplyr's summarize function to determine the first cxr_date.
df<- structure(list(pat_id = c(1L, 1L, 1L, 2L, 2L, 3L),
index_date = c("1/2/2020", "1/2/2020", "1/2/2020", "2/1/2020", "2/1/2020", "1/6/2020"),
cxr_date = c("1/2/2020", "1/3/2020", "1/3/2020", "2/2/2020", "2/3/2020", "1/6/2020"),
delta_date = c(0L, 1L, 1L, 1L, 2L, 0L),
cxr_measure = c(0.1, 0.3, 0.5, 0.2, 0.9, 0.7),
age = c(55L,55L, 55L, 59L, 59L, 66L),
admit = c(1L, 1L, 1L, 0L, 0L, 1L),
outcome = c(0L, 0L, 0L, 0L, 0L, 1L)), class = "data.frame", row.names = c(NA, -6L))
library(tidyr)
library(dplyr)
answer <-pivot_wider(df, id_cols = -c("delta_date", "cxr_measure", "cxr_date"),
names_from = "delta_date",
values_from = c("cxr_measure"),
values_fn = list(cxr_measure = mean),
names_glue ='cxr_measure_{delta_date}')
firstdate <-df %>% group_by(pat_id) %>% summarize(first_cxr_date=min(as.Date(cxr_date, "%m/%d/%Y")))
answer <- left_join(answer, firstdate)
Joining, by = "pat_id"
# A tibble: 3 x 9
pat_id index_date age admit outcome cxr_measure_0 cxr_measure_1 cxr_measure_2 first_cxr_date
<int> <chr> <int> <int> <int> <dbl> <dbl> <dbl> <date>
1 1 1/2/2020 55 1 0 0.1 0.4 NA 2020-01-02
2 2 2/1/2020 59 0 0 NA 0.2 0.9 2020-02-02
3 3 1/6/2020 66 1 1 0.7 NA NA 2020-01-06
I am sure there is a way to combine all of this into a single function call, but sometimes ugly is just faster.
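For readers who would rather see the averaging step spelled out, here is a minimal sketch that collapses same-day duplicates with summarise() before pivoting, so that no values_fn is needed. It rebuilds the question's example data so the block runs on its own; the names per_day and wide are made up for this illustration and are not part of the answer above.

```r
library(dplyr)
library(tidyr)

# Same example data as in the question, rebuilt so this block is self-contained
df <- data.frame(
  pat_id      = c(1L, 1L, 1L, 2L, 2L, 3L),
  index_date  = c("1/2/2020", "1/2/2020", "1/2/2020", "2/1/2020", "2/1/2020", "1/6/2020"),
  cxr_date    = c("1/2/2020", "1/3/2020", "1/3/2020", "2/2/2020", "2/3/2020", "1/6/2020"),
  delta_date  = c(0L, 1L, 1L, 1L, 2L, 0L),
  cxr_measure = c(0.1, 0.3, 0.5, 0.2, 0.9, 0.7),
  age         = c(55L, 55L, 55L, 59L, 59L, 66L),
  admit       = c(1L, 1L, 1L, 0L, 0L, 1L),
  outcome     = c(0L, 0L, 0L, 0L, 0L, 1L)
)

# Step 1: one row per patient and delta_date, averaging same-day measurements.
# per_day is a hypothetical name used only in this sketch.
per_day <- df %>%
  group_by(pat_id, index_date, age, admit, outcome, delta_date) %>%
  summarise(cxr_measure = mean(cxr_measure), .groups = "drop")

# Step 2: spread delta_date into columns; each cell now holds a single value,
# so pivot_wider has nothing left to aggregate
wide <- per_day %>%
  pivot_wider(names_from = delta_date,
              values_from = cxr_measure,
              names_prefix = "cxr_measure_")
```

This is more verbose than passing values_fn = mean, but it makes it easy to inspect per_day and confirm how many rows were merged before the reshape.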
To expand on @Dave2e's answer, you can use group_by and then min within each pat_id to get first_cxr_date, which lets you compose a concise functional solution.
library(tibble)
library(dplyr)
library(tidyr)
df <-
tribble(
~pat_id, ~index_date, ~cxr_date, ~delta_date, ~cxr_measure, ~age, ~admit, ~outcome,
1, '1/2/2020', '1/2/2020', 0, 0.1, 55, 1, 0,
1, '1/2/2020', '1/3/2020', 1, 0.3, 55, 1, 0,
1, '1/2/2020', '1/3/2020', 1, 0.5, 55, 1, 0,
2, '2/1/2020', '2/2/2020', 1, 0.2, 59, 0, 0,
2, '2/1/2020', '2/3/2020', 2, 0.9, 59, 0, 0,
3, '1/6/2020', '1/6/2020', 0, 0.7, 66, 1, 1)
df %>%
group_by(pat_id) %>%
# first_cxr_date = earliest X-ray per patient; parse as Date so min() is chronological rather than alphabetical
mutate(first_cxr_date = min(as.Date(cxr_date, "%m/%d/%Y"))) %>%
ungroup() %>%
pivot_wider(id_cols = -c(delta_date, cxr_measure, cxr_date)
, names_from = delta_date # column names from delta_date
, values_from = cxr_measure
, names_prefix = 'cxr_measure_' # prefix pasted onto the column names
, values_fn = mean # combine duplicates with mean
)
# A tibble: 3 x 9
pat_id index_date age admit outcome first_cxr_date cxr_measure_0 cxr_measure_1 cxr_measure_2
<dbl> <chr> <dbl> <dbl> <dbl> <date> <dbl> <dbl> <dbl>
1 1 1/2/2020 55 1 0 2020-01-02 0.1 0.4 NA
2 2 2/1/2020 59 0 0 2020-02-02 NA 0.2 0.9
3 3 1/6/2020 66 1 1 2020-01-06 0.7 NA NA
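One caveat worth checking whenever min() is applied to dates: if the dates are stored as m/d/Y text, as cxr_date is in the example data, min() compares them as strings. The small sketch below uses two made-up dates (not taken from the question) to show how the text comparison and the Date comparison can disagree.

```r
# Two made-up dates in the question's m/d/Y text format
dates_chr <- c("1/10/2020", "1/2/2020")

# Compared as text, "1/10/2020" sorts before "1/2/2020"
# because '1' < '2' at the third character
min_as_text <- min(dates_chr)

# Parsed as Date, the comparison is chronological
min_as_date <- min(as.Date(dates_chr, format = "%m/%d/%Y"))

print(min_as_text)  # "1/10/2020"
print(min_as_date)  # "2020-01-02"
```

The example data in this thread happens to give the same answer either way, so the difference would only show up on real data with two-digit days or months.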
Special thanks to dear Mr. @Onyambu, who taught me a valuable point today.
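You can also use the following solution. Note the use of .value, which is especially useful with pivot_longer when several column names have to be created from the data. Here it tells pivot_wider that part of the new name is actually the name of the column the values are taken from.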
library(dplyr)
library(tidyr)
df %>%
group_by(pat_id) %>%
mutate(id = row_number()) %>%
pivot_wider(names_from = delta_date, values_from = cxr_measure,
names_glue = "{.value}_{delta_date}") %>%
mutate(across(cxr_measure_0:cxr_measure_2, ~ mean(.x, na.rm = TRUE))) %>%
select(-id) %>%
slice_head(n = 1)
# A tibble: 3 x 9
# Groups: pat_id [3]
pat_id index_date cxr_date age admit outcome cxr_measure_0 cxr_measure_1 cxr_measure_2
<int> <chr> <chr> <int> <int> <int> <dbl> <dbl> <dbl>
1 1 1/2/2020 1/2/2020 55 1 0 0.1 0.4 NaN
2 2 2/1/2020 2/2/2020 59 0 0 NaN 0.2 0.9
3 3 1/6/2020 1/6/2020 66 1 1 0.7 NaN NaN