Using pivot_wider or a similar R function with repeated-measurement data

Question

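I have a data frame of patient data with one row per chest X-ray. My columns include a measurement specific to that chest X-ray, the date of the chest X-ray, and several other columns that are the same for a given patient (such as the final outcome).

For example: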

+--------+------------+----------+------------+-------------+-----+-------+---------+
| pat_id | index_date | cxr_date | delta_date | cxr_measure | age | admit | outcome |
+--------+------------+----------+------------+-------------+-----+-------+---------+
|      1 | 1/2/2020   | 1/2/2020 |          0 |         0.1 |  55 |     1 |       0 |
|      1 | 1/2/2020   | 1/3/2020 |          1 |         0.3 |  55 |     1 |       0 |
|      1 | 1/2/2020   | 1/3/2020 |          1 |         0.5 |  55 |     1 |       0 |
|      2 | 2/1/2020   | 2/2/2020 |          1 |         0.2 |  59 |     0 |       0 |
|      2 | 2/1/2020   | 2/3/2020 |          2 |         0.9 |  59 |     0 |       0 |
|      3 | 1/6/2020   | 1/6/2020 |          0 |         0.7 |  66 |     1 |       1 |
+--------+------------+----------+------------+-------------+-----+-------+---------+

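I want to reshape the table so that there is one row per patient. I think the final table should look something like the one below, where each measurement becomes a column named cxr_measure_#, with # being the delta_date. In the real dataset I would have many such columns (# ranges from -5 to +30). If there are two rows/values on the same delta_date, ideally I would take their mean.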

+--------+------------+----------------+---------------+---------------+---------------+-----+-------+---------+
| pat_id | index_date | first_cxr_date | cxr_measure_0 | cxr_measure_1 | cxr_measure_2 | age | admit | outcome |
+--------+------------+----------------+---------------+---------------+---------------+-----+-------+---------+
|      1 | 1/2/2020   | 1/2/2020       | 0.1           | 0.4           | NA            |  55 |     1 |       0 |
|      2 | 2/1/2020   | 2/2/2020       | NA            | 0.2           | 0.9           |  59 |     0 |       0 |
|      3 | 1/6/2020   | 1/6/2020       | 0.7           | NA            | NA            |  66 |     1 |       1 |
+--------+------------+----------------+---------------+---------------+---------------+-----+-------+---------+

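Is there a simple way to do this basic reshaping between the two tables? I have played with pivot_longer and pivot_wider, but I am not sure how to (1) get the delta_date into the variable names and (2) take the mean when two rows fall on the same date. I am also curious whether this would be easier in Python (I do most of my data management in pandas, then some additional data cleaning and analysis in R).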

Answer 1

Here is a hybrid approach: pivot_wider computes the mean of the cxr_measure values, and a dplyr summarize call determines the first cxr_date.

df<- structure(list(pat_id = c(1L, 1L, 1L, 2L, 2L, 3L), 
                    index_date = c("1/2/2020",  "1/2/2020", "1/2/2020", "2/1/2020", "2/1/2020", "1/6/2020"), 
                    cxr_date = c("1/2/2020", "1/3/2020", "1/3/2020", "2/2/2020",  "2/3/2020", "1/6/2020"), 
                    delta_date = c(0L, 1L, 1L, 1L, 2L, 0L), 
                    cxr_measure = c(0.1, 0.3, 0.5, 0.2, 0.9, 0.7), 
                    age = c(55L,55L, 55L, 59L, 59L, 66L), 
                    admit = c(1L, 1L, 1L, 0L, 0L, 1L), 
                    outcome = c(0L, 0L, 0L, 0L, 0L, 1L)), class = "data.frame", row.names = c(NA, -6L))

library(tidyr)
library(dplyr)

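# Spread cxr_measure into one column per delta_date, averaging duplicates
answer <- pivot_wider(df, id_cols = -c("delta_date", "cxr_measure", "cxr_date"),
            names_from = "delta_date",
            values_from = "cxr_measure",
            values_fn = list(cxr_measure = mean),
            names_glue = "cxr_measure_{delta_date}")

# Earliest X-ray date per patient, parsed as a Date so min() is chronological
firstdate <- df %>% group_by(pat_id) %>% summarize(first_cxr_date = min(as.Date(cxr_date, "%m/%d/%Y")))

# Attach the first X-ray date to the wide table
answer <- left_join(answer, firstdate, by = "pat_id")
answer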
# A tibble: 3 x 9
  pat_id index_date   age admit outcome cxr_measure_0 cxr_measure_1 cxr_measure_2 first_cxr_date
   <int>       <chr>   <int> <int>   <int>         <dbl>         <dbl>         <dbl>    <date>        
1      1    1/2/2020      55     1       0           0.1           0.4          NA   2020-01-02    
2      2    2/1/2020      59     0       0          NA             0.2           0.9 2020-02-02    
3      3    1/6/2020      66     1       1           0.7          NA            NA   2020-01-06

I'm sure there is a way to combine all of this into a single function call, but sometimes ugly is just faster.
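The question mentions offsets from -5 to +30, which raises two details the small example does not show. The following is only a sketch on made-up data: names_sort = TRUE (an argument of pivot_wider in recent tidyr releases) orders the new columns by delta_date rather than by first appearance, and a rename step avoids names such as cxr_measure_-1 that would need backticks. The toy tibble and the "m" marker for negative days are my own assumptions, not part of the original data.

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data with a negative offset and a duplicated day
toy <- tibble::tribble(
  ~pat_id, ~delta_date, ~cxr_measure,
  1,        2,          0.5,
  1,       -1,          0.1,
  2,        0,          0.3,
  2,        2,          0.7,
  2,        2,          0.9
)

wide <- toy %>%
  pivot_wider(
    id_cols      = pat_id,
    names_from   = delta_date,
    values_from  = cxr_measure,
    values_fn    = mean,            # average rows that share a delta_date
    names_prefix = "cxr_measure_",
    names_sort   = TRUE             # order the new columns by delta_date
  ) %>%
  # Replace the minus sign so the column names stay syntactic (-1 becomes m1)
  rename_with(~ gsub("-", "m", .x, fixed = TRUE), starts_with("cxr_measure_"))

names(wide)
```

Whether to keep the minus sign or use a marker like "m" is a matter of taste; the marker just makes the columns easier to type later.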

Answer 2

To expand on @Dave2e's answer, you can use group_by and then min to get first_cxr_date per pat_id, which lets you compose a concise functional solution.

library(tibble)
library(dplyr)
library(tidyr)

df <- 
tribble( 
~pat_id,  ~index_date,  ~cxr_date,  ~delta_date,  ~cxr_measure,  ~age,  ~admit,  ~outcome, 
        1,  '1/2/2020',  '1/2/2020',          0,          0.1,   55,      1,        0, 
        1,  '1/2/2020',   '1/3/2020',           1,          0.3,   55,      1,        0, 
        1,  '1/2/2020',  '1/3/2020',          1,          0.5,   55,      1,        0, 
        2,  '2/1/2020',   '2/2/2020',           1,          0.2,   59,      0,        0, 
        2,  '2/1/2020',  '2/3/2020',          2,          0.9,   59,      0,        0, 
        3,  '1/6/2020',   '1/6/2020',           0,          0.7,   66,      1,        1)

df %>% 
  group_by(pat_id) %>% mutate(first_cxr_date = min(cxr_date)) %>% ungroup() %>% # set first_cxr_date as min of group by pat_id
  pivot_wider(id_cols = -c(delta_date, cxr_measure, cxr_date) 
              , names_from = delta_date # column names from delta_date
              , values_from = cxr_measure
              , names_prefix = 'cxr_measure_' # paste string to column names
              , values_fn = mean # combine with mean
              )

# A tibble: 3 x 9
  pat_id index_date   age admit outcome first_cxr_date cxr_measure_0 cxr_measure_1 cxr_measure_2
   <dbl> <chr>      <dbl> <dbl>   <dbl> <chr>                  <dbl>         <dbl>         <dbl>
1      1 1/2/2020      55     1       0 1/2/2020                 0.1           0.4          NA  
2      2 2/1/2020      59     0       0 2/2/2020                NA             0.2           0.9
3      3 1/6/2020      66     1       1 1/6/2020                 0.7          NA            NA
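One caveat with min(cxr_date) in the pipeline above: cxr_date is a character column, so min() compares the strings character by character. That happens to give the right answer for the sample rows, but it may not for dates such as "10/1/2020" versus "2/1/2020". A minimal sketch of the same pipeline with the dates parsed first, assuming the month/day/year format seen in the question (the toy tibble is hypothetical):

```r
library(dplyr)
library(tidyr)

# Hypothetical patient whose first film (Feb 1) precedes a later one (Oct 1)
toy <- tibble::tribble(
  ~pat_id, ~cxr_date,   ~delta_date, ~cxr_measure,
  1,       "10/1/2020", 243,         0.4,
  1,       "2/1/2020",  0,           0.2
)

# On strings, min() picks "10/1/2020" because "1" sorts before "2"
lexical_min <- min(toy$cxr_date)

result <- toy %>%
  mutate(cxr_date = as.Date(cxr_date, format = "%m/%d/%Y")) %>% # parse before comparing
  group_by(pat_id) %>%
  mutate(first_cxr_date = min(cxr_date)) %>%                    # chronological minimum
  ungroup() %>%
  pivot_wider(id_cols = -c(delta_date, cxr_measure, cxr_date),
              names_from = delta_date,
              values_from = cxr_measure,
              names_prefix = "cxr_measure_",
              values_fn = mean)

result$first_cxr_date
```

With parsed dates, first_cxr_date comes back as a Date column rather than a character one, which is also what Answer 1 returns.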

Answer 3

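Special thanks to dear Mr. @Onyambu, who taught me a valuable point today.

You can also use the following solution. Note the .value sentinel, which is especially useful with pivot_longer when several column names have to be created from the data. Here it tells pivot_wider that part of each new name is actually the name of the column the values are taken from.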

library(dplyr)
library(tidyr)


df %>%
  group_by(pat_id) %>%
  mutate(id = row_number()) %>%
  pivot_wider(names_from = delta_date, values_from = cxr_measure, 
              names_glue = "{.value}_{delta_date}") %>%
  mutate(across(cxr_measure_0:cxr_measure_2, ~ mean(.x, na.rm = TRUE))) %>%
  select(-id) %>%
  slice_head(n = 1)


# A tibble: 3 x 9
# Groups:   pat_id [3]
  pat_id index_date cxr_date   age admit outcome cxr_measure_0 cxr_measure_1 cxr_measure_2
   <int> <chr>      <chr>    <int> <int>   <int>         <dbl>         <dbl>         <dbl>
1      1 1/2/2020   1/2/2020    55     1       0           0.1           0.4         NaN  
2      2 2/1/2020   2/2/2020    59     0       0         NaN             0.2           0.9
3      3 1/6/2020   1/6/2020    66     1       1           0.7         NaN           NaN
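To make the role of .value more concrete, here is a small sketch with two value columns. The second column, cxr_score, is purely hypothetical and only there to show that {.value} expands to each source column name in turn. Passing values_fn = mean to pivot_wider is one possible alternative to averaging afterwards, and it leaves NA rather than NaN in cells with no observation.

```r
library(dplyr)
library(tidyr)

# Hypothetical data: cxr_score is an invented second measurement
toy <- tibble::tribble(
  ~pat_id, ~delta_date, ~cxr_measure, ~cxr_score,
  1,       0,           0.1,          10,
  1,       1,           0.3,          20,
  1,       1,           0.5,          40,
  2,       1,           0.2,          30
)

wide <- toy %>%
  pivot_wider(
    id_cols     = pat_id,
    names_from  = delta_date,
    values_from = c(cxr_measure, cxr_score),
    names_glue  = "{.value}_{delta_date}", # .value is replaced by each values_from name
    values_fn   = mean                     # average duplicates during the pivot
  )

names(wide)
```

The result has one row per patient, with a cxr_measure_* and a cxr_score_* column for every delta_date present in the data.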
