pivot_longer for multiple columns of repeated-measures data

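I am trying to use the pivot_longer function (from tidyr, loaded with the tidyverse) to convert my data to long format. The current wide data contains 3 repeated measurements of each patient's age, their systolic blood pressure, and whether they use antihypertensive medication (med_hypt), plus the time-invariant variable 'sex'.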

Example data and what I have tried:

library(tidyverse)  # attaches tidyr (pivot_longer), dplyr, and the %>% pipe

wide_data <- structure(list(id = c(12002, 17001, 17002, 42001, 66001, 82002, 166002, 177001, 177002, 240001), 
                            sex = structure(c(2L, 1L, 2L, 1L, 1L, 1L, 2L, 2L, 1L, 1L), 
                                            .Label = c("men", "women"), class = "factor"), 
                            time1_age = c(71.2, 67.9, 66.5, 57.7, 57.1, 60.9, 80.9, 59.7, 58.2, 66.6), 
                            time1_systolicBP = c(102, 152, NA_real_, 170, 151, 135, 162, 133, 131, 117), 
                            time1_med_hypt = c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0), 
                            time2_age = c(74.2, 69.2, 67.8, 58.9, 58.4, 62.5, 82.2, 61, 59.5, 67.8), 
                            time2_systolicBP = c(NA_real_, 146, NA_real_, 151, 129, 129, 137, 144, NA_real_, 132), 
                            time2_med_hypt = c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0), 
                            time3_age = c(78, 74.2, 72.8, 64.1, 63.3, 67.7, 87.1, 66, 64.5, 72.9), 
                            time3_systolicBP = c(NA_real_, 160.5, NA_real_, 171, 135, 160, 151, 166, 129, 150.5), 
                            time3_med_hypt = c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0)), 
                       row.names = c(NA, 10L), class = "data.frame")

# Pivoting to a longer format
long_data <- wide_data %>% 
  pivot_longer(
    cols=!id,
    names_to=c(".value", "time"), 
    names_sep="_", 
    values_drop_na=FALSE
  )

This produces the following tibble:

# A tibble: 40 x 6
      id time       sex   time1 time2 time3
   <dbl> <chr>      <fct> <dbl> <dbl> <dbl>
 1 12002 NA         women  NA    NA    NA  
 2 12002 age        NA     71.2  74.2  78  
 3 12002 systolicBP NA    102    NA    NA  
 4 12002 med        NA      0     0     0  
 5 17001 NA         men    NA    NA    NA  
 6 17001 age        NA     67.9  69.2  74.2
 7 17001 systolicBP NA    152   146   160. 
 8 17001 med        NA      0     0     0  
 9 17002 NA         women  NA    NA    NA  
10 17002 age        NA     66.5  67.8  72.8
# ... with 30 more rows

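What I want is a data frame with the columns id, time, age, sex, systolicBP and med_hypt, with 3 rows per patient corresponding to the 3 repeated measurements.

Any help would be appreciated.

If I understood you correctly, sex should also be excluded from the columns being pivoted: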

   wide_data %>% 
      pivot_longer(
        cols=-c(id, sex),
        names_to=c(".value", "time"), 
        names_sep = "_", 
        values_drop_na=FALSE
      )

# A tibble: 30 x 6
      id sex   time       time1 time2 time3
   <dbl> <fct> <chr>      <dbl> <dbl> <dbl>
 1 12002 women age         71.2  74.2  78  
 2 12002 women systolicBP 102    NA    NA  
 3 12002 women med          0     0     0  
 4 17001 men   age         67.9  69.2  74.2
 5 17001 men   systolicBP 152   146   160. 
 6 17001 men   med          0     0     0  
 7 17002 women age         66.5  67.8  72.8
 8 17002 women systolicBP  NA    NA    NA  
 9 17002 women med          0     0     0  
10 42001 men   age         57.7  58.9  64.1

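Because some of the column names contain more than one underscore, it is better to use names_pattern rather than names_sep. names_pattern lets us pass a flexible regular expression whose capture groups are extracted from the column names.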

tidyr::pivot_longer(wide_data,
    cols = -c(id, sex),
    names_to = c("time", ".value"),
    names_pattern = '(.*?)_(.*)$'   # lazy first group: split at the first "_" only
  )

#      id sex   time    age systolicBP med_hypt
#   <dbl> <fct> <chr> <dbl>      <dbl>    <dbl>
# 1 12002 women time1  71.2       102         0
# 2 12002 women time2  74.2        NA         0
# 3 12002 women time3  78          NA         0
# 4 17001 men   time1  67.9       152         0
# 5 17001 men   time2  69.2       146         0
# 6 17001 men   time3  74.2       160.        0
# 7 17002 women time1  66.5        NA         0
# 8 17002 women time2  67.8        NA         0
# 9 17002 women time3  72.8        NA         0
#10 42001 men   time1  57.7       170         0
# … with 20 more rows
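To see why splitting on "_" is uneven here while the lazy pattern is not, the column names can be inspected directly with base R. This is only a minimal sketch: the vector nms is a hypothetical stand-in for three of the column names, and sub() with perl = TRUE is used as an approximation of how the pattern is applied, not as tidyr's actual internals.

```r
# Hypothetical sample of the column names in wide_data
nms <- c("time1_age", "time1_systolicBP", "time1_med_hypt")

# Splitting at every "_" gives two pieces for most names, but three for med_hypt
pieces <- strsplit(nms, "_", fixed = TRUE)
n_pieces <- lengths(pieces)
n_pieces

# The lazy group (.*?) stops at the first "_"; (.*) takes everything after it
time_part  <- sub("(.*?)_(.*)$", "\\1", nms, perl = TRUE)
value_part <- sub("(.*?)_(.*)$", "\\2", nms, perl = TRUE)
time_part
value_part
```

The first capture group feeds the time column and the second supplies the names of the new value columns, which is why med_hypt survives intact.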

This probably adds nothing new to the solutions already posted; the only difference is the regex used for the names_pattern argument.

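  • As you may have noticed, some of your column names are separated by one _ and others by two. \w+ captures any run of word characters, and by specifying that it is followed by digits with \d+, as in the time3 of time3_age, we tell pivot_longer to store the part of the column name corresponding to time3 in the time column. The rest of the column name is then used as the name of the variable being measured in each row: age, systolicBP, med_hypt.
  • Note that with \w+\d+ rather than a bare \w+ for the first group, the remainder is captured whole as the column name, whether it contains an underscore (med_hypt) or not (systolicBP). With only \w+, which also matches underscores and is greedy, the first group would swallow med as well and the resulting column would be hypt instead of med_hypt.
  • Finally, because the column names are split into two parts (the two capture groups), names_pattern or names_sep has to be supplied to specify how each part is defined and separated.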
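The greedy-matching point in the bullets above can be checked with base R before pivoting. This is a rough sketch under stated assumptions: nms is a hypothetical sample of the column names, and sub() with perl = TRUE stands in for the matching that pivot_longer performs.

```r
# Hypothetical sample of the column names in wide_data
nms <- c("time3_age", "time3_systolicBP", "time3_med_hypt")

# \w also matches "_", so a greedy first group swallows "time3_med"
greedy_value <- sub("(\\w+)_(\\w+)", "\\2", nms, perl = TRUE)
greedy_value

# Forcing the first group to end in digits pins it to "time3"
digit_value <- sub("(\\w+\\d+)_(\\w+)", "\\2", nms, perl = TRUE)
digit_value
```

Only the last name differs between the two patterns, since it is the only one with a second underscore.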

library(tidyr)   # pivot_longer
library(dplyr)   # %>%

# Backslashes must be doubled inside an R string literal
wide_data %>%
  pivot_longer(!c(id, sex), names_to = c("time", ".value"), 
               names_pattern = "(\\w+\\d+)_(\\w+)")

# A tibble: 30 x 6
      id sex   time    age systolicBP med_hypt
   <dbl> <fct> <chr> <dbl>      <dbl>    <dbl>
 1 12002 women time1  71.2       102         0
 2 12002 women time2  74.2        NA         0
 3 12002 women time3  78          NA         0
 4 17001 men   time1  67.9       152         0
 5 17001 men   time2  69.2       146         0
 6 17001 men   time3  74.2       160.        0
 7 17002 women time1  66.5        NA         0
 8 17002 women time2  67.8        NA         0
 9 17002 women time3  72.8        NA         0
10 42001 men   time1  57.7       170         0
# ... with 20 more rows