pivot_longer for multiple columns of repeated-measures data
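I am trying to use the pivot_longer function from the tidyr package to convert my data to long format. The current wide data contain 3 repeated measurements of each patient's age, systolic blood pressure, and use of antihypertensive medication (med_hypt), plus the time-invariant variable 'sex'.
Sample data and what I have tried: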
library(tidyverse)
library(dplyr)
library(magrittr)
wide_data <- structure(list(id = c(12002, 17001, 17002, 42001, 66001, 82002, 166002, 177001, 177002, 240001),
sex = structure(c(2L, 1L, 2L, 1L, 1L, 1L, 2L, 2L, 1L, 1L),
.Label = c("men", "women"), class = "factor"),
time1_age = c(71.2, 67.9, 66.5, 57.7, 57.1, 60.9, 80.9, 59.7, 58.2, 66.6),
time1_systolicBP = c(102, 152, NA_real_, 170, 151, 135, 162, 133, 131, 117),
time1_med_hypt = c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0),
time2_age = c(74.2, 69.2, 67.8, 58.9, 58.4, 62.5, 82.2, 61, 59.5, 67.8),
time2_systolicBP = c(NA_real_, 146, NA_real_, 151, 129, 129, 137, 144, NA_real_, 132),
time2_med_hypt = c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0),
time3_age = c(78, 74.2, 72.8, 64.1, 63.3, 67.7, 87.1, 66, 64.5, 72.9),
time3_systolicBP = c(NA_real_, 160.5, NA_real_, 171, 135, 160, 151, 166, 129, 150.5),
time3_med_hypt = c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0)),
row.names = c(NA, 10L), class = "data.frame")
# Pivoting to a longer format
long_data <- wide_data %>%
pivot_longer(
cols=!id,
names_to=c(".value", "time"),
names_sep="_",
values_drop_na=FALSE
)
This produces the following tibble:
# A tibble: 40 x 6
id time sex time1 time2 time3
<dbl> <chr> <fct> <dbl> <dbl> <dbl>
1 12002 NA women NA NA NA
2 12002 age NA 71.2 74.2 78
3 12002 systolicBP NA 102 NA NA
4 12002 med NA 0 0 0
5 17001 NA men NA NA NA
6 17001 age NA 67.9 69.2 74.2
7 17001 systolicBP NA 152 146 160.
8 17001 med NA 0 0 0
9 17002 NA women NA NA NA
10 17002 age NA 66.5 67.8 72.8
# ... with 30 more rows
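What I want is a data frame with the columns id, time, age, sex, systolicBP, and med_hypt, with 3 rows per patient corresponding to the 3 repeated measurements.
Any help would be appreciated.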
If I understand correctly, sex should also be excluded from the pivoted columns:
wide_data %>%
pivot_longer(
cols=-c(id, sex),
names_to=c(".value", "time"),
names_sep = "_",
values_drop_na=FALSE
)
# A tibble: 30 x 6
id sex time time1 time2 time3
<dbl> <fct> <chr> <dbl> <dbl> <dbl>
1 12002 women age 71.2 74.2 78
2 12002 women systolicBP 102 NA NA
3 12002 women med 0 0 0
4 17001 men age 67.9 69.2 74.2
5 17001 men systolicBP 152 146 160.
6 17001 men med 0 0 0
7 17002 women age 66.5 67.8 72.8
8 17002 women systolicBP NA NA NA
9 17002 women med 0 0 0
10 42001 men age 57.7 58.9 64.1
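One way to see why names_sep = "_" cannot give the desired result is to split the column names on the underscore with base R. This is only a sketch on a hand-typed subset of the column names from the question; cols, pieces, n_pieces, and first_piece are illustrative names, not part of the original post.

```r
# Subset of the wide column names from the question (typed by hand for illustration)
cols <- c("time1_age", "time1_systolicBP", "time1_med_hypt")

# Split every name on a literal "_", similar to what names_sep = "_" asks for
pieces <- strsplit(cols, "_", fixed = TRUE)

# Number of pieces per column name: time1_med_hypt yields three pieces, not two
n_pieces <- lengths(pieces)
print(setNames(n_pieces, cols))

# The first piece is always the time point, so it belongs in "time", not ".value"
first_piece <- vapply(pieces, `[`, character(1), 1)
print(unique(first_piece))
```

Because names_to has only two slots, the extra piece (hypt) does not fit anywhere, which is consistent with the output above showing med rather than med_hypt. And because the first piece is the time point, mapping it to ".value" is what creates the time1, time2, and time3 columns.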
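Because some column names contain more than one underscore, it is better to use names_pattern instead of names_sep. names_pattern lets us pass a flexible regular expression whose capture groups are extracted from the column names.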
tidyr::pivot_longer(wide_data,
  cols = -c(id, sex),
  names_to = c("time", ".value"),
  names_pattern = '(.*?)_(.*)$'
)
# id sex time age systolicBP med_hypt
# <dbl> <fct> <chr> <dbl> <dbl> <dbl>
# 1 12002 women time1 71.2 102 0
# 2 12002 women time2 74.2 NA 0
# 3 12002 women time3 78 NA 0
# 4 17001 men time1 67.9 152 0
# 5 17001 men time2 69.2 146 0
# 6 17001 men time3 74.2 160. 0
# 7 17002 women time1 66.5 NA 0
# 8 17002 women time2 67.8 NA 0
# 9 17002 women time3 72.8 NA 0
#10 42001 men time1 57.7 170 0
# … with 20 more rows
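To check what this pattern captures without running a pivot, the same regex can be tried on a few column names with base R's sub(). This is a sketch under assumptions: cols is a hand-typed subset of the names, time_part and value_part are illustrative variables, and pivot_longer may use a different regex engine internally, so treat it as an approximation of what happens inside tidyr.

```r
# A few column names from the question, typed by hand for illustration
cols <- c("time1_age", "time2_systolicBP", "time3_med_hypt")
pattern <- "(.*?)_(.*)$"

# perl = TRUE so that the lazy quantifier .*? is handled by PCRE
time_part  <- sub(pattern, "\\1", cols, perl = TRUE)  # text before the first "_"
value_part <- sub(pattern, "\\2", cols, perl = TRUE)  # everything after the first "_"

print(data.frame(column = cols, time = time_part, value = value_part))
```

The lazy first group stops at the first underscore, so any further underscores stay inside the second group and med_hypt survives intact.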
This may not add anything new to the solutions already posted; the only difference is the regex used for the names_pattern argument.
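- Notice that some of your column names are separated by one _ while others are separated by two. \w+ captures one or more word characters (and _ counts as a word character). If I specify that it is followed by one or more digits with \d+, as in the time3 of time3_age, we tell pivot_longer to store the part of the column name that corresponds to time3 in the time column. The rest of the column name then supplies the names of the measured variables: age, systolicBP, and med_hypt.
- Note that if we use \w+\d+ rather than \w+ for the first group, the remainder is captured whole as the column name, whether it contains an underscore (med_hypt) or not (systolicBP). If we used only \w+, it would also capture med, and the resulting column would be hypt instead of med_hypt.
- Finally, because names_to has two elements, I have to supply either names_pattern or names_sep to specify how each of them is defined and separated.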
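library(dplyr)  # for %>%
library(tidyr)  # pivot_longer lives in tidyr
wide_data %>%
  pivot_longer(!c(id, sex), names_to = c("time", ".value"),
               # backslashes must be doubled inside an R string literal
               names_pattern = "(\\w+\\d+)_(\\w+)")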
# A tibble: 30 x 6
id sex time age systolicBP med_hypt
<dbl> <fct> <chr> <dbl> <dbl> <dbl>
1 12002 women time1 71.2 102 0
2 12002 women time2 74.2 NA 0
3 12002 women time3 78 NA 0
4 17001 men time1 67.9 152 0
5 17001 men time2 69.2 146 0
6 17001 men time3 74.2 160. 0
7 17002 women time1 66.5 NA 0
8 17002 women time2 67.8 NA 0
9 17002 women time3 72.8 NA 0
10 42001 men time1 57.7 170 0
# ... with 20 more rows
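The difference between the two patterns discussed above can be checked on a single column name with base R. This is a minimal sketch; capture_groups is a hypothetical helper written for this illustration, and base R's PCRE engine (perl = TRUE) is assumed to behave like the engine pivot_longer uses for these simple patterns.

```r
col <- "time1_med_hypt"

# Hypothetical helper: return the capture groups of `pattern` matched against `x`
capture_groups <- function(pattern, x) {
  m <- regmatches(x, regexec(pattern, x, perl = TRUE))[[1]]
  m[-1]  # drop the full match, keep only the capture groups
}

# Greedy \w+ also matches "_", so the first group swallows "time1_med"
greedy <- capture_groups("(\\w+)_(\\w+)", col)

# Requiring digits right before the "_" pins the first group to "time1"
anchored <- capture_groups("(\\w+\\d+)_(\\w+)", col)

print(greedy)
print(anchored)
```

With the first pattern the value name would be hypt; with the second it is med_hypt, which matches the column names in the output above.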