Converting data from wide to long format when id variables are encoded in column headers
I am fairly new to R and have the following data in wide format:
subject_id  age  sex  treat1.1.param1  treat1.1.param2  treat1.2.param1  treat1.2.param2
-----------------------------------------------------------------------------------------------
1           23   M    1                2                3                4
2           25   W    5                6                7                8
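This is data for several subjects under a given treatment (here treat1), with several parameters (here param1 and param2) measured over repeated rounds (here rounds 1 and 2). The treatment, round, and parameter that each entry belongs to are encoded in the column header, as in the example above.
I would like to convert the data to long format, like this: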
subject_id  age  sex  treatment  round  param1  param2
------------------------------------------------------------------------------------------
1           23   M    treat1     1      1       2
1           23   M    treat1     2      3       4
2           25   W    treat1     1      5       6
2           25   W    treat1     2      7       8
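That is, the id variables that identify a single observation are subject_id, treatment, and round. Because the latter two are encoded in the column headers with a dot as the separator, I don't know how to get from the wide format above to the long format. All my attempts based on the standard examples (using reshape2 or tidyr) have failed. In reality I have 12 treatments with 30 rounds each and about 50 parameters per round, so a relatively manual approach would not help me much.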
We can use pivot_longer from tidyr, specifying the names_to and names_pattern arguments.
library(tidyr)

pivot_longer(df,
             cols = starts_with("treat"),
             # ".value" turns the third captured piece into output column names
             names_to = c("treatment", "round", ".value"),
             # backslashes must be doubled inside an R string
             names_pattern = "(\\w+)\\.(\\d+)\\.(\\w+)")
#  subject_id   age sex   treatment round param1 param2
#       <int> <int> <fct> <chr>     <chr>  <int>  <int>
#1          1    23 M     treat1    1          1      2
#2          1    23 M     treat1    2          3      4
#3          2    25 W     treat1    1          5      6
#4          2    25 W     treat1    2          7      8
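To see what names_pattern does with each header, it may help to apply the same regular expression to the column names directly. The following is only a sketch in base R: the headers vector is the example's column names typed out by hand, and pattern and parts are names made up for this illustration.

```r
# Column headers from the example, typed out by hand for illustration
headers <- c("treat1.1.param1", "treat1.1.param2",
             "treat1.2.param1", "treat1.2.param2")

# Same pattern as passed to names_pattern; backslashes are doubled in R strings
pattern <- "(\\w+)\\.(\\d+)\\.(\\w+)"

# regexec() finds the full match plus one entry per capture group;
# regmatches() extracts them as one character vector per header
parts <- do.call(rbind, regmatches(headers, regexec(pattern, headers)))
colnames(parts) <- c("header", "treatment", "round", ".value")
parts
```

Each row shows how one header is split into the three pieces named in names_to. Note that round comes back as character in the pivot_longer result; recent tidyr versions accept names_transform = list(round = as.integer) in pivot_longer if an integer column is preferred.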
Data
df <- structure(list(subject_id = 1:2, age = c(23L, 25L), sex = structure(1:2,
.Label = c("M", "W"), class = "factor"),
treat1.1.param1 = c(1L, 5L), treat1.1.param2 = c(2L, 6L),
treat1.2.param1 = c(3L, 7L), treat1.2.param2 = c(4L, 8L)),
class = "data.frame", row.names = c(NA, -2L))
You can use tidyr's gather, separate, and spread:
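library(dplyr)   # provides %>%, mutate_at() and vars()
library(tidyr)

tibble::tibble(subject_id = 1:2,
               age = c(23, 25),
               sex = c("M", "W"),
               round_1_param_1 = c(1, 5),
               round_1_param_2 = c(2, 6),
               round_2_param_1 = c(3, 7),
               round_2_param_2 = c(4, 8)) %>%
  gather("key", "value", -subject_id, -age, -sex) %>%
  # "round_1_param_1" is split into "round_1_" and "_1"
  separate(key, c("round", "param"), sep = "param") %>%
  # keep only the number in each piece; readr::parse_number() is the
  # replacement for the deprecated tidyr::extract_numeric()
  mutate_at(vars("round", "param"), ~ readr::parse_number(.)) %>%
  spread(param, value)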
# A tibble: 4 x 6
subject_id age sex round `1` `2`
<int> <dbl> <chr> <dbl> <dbl> <dbl>
1 1 23 M 1 1 2
2 1 23 M 2 3 4
3 2 25 W 1 5 6
4 2 25 W 2 7 8
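The example above uses underscore-separated headers rather than the dot-separated ones from the question. As a rough sketch of the same gather / separate / spread idea applied to the question's headers, written with the newer pivot_longer and pivot_wider, something like the following might work. The data frame is rebuilt here so the block runs on its own, and key, long and res are names chosen only for this illustration.

```r
library(tidyr)

# Example data from the question, rebuilt so this block is self-contained
df <- data.frame(subject_id = 1:2, age = c(23L, 25L), sex = c("M", "W"),
                 treat1.1.param1 = c(1L, 5L), treat1.1.param2 = c(2L, 6L),
                 treat1.2.param1 = c(3L, 7L), treat1.2.param2 = c(4L, 8L))

# Step 1 (like gather): stack all treat* columns into key/value pairs
long <- pivot_longer(df, cols = starts_with("treat"),
                     names_to = "key", values_to = "value")

# Step 2: split the key on literal dots; convert = TRUE makes round an integer
long <- separate(long, key, into = c("treatment", "round", "param"),
                 sep = "\\.", convert = TRUE)

# Step 3 (like spread): one column per parameter
res <- pivot_wider(long, names_from = param, values_from = value)
res
```

This takes three steps where the names_pattern approach needs one, but the intermediate long table can be handy for inspecting how the headers were split.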
Here is a possible data.table approach:
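library(data.table)

# assumption: dd is the example data (df from the Data block) as a data.table
dd <- as.data.table(df)

dcast(melt(dd, id.vars = c("subject_id", "age", "sex"))
      [, .(subject_id, age, sex,
           gsub("(\\w+)\\.\\d+\\.\\w+", "\\1", variable),   # V4: treatment
           gsub("\\w+\\.(\\d+)\\.\\w+", "\\1", variable),   # V5: round
           gsub("\\w+\\.\\d+\\.(\\w+)", "\\1", variable),   # V6: parameter
           value)],
      subject_id + age + sex + V4 + V5 ~ V6)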
This gives:
subject_id age sex V4 V5 param1 param2
1: 1 23 M treat1 1 1 2
2: 1 23 M treat1 2 3 4
3: 2 25 W treat1 1 5 6
4: 2 25 W treat1 2 7 8
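The three gsub calls each pick one piece out of the header. A possible variation, shown here only as a sketch, is to split the header once with data.table's tstrsplit, which also lets the new columns get readable names instead of V4 and V5. The data is rebuilt so the block runs on its own; long and res are names chosen for this illustration.

```r
library(data.table)

# Example data from the question, rebuilt so this block is self-contained
dd <- data.table(subject_id = 1:2, age = c(23L, 25L), sex = c("M", "W"),
                 treat1.1.param1 = c(1L, 5L), treat1.1.param2 = c(2L, 6L),
                 treat1.2.param1 = c(3L, 7L), treat1.2.param2 = c(4L, 8L))

long <- melt(dd, id.vars = c("subject_id", "age", "sex"))

# Split "treat1.1.param1" on literal dots into three named columns;
# type.convert = TRUE turns the round piece into an integer
long[, c("treatment", "round", "param") :=
       tstrsplit(as.character(variable), ".", fixed = TRUE, type.convert = TRUE)]

res <- dcast(long, subject_id + age + sex + treatment + round ~ param,
             value.var = "value")
res
```

Using fixed = TRUE avoids having to escape the dot, at the cost of assuming every header has exactly three dot-separated parts.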