使用大量连续的几列变量重塑分组数据框
Reshaping a grouped data frame with large sets of several successive columns of variables
我正在使用与此类似的数据框:
set.seed(1)
test.df <- data.frame(Treatment= "CI",
period = seq(1, 3),
subject= 1,
X.1. = rnorm(6),
X.2. = rnorm(6),
X.3. = rnorm(6),
Y.1. = rnorm(6),
Y.2. = rnorm(6),
Y.3. = rnorm(6))
> test.df
Treatment period subject X.1. X.2. X.3. Y.1. Y.2. Y.3.
1 CI 1 1 -0.6264538 0.4874291 -0.62124058 0.82122120 0.61982575 1.35867955
2 CI 2 1 0.1836433 0.7383247 -2.21469989 0.59390132 -0.05612874 -0.10278773
3 CI 3 1 -0.8356286 0.5757814 1.12493092 0.91897737 -0.15579551 0.38767161
4 CI 1 1 1.5952808 -0.3053884 -0.04493361 0.78213630 -1.47075238 -0.05380504
5 CI 2 1 0.3295078 1.5117812 -0.01619026 0.07456498 -0.47815006 -1.37705956
6 CI 3 1 -0.8204684 0.3898432 0.94383621 -1.98935170 0.41794156 -0.41499456
我希望我的数据如下所示:
Treatment period subject Game X Y
1 CI 1 1 1 -0.6264538 0.82122120
其中 Game 来自 1:3,并且已针对每组 c(治疗,期间)完成此操作。但在真实数据中,除了 X 和 Y 之外,还有大约 16 个其他类似的变量。受到 this post 的启发,我尝试通过以下方式进行:
final.df<- test.df %>%
group_by(Treatment, period) %>%
reshape(idvar=1:3, varying=4:ncol(test.df), sep=".", direction='long',times=1:3)
我收到以下错误
Error in `row.names<-.data.frame`(`*tmp*`, value = paste(d[, idvar], times[1L],
: invalid 'row.names' length
df.stk <- tidyr::gather(test.df, "xy", "value", -Treatment:-subject)
df.sep <- tidyr::separate(df.stk, "xy", c("xy", "Game", "temp"), sep="\.")[, -6]
df.final <- reshape2::dcast(df.sep, Treatment + period + subject + Game ~ xy, fun.aggregate=mean)
但您需要将所有 X 和 Y 名称设为小写或大写
修改名称后,您仍然面临两个问题:(1)您当前的ID 不唯一; (2) reshape
不喜欢 tbl_df
。因此,如果您坚持使用 "dplyr":
,则需要执行以下操作
names(test.df) <- toupper(names(test.df))
test.df %>%
group_by(TREATMENT, PERIOD) %>%
mutate(NEW_ID = sequence(n())) %>%
data.frame %>%
reshape(varying = grep("^X\.|^Y\.", names(test.df)), sep=".", direction = "long")
# TREATMENT PERIOD SUBJECT NEW_ID time X Y id
# 1.1 CI 1 1 1 1 0.898037653 -1.62380441 1
# 2.1 CI 2 1 1 1 -0.265867132 1.19260758 2
# 3.1 CI 3 1 1 1 0.478254223 0.37225231 3
# 4.1 CI 1 1 2 1 0.193781526 0.78440441 4
# 5.1 CI 2 1 2 1 -0.785203396 -0.88621250 5
# 6.1 CI 3 1 2 1 0.341740150 -0.67919816 6
# 1.2 CI 1 1 1 2 -1.808196090 1.64211603 1
# 2.2 CI 2 1 1 2 -0.937445606 -0.35388758 2
# 3.2 CI 3 1 1 2 1.773354124 0.95633070 3
# 4.2 CI 1 1 2 2 -0.819681242 1.06421615 4
# 5.2 CI 2 1 2 2 0.003812118 -0.04835364 5
# 6.2 CI 3 1 2 2 0.226081490 0.50687855 6
# 1.3 CI 1 1 1 3 0.822497674 -0.55875020 1
# 2.3 CI 2 1 1 3 0.382695603 -0.83661977 2
# 3.3 CI 3 1 1 3 0.066738811 -1.96761492 3
# 4.3 CI 1 1 2 3 0.854280148 -0.49335882 4
# 5.3 CI 2 1 2 3 -1.635859887 1.18322984 5
# 6.3 CI 3 1 2 3 -0.020864680 1.20997470 6
不过,我建议您从 "data.table" 查看 melt
:
library(data.table)
melt(as.data.table(setnames(test.df, toupper(names(test.df)))),
measure.vars = patterns("^X\.", "^Y\."), value.name = c("X", "Y"))
# TREATMENT PERIOD SUBJECT variable X Y
# 1: CI 1 1 1 0.898037653 -1.62380441
# 2: CI 2 1 1 -0.265867132 1.19260758
# 3: CI 3 1 1 0.478254223 0.37225231
# 4: CI 1 1 1 0.193781526 0.78440441
# 5: CI 2 1 1 -0.785203396 -0.88621250
# 6: CI 3 1 1 0.341740150 -0.67919816
# 7: CI 1 1 2 -1.808196090 1.64211603
# 8: CI 2 1 2 -0.937445606 -0.35388758
# 9: CI 3 1 2 1.773354124 0.95633070
# 10: CI 1 1 2 -0.819681242 1.06421615
# 11: CI 2 1 2 0.003812118 -0.04835364
# 12: CI 3 1 2 0.226081490 0.50687855
# 13: CI 1 1 3 0.822497674 -0.55875020
# 14: CI 2 1 3 0.382695603 -0.83661977
# 15: CI 3 1 3 0.066738811 -1.96761492
# 16: CI 1 1 3 0.854280148 -0.49335882
# 17: CI 2 1 3 -1.635859887 1.18322984
# 18: CI 3 1 3 -0.020864680 1.20997470
我正在使用与此类似的数据框:
set.seed(1)
test.df <- data.frame(Treatment= "CI",
period = seq(1, 3),
subject= 1,
X.1. = rnorm(6),
X.2. = rnorm(6),
X.3. = rnorm(6),
Y.1. = rnorm(6),
Y.2. = rnorm(6),
Y.3. = rnorm(6))
> test.df
Treatment period subject X.1. X.2. X.3. Y.1. Y.2. Y.3.
1 CI 1 1 -0.6264538 0.4874291 -0.62124058 0.82122120 0.61982575 1.35867955
2 CI 2 1 0.1836433 0.7383247 -2.21469989 0.59390132 -0.05612874 -0.10278773
3 CI 3 1 -0.8356286 0.5757814 1.12493092 0.91897737 -0.15579551 0.38767161
4 CI 1 1 1.5952808 -0.3053884 -0.04493361 0.78213630 -1.47075238 -0.05380504
5 CI 2 1 0.3295078 1.5117812 -0.01619026 0.07456498 -0.47815006 -1.37705956
6 CI 3 1 -0.8204684 0.3898432 0.94383621 -1.98935170 0.41794156 -0.41499456
我希望我的数据如下所示:
Treatment period subject Game X Y
1 CI 1 1 1 -0.6264538 0.82122120
其中 Game 来自 1:3,并且已针对每组 c(治疗,期间)完成此操作。但在真实数据中,除了 X 和 Y 之外,还有大约 16 个其他类似的变量。受到 this post 的启发,我尝试通过以下方式进行:
final.df<- test.df %>%
group_by(Treatment, period) %>%
reshape(idvar=1:3, varying=4:ncol(test.df), sep=".", direction='long',times=1:3)
我收到以下错误
Error in `row.names<-.data.frame`(`*tmp*`, value = paste(d[, idvar], times[1L],
: invalid 'row.names' length
df.stk <- tidyr::gather(test.df, "xy", "value", -Treatment:-subject)
df.sep <- tidyr::separate(df.stk, "xy", c("xy", "Game", "temp"), sep="\.")[, -6]
df.final <- reshape2::dcast(df.sep, Treatment + period + subject + Game ~ xy, fun.aggregate=mean)
但您需要将所有 X 和 Y 名称设为小写或大写
修改名称后,您仍然面临两个问题:(1)您当前的ID 不唯一; (2) reshape
不喜欢 tbl_df
。因此,如果您坚持使用 "dplyr":
names(test.df) <- toupper(names(test.df))
test.df %>%
group_by(TREATMENT, PERIOD) %>%
mutate(NEW_ID = sequence(n())) %>%
data.frame %>%
reshape(varying = grep("^X\.|^Y\.", names(test.df)), sep=".", direction = "long")
# TREATMENT PERIOD SUBJECT NEW_ID time X Y id
# 1.1 CI 1 1 1 1 0.898037653 -1.62380441 1
# 2.1 CI 2 1 1 1 -0.265867132 1.19260758 2
# 3.1 CI 3 1 1 1 0.478254223 0.37225231 3
# 4.1 CI 1 1 2 1 0.193781526 0.78440441 4
# 5.1 CI 2 1 2 1 -0.785203396 -0.88621250 5
# 6.1 CI 3 1 2 1 0.341740150 -0.67919816 6
# 1.2 CI 1 1 1 2 -1.808196090 1.64211603 1
# 2.2 CI 2 1 1 2 -0.937445606 -0.35388758 2
# 3.2 CI 3 1 1 2 1.773354124 0.95633070 3
# 4.2 CI 1 1 2 2 -0.819681242 1.06421615 4
# 5.2 CI 2 1 2 2 0.003812118 -0.04835364 5
# 6.2 CI 3 1 2 2 0.226081490 0.50687855 6
# 1.3 CI 1 1 1 3 0.822497674 -0.55875020 1
# 2.3 CI 2 1 1 3 0.382695603 -0.83661977 2
# 3.3 CI 3 1 1 3 0.066738811 -1.96761492 3
# 4.3 CI 1 1 2 3 0.854280148 -0.49335882 4
# 5.3 CI 2 1 2 3 -1.635859887 1.18322984 5
# 6.3 CI 3 1 2 3 -0.020864680 1.20997470 6
不过,我建议您从 "data.table" 查看 melt
:
library(data.table)
melt(as.data.table(setnames(test.df, toupper(names(test.df)))),
measure.vars = patterns("^X\.", "^Y\."), value.name = c("X", "Y"))
# TREATMENT PERIOD SUBJECT variable X Y
# 1: CI 1 1 1 0.898037653 -1.62380441
# 2: CI 2 1 1 -0.265867132 1.19260758
# 3: CI 3 1 1 0.478254223 0.37225231
# 4: CI 1 1 1 0.193781526 0.78440441
# 5: CI 2 1 1 -0.785203396 -0.88621250
# 6: CI 3 1 1 0.341740150 -0.67919816
# 7: CI 1 1 2 -1.808196090 1.64211603
# 8: CI 2 1 2 -0.937445606 -0.35388758
# 9: CI 3 1 2 1.773354124 0.95633070
# 10: CI 1 1 2 -0.819681242 1.06421615
# 11: CI 2 1 2 0.003812118 -0.04835364
# 12: CI 3 1 2 0.226081490 0.50687855
# 13: CI 1 1 3 0.822497674 -0.55875020
# 14: CI 2 1 3 0.382695603 -0.83661977
# 15: CI 3 1 3 0.066738811 -1.96761492
# 16: CI 1 1 3 0.854280148 -0.49335882
# 17: CI 2 1 3 -1.635859887 1.18322984
# 18: CI 3 1 3 -0.020864680 1.20997470