How to group rows by ID and calculate mean and IQR
I have a long-format data frame, `mydata`, with 101 participants, each of whom scored points on 51 trials (`Event`), as shown below:
dput(head(mydata, 200))
`structure(list(Participant = c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2,
2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
2, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,
3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,
3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 4, 4, 4, 4,
4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4,
4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4,
4), Score = c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 18, 0, 0, 0, 0, 0, 0, 15, 45, 75, 20, -4,
6, 12, 10, 5, 0, 25, 2, 48, 17, 7, 2, 30, 32, 40, 0, 10, 32,
0, 13, -1, 0, 0, 4, 0, 0, 20, 0, 0, 0, 10, 3, 16, 9, 0, 26, 33,
9, 5, 2, 0, 0, 5, 50, 0, 0, 0, 1, 0, 0, 10, 10, 15, 0, 10, 5,
0, 0, 0, 0, 20, 79, 5, 35, 0, 0, 5, 0, 10, 10, 30, 30, 10, 25,
5, 25, 0, 75, 0, 70, 0, 0, 1, 5, 10, 0, 15, 0, 0, 55, 5, 40,
0, 1, 30, 5, 15, 30, 9, 5, 1, 20, 11, 8, 10, 30, 6, 15, 3, 15,
0, 20, 25, 16, 3, 38, 5, 15, 19, 0, 20, 0, 5, 0, 5, 0, 17, 25,
40, 0, 31, 53, 2, 30, 0, 10, 3, 13, 0, 5, 22, 5, 4, 20, 0), Event = c(1L, 2L, 3L, 4L, 5L, 6L, 7L, 8L, 9L, 10L, 11L, 12L, 13L, 14L, 15L,
16L, 17L, 18L, 19L, 20L, 21L, 22L, 23L, 24L, 25L, 26L, 27L, 28L,
29L, 30L, 31L, 32L, 33L, 34L, 35L, 36L, 37L, 38L, 39L, 40L, 41L,
42L, 43L, 44L, 45L, 46L, 47L, 48L, 49L, 50L, 51L, 1L, 2L, 3L,
4L, 5L, 6L, 7L, 8L, 9L, 10L, 11L, 12L, 13L, 14L, 15L, 16L, 17L,
18L, 19L, 20L, 21L, 22L, 23L, 24L, 25L, 26L, 27L, 28L, 29L, 30L,
31L, 32L, 33L, 34L, 35L, 36L, 37L, 38L, 39L, 40L, 41L, 42L, 43L,
44L, 45L, 46L, 47L, 48L, 49L, 50L, 51L, 1L, 2L, 3L, 4L, 5L, 6L,
7L, 8L, 9L, 10L, 11L, 12L, 13L, 14L, 15L, 16L, 17L, 18L, 19L,
20L, 21L, 22L, 23L, 24L, 25L, 26L, 27L, 28L, 29L, 30L, 31L, 32L,
33L, 34L, 35L, 36L, 37L, 38L, 39L, 40L, 41L, 42L, 43L, 44L, 45L,
46L, 47L, 48L, 49L, 50L, 51L, 1L, 2L, 3L, 4L, 5L, 6L, 7L, 8L,
9L, 10L, 11L, 12L, 13L, 14L, 15L, 16L, 17L, 18L, 19L, 20L, 21L,
22L, 23L, 24L, 25L, 26L, 27L, 28L, 29L, 30L, 31L, 32L, 33L, 34L,
35L, 36L, 37L, 38L, 39L, 40L, 41L, 42L, 43L, 44L, 45L, 46L, 47L
)), row.names = c(1L, 105L, 209L, 313L, 417L, 521L, 625L, 729L,
833L, 937L, 1041L, 1145L, 1249L, 1353L, 1457L, 1561L, 1665L,
1769L, 1873L, 1977L, 2081L, 2185L, 2289L, 2393L, 2497L, 2601L,
2705L, 2809L, 2913L, 3017L, 3121L, 3225L, 3329L, 3433L, 3537L,
3641L, 3745L, 3849L, 3953L, 4057L, 4161L, 4265L, 4369L, 4473L,
4577L, 4681L, 4785L, 4889L, 4993L, 5097L, 5201L, 2L, 106L, 210L,
314L, 418L, 522L, 626L, 730L, 834L, 938L, 1042L, 1146L, 1250L,
1354L, 1458L, 1562L, 1666L, 1770L, 1874L, 1978L, 2082L, 2186L,
2290L, 2394L, 2498L, 2602L, 2706L, 2810L, 2914L, 3018L, 3122L,
3226L, 3330L, 3434L, 3538L, 3642L, 3746L, 3850L, 3954L, 4058L,
4162L, 4266L, 4370L, 4474L, 4578L, 4682L, 4786L, 4890L, 4994L,
5098L, 5202L, 3L, 107L, 211L, 315L, 419L, 523L, 627L, 731L, 835L,
939L, 1043L, 1147L, 1251L, 1355L, 1459L, 1563L, 1667L, 1771L,
1875L, 1979L, 2083L, 2187L, 2291L, 2395L, 2499L, 2603L, 2707L,
2811L, 2915L, 3019L, 3123L, 3227L, 3331L, 3435L, 3539L, 3643L,
3747L, 3851L, 3955L, 4059L, 4163L, 4267L, 4371L, 4475L, 4579L,
4683L, 4787L, 4891L, 4995L, 5099L, 5203L, 4L, 108L, 212L, 316L,
420L, 524L, 628L, 732L, 836L, 940L, 1044L, 1148L, 1252L, 1356L,
1460L, 1564L, 1668L, 1772L, 1876L, 1980L, 2084L, 2188L, 2292L,
2396L, 2500L, 2604L, 2708L, 2812L, 2916L, 3020L, 3124L, 3228L,
3332L, 3436L, 3540L, 3644L, 3748L, 3852L, 3956L, 4060L, 4164L,
4268L, 4372L, 4476L, 4580L, 4684L, 4788L), class = "data.frame")`
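I would like to calculate the mean and interquartile range (IQR) of `Score` for each `Participant` across the 51 trials, labelled here as `Event`. I would then like to remove a participant's `Score` observations if they fall more than ±3 outside the interquartile range.
What is the most efficient way to do this?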
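I looked into the `cast` function in `reshape2`, hoping to convert the data frame to wide format first, but have had no success.
I also looked into grouping the rows by `Participant`, but could not find a tutorial clear enough to walk me through it.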
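The following does what the question asks, grouping `Score` by `Participant`. Note that the code places the fences at Q1 - 1.5*IQR and Q3 + 1.5*IQR; change the multiplier `1.5` if a different cutoff is wanted.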
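# For each participant, return the mean, the IQR and the 1.5 * IQR fences
agg <- aggregate(Score ~ Participant, mydata, function(x){
  qq <- quantile(x, probs = c(1, 3)/4)  # Q1 and Q3
  iqr <- diff(qq)
  lo <- qq[1] - 1.5*iqr                 # lower fence
  hi <- qq[2] + 1.5*iqr                 # upper fence
  c(Mean = mean(x), IQR = unname(iqr), lower = lo, high = hi)
})
# aggregate() stores the four statistics in one matrix column; spread them out
agg <- cbind(agg[1], agg[[2]])
agg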
#  Participant       Mean  IQR lower.25% high.75%
#1           1  0.3529412  0.0      0.00     0.00
#2           2 12.4117647 18.5    -27.75    46.25
#3           3 13.6666667 17.5    -26.25    43.75
#4           4 12.4255319 17.0    -22.50    45.50
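Note that the IQR for `Participant == 1` is zero, because all of that participant's values are `0` except for a single `Score == 18`. With a zero IQR both fences sit at 0, so that value is an outlier: it does not lie between `Q1` and `Q3`.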
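To make the zero-IQR case concrete, here is a small check on a vector built to mimic Participant 1. The vector `x` is made up for illustration (50 zeros and one 18) rather than read from `mydata`, and the 1.5 multiplier is the one used in the answer's code.

```r
# Hypothetical vector mimicking Participant 1: 50 zeros and one score of 18
x <- c(rep(0, 50), 18)

qq  <- quantile(x, probs = c(1, 3)/4)  # Q1 and Q3 are both 0
iqr <- unname(diff(qq))                # so the IQR is 0

# With a zero IQR, both fences collapse onto 0
lo <- unname(qq[1]) - 1.5 * iqr
hi <- unname(qq[2]) + 1.5 * iqr

# Only the single nonzero score falls outside [lo, hi]
x[x < lo | x > hi]
# [1] 18
```

Any multiplier gives the same result here, since it is multiplied by an IQR of zero.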
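# Attach each participant's fences to every one of that participant's rows
mrg <- merge(mydata, agg[c(1, 4, 5)])
# TRUE where Score lies below the lower fence or above the upper fence
inx <- apply(mrg[c(2, 4, 5)], 1, function(x) x[1] < x[2] | x[1] > x[3])
# Subset mrg rather than mydata, since merge() does not guarantee mydata's row order
result <- mrg[!inx, names(mydata)]
row.names(result) <- NULL
head(result)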
#  Participant Score Event
#1           1     0     1
#2           1     0     2
#3           1     0     3
#4           1     0     4
#5           1     0     5
#6           1     0     6
Finally, clean up.
rm(inx, mrg)
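As a possible alternative, the same filter can be sketched without `aggregate()` and `merge()` by using base R's `ave()`, which returns a group statistic aligned with the original rows. This is only a sketch under stated assumptions: the data frame `toy`, the multiplier `k` and the names `q1`, `q3`, `keep` and `result2` are all made up for illustration, and `k` can be set to 3 if that is the cutoff intended in the question.

```r
# Toy data (made up for illustration): two participants, five scores each
toy <- data.frame(
  Participant = rep(1:2, each = 5),
  Score       = c(1, 2, 3, 4, 100, 10, 11, 12, 13, 14)
)

k <- 1.5  # fence multiplier; an assumption, set to 3 for a wider cutoff

# ave() returns a vector as long as Score holding each row's group statistic
q1  <- ave(toy$Score, toy$Participant, FUN = function(x) quantile(x, 0.25))
q3  <- ave(toy$Score, toy$Participant, FUN = function(x) quantile(x, 0.75))
iqr <- q3 - q1

# Keep rows whose Score lies within [Q1 - k*IQR, Q3 + k*IQR] for their participant
keep    <- toy$Score >= q1 - k * iqr & toy$Score <= q3 + k * iqr
result2 <- toy[keep, ]
```

Because the fences are computed in the row order of the data itself, no merge step is needed and the rows keep their original order. On the question's data, `toy` would be replaced by `mydata`.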