Apply a function to each level of a factor (participant) to remove outliers based on distance from the mean in standard deviations, in R
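I have a data.frame in R with one column identifying the participants in an experiment (subject), further columns for condition and trial_type, and a final column holding my numeric dependent variable, rt.
Here is a reproducible example of my data.frame, the first 64 rows, generated with dput():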
structure(list(subject = structure(c(21L, 21L, 21L, 21L, 21L,
21L, 21L, 21L, 21L, 21L, 21L, 21L, 21L, 21L, 21L, 21L, 24L, 24L,
24L, 24L, 24L, 24L, 24L, 24L, 24L, 24L, 24L, 24L, 24L, 24L, 24L,
24L, 27L, 27L, 27L, 27L, 27L, 27L, 27L, 27L, 27L, 27L, 27L, 27L,
27L, 27L, 27L, 27L, 47L, 47L, 47L, 47L, 47L, 47L, 47L, 47L, 47L,
47L, 47L, 47L, 47L, 47L, 47L, 47L), .Label = c("1p12", "1p13",
"1p15", "1p30", "1p36", "1p39", "1p43", "1p46", "1p49", "1p59",
"1p60", "1p67", "1p69", "1p79", "1p80", "1p81", "1p84", "1p85",
"1p88", "1p9", "2p1", "2p11", "2p18", "2p2", "2p22", "2p25",
"2p3", "2p31", "2p33", "2p42", "2p44", "2p5", "2p50", "2p58",
"2p63", "2p72", "2p76", "2p78", "2p8", "2p83", "3p10", "3p16",
"3p20", "3p28", "3p32", "3p34", "3p4", "3p41", "3p47", "3p54",
"3p55", "3p56", "3p61", "3p64", "3p66", "3p7", "3p77", "3p82",
"3p86", "3p87", "4p14", "4p17", "4p19", "4p21", "4p24", "4p26",
"4p27", "4p29", "4p35", "4p37", "4p38", "4p48", "4p51", "4p57",
"4p6", "4p62", "4p68", "4p70", "4p74", "4p75"), class = "factor"),
rt = c(4303L, 5616L, 1317L, 1663L, 1353L, 645L, 648L, 457L,
2359L, 2497L, 832L, 523L, 1427L, 511L, 483L, 1300L, 873L,
1185L, 1752L, 2037L, 4849L, 2975L, 1621L, 1235L, 3008L, 1560L,
1075L, 4596L, 1129L, 1093L, 1302L, 1414L, 5542L, 2369L, 4944L,
2338L, 1274L, 1837L, 3384L, 1338L, 2002L, 1756L, 2516L, 1868L,
2017L, 1337L, 1106L, 1388L, 6812L, 5579L, 1695L, 1976L, 1897L,
4484L, 3095L, 1865L, 2283L, 1659L, 1328L, 1882L, 1483L, 1993L,
1776L, 2256L), condition = structure(c(1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L,
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L), .Label = c("reliable",
"unreliable"), class = "factor"), trial_type = structure(c(2L,
2L, 2L, 1L, 2L, 1L, 1L, 1L, 2L, 2L, 1L, 1L, 2L, 1L, 1L, 2L,
2L, 1L, 1L, 1L, 2L, 2L, 2L, 1L, 2L, 1L, 2L, 1L, 1L, 2L, 1L,
2L, 2L, 2L, 2L, 1L, 1L, 1L, 2L, 1L, 1L, 1L, 2L, 2L, 1L, 2L,
1L, 2L, 2L, 2L, 1L, 1L, 1L, 2L, 2L, 1L, 1L, 2L, 1L, 1L, 2L,
1L, 2L, 2L), .Label = c("same", "switch"), class = "factor"),
accuracy = c(0L, 0L, 0L, 1L, 1L, 1L, 1L, 1L, 0L, 0L, 1L,
1L, 0L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 0L, 0L, 1L, 0L, 1L,
0L, 1L, 1L, 0L, 1L, 0L, 0L, 1L, 0L, 0L, 1L, 1L, 0L, 1L, 1L,
1L, 0L, 0L, 1L, 0L, 1L, 0L, 0L, 0L, 1L, 1L, 1L, 0L, 0L, 1L,
1L, 0L, 1L, 1L, 0L, 1L, 0L, 1L)), row.names = c(NA, -64L), class = c("tbl_df",
"tbl", "data.frame"))
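I would like to remove outliers based on how far each score in rt lies from that participant's mean, measured in standard deviations. There are 16 rows per participant and 80 participants in total. R correctly interprets the 16 rows of a given participant as belonging to a single factor level.
I would like to use this to remove outliers from rt separately for each level of subject and for each level of trial_type. The procedure I used to remove outliers across all participants is: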
# Calculate mean and `sd` for each level of `trial_type`
# For 'same':
MeanSame <- mean(RTs$rt[RTs$trial_type == "same"])
SDSame <- sd(RTs$rt[RTs$trial_type == "same"])
# For 'switch':
MeanSwitch <- mean(RTs$rt[RTs$trial_type == "switch"])
SDSwitch <- sd(RTs$rt[RTs$trial_type == "switch"])
# Create upper and lower cut for level 'same' of 'trial_type':
UpperSame <- MeanSame + 2.5*SDSame
LowerSame <- MeanSame - 2.5*SDSame
# Create upper and lower cut for level 'switch' of 'trial_type':
UpperSwitch <- MeanSwitch + 2.5*SDSwitch
LowerSwitch <- MeanSwitch - 2.5*SDSwitch
# Identify Outliers in Same
OutliersSameUpper <- which(RTs$rt > UpperSame & RTs$trial_type == "same")
OutliersSameLower <- which(RTs$rt < LowerSame & RTs$trial_type == "same")
# Identify Outliers in Switch
OutliersSwitchUpper <- which(RTs$rt > UpperSwitch & RTs$trial_type == "switch")
OutliersSwitchLower <- which(RTs$rt < LowerSwitch & RTs$trial_type == "switch")
# Create new data.frame without the identified outliers:
RTsClean <- RTs[-c(OutliersSameUpper,OutliersSameLower,OutliersSwitchUpper,OutliersSwitchLower),]
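Across all participants, I did this by computing the mean and SD for each condition and then finding the rows in rt that fall above the upper or below the lower cut-off. However, I cannot figure out how to do this for each level of the subject vector. The apply functions did not work for me, because applying a single function is not enough: I also need to keep track of other variables (the mean and sd, and the upper and lower cut-offs).
It seems to me that the way to go is to write a function, or a for loop, or both. But that is beyond my skill level in R.
I would appreciate anyone who could help me find the best way to apply the outlier-removal method I specified above, but in a way that can be applied separately to each level of the subject vector. This means that the mean and sd variables need to be computed for each level of subject.
If possible, it would be even better to create a new column in the data.frame that specifies, for each row in rt, whether that row is considered an outlier. But I don't have a clue how to achieve that.
Thanks in advance for your help.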
Based on this requirement from the OP:
"I would appreciate anyone who could help me find the best way to apply the method for outlier removal I specified above, but in a way that can be applied separately to each level of the subject vector. This means that the mean and sd variables need to be specified for each level of subject."
Grouping by subject and filtering on the per-subject bounds removes three rows as outliers:
library(dplyr)
RTs %>% group_by(subject) %>%
filter(rt <= mean(rt) + (2.5 * sd(rt)), rt >= mean(rt) - (2.5 * sd(rt)))
#> # A tibble: 61 x 5
#> # Groups: subject [4]
#> subject rt condition trial_type accuracy
#> <fct> <int> <fct> <fct> <int>
#> 1 2p1 4303 reliable switch 0
#> 2 2p1 1317 reliable switch 0
#> 3 2p1 1663 reliable same 1
#> 4 2p1 1353 reliable switch 1
#> 5 2p1 645 reliable same 1
#> 6 2p1 648 reliable same 1
#> 7 2p1 457 reliable same 1
#> 8 2p1 2359 reliable switch 0
#> 9 2p1 2497 reliable switch 0
#> 10 2p1 832 reliable same 1
#> # ... with 51 more rows
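To see what the grouped filter computes, here is a base R sketch on made-up data (the data frame toy and its values are hypothetical, not the OP's data). It builds each row's group mean and SD with ave() and keeps the rows that lie within 2.5 SD of their own subject's mean:

```r
# Hypothetical data: 11 trials for each of two subjects
toy <- data.frame(
  subject = factor(rep(c("a", "b"), each = 11)),
  rt = c(500, 510, 490, 505, 495, 500, 510, 490, 505, 495, 3000,
         600, 620, 580, 610, 590, 605, 595, 615, 585, 600, 600)
)

k <- 2.5
grp_mean <- ave(toy$rt, toy$subject)            # ave() defaults to the mean
grp_sd   <- ave(toy$rt, toy$subject, FUN = sd)  # per-subject SD, repeated per row

# Keep rows inside [mean - k * sd, mean + k * sd] of their own subject
keep <- toy$rt >= grp_mean - k * grp_sd & toy$rt <= grp_mean + k * grp_sd
toy_clean <- toy[keep, ]

nrow(toy)        # 22
nrow(toy_clean)  # 21: only the 3000 ms trial of subject "a" is dropped
```

Because the bounds are stored per row, the same vectors can be reused for flagging instead of filtering.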
Based on the OP's further requirement:
If possible, it would be even better to create a new column in the data.frame that specified, for each row in rt, whether that row was considered to be an outlier or not. But I don't have a clue how I could achieve that.
# Strict inequalities make OUTLIER the exact complement of the filter() above
RTs %>%
  group_by(subject) %>%
  mutate(OUTLIER = rt > mean(rt) + (2.5 * sd(rt)) | rt < mean(rt) - (2.5 * sd(rt)))
# A tibble: 64 x 6
# Groups: subject [4]
subject rt condition trial_type accuracy OUTLIER
<fct> <int> <fct> <fct> <int> <lgl>
1 2p1 4303 reliable switch 0 FALSE
2 2p1 5616 reliable switch 0 TRUE
3 2p1 1317 reliable switch 0 FALSE
4 2p1 1663 reliable same 1 FALSE
5 2p1 1353 reliable switch 1 FALSE
6 2p1 645 reliable same 1 FALSE
7 2p1 648 reliable same 1 FALSE
8 2p1 457 reliable same 1 FALSE
9 2p1 2359 reliable switch 0 FALSE
10 2p1 2497 reliable switch 0 FALSE
# ... with 54 more rows
TRUE marks a row as an outlier; FALSE means it is not one.
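Once a logical flag column exists, it can be used both to count outliers per participant and to drop them. A small base R sketch with a hypothetical, hand-flagged data frame follows. Indexing with the logical vector itself avoids a pitfall of negative which() indices, which select zero rows when nothing is flagged:

```r
# Hypothetical data with an outlier flag already present
flagged <- data.frame(
  subject = factor(c("a", "a", "a", "b", "b", "b")),
  rt      = c(500, 3000, 520, 610, 590, 600),
  OUTLIER = c(FALSE, TRUE, FALSE, FALSE, FALSE, FALSE)
)

# Number of flagged rows per subject
n_out <- tapply(flagged$OUTLIER, flagged$subject, sum)
n_out

# Drop flagged rows with a logical mask
clean <- flagged[!flagged$OUTLIER, ]
nrow(clean)

# Pitfall: with no TRUE values, -which(...) is an empty index and returns no rows
b_only <- flagged[flagged$subject == "b", ]
nrow(b_only[-which(b_only$OUTLIER), ])  # 0 rows, although nothing was flagged
nrow(b_only[!b_only$OUTLIER, ])         # 3 rows, as intended
```

The same pitfall applies to the negative-index removal in the question whenever no row exceeds the cut-offs.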
The base R way:
# ave() returns a vector of the same type as rt (0/1 here), hence as.logical()
RTs$outlier <- as.logical(ave(RTs$rt, RTs$subject,
    FUN = function(.x) (.x > mean(.x) + 2.5 * sd(.x)) |
                       (.x < mean(.x) - 2.5 * sd(.x))))
RTs
# A tibble: 64 x 6
subject rt condition trial_type accuracy outlier
<fct> <int> <fct> <fct> <int> <lgl>
1 2p1 4303 reliable switch 0 FALSE
2 2p1 5616 reliable switch 0 TRUE
3 2p1 1317 reliable switch 0 FALSE
4 2p1 1663 reliable same 1 FALSE
5 2p1 1353 reliable switch 1 FALSE
6 2p1 645 reliable same 1 FALSE
7 2p1 648 reliable same 1 FALSE
8 2p1 457 reliable same 1 FALSE
9 2p1 2359 reliable switch 0 FALSE
10 2p1 2497 reliable switch 0 FALSE
# ... with 54 more rows
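The question also asks for cut-offs per subject and per trial_type. ave() accepts several grouping factors, so the same base R idea extends directly. A sketch on made-up data (toy, is_out and is_outlier are hypothetical names, and the values are invented):

```r
# Hypothetical data: 2 subjects x 2 trial types, 10 trials per cell
base <- c(-20, -15, -10, -5, 0, 0, 5, 10, 15, 20)
toy <- data.frame(
  subject    = factor(rep(c("a", "b"), each = 20)),
  trial_type = factor(rep(rep(c("same", "switch"), each = 10), times = 2)),
  rt         = c(700 + base, 1000 + base, 800 + base, 1200 + base)
)
toy$rt[20] <- 5000  # one extreme trial in cell (a, switch)

# TRUE when a value lies more than k SDs from the mean of its own cell
is_out <- function(x, k = 2.5) abs(x - mean(x)) > k * sd(x)

# Both factors are passed to ave(), so is_out() runs once per subject x trial_type cell
toy$is_outlier <- as.logical(ave(toy$rt, toy$subject, toy$trial_type, FUN = is_out))

which(toy$is_outlier)  # 20
```

Every argument between the data vector and FUN is treated by ave() as a grouping variable, which is why the multiplier is set through the default of is_out() rather than passed as an extra argument.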
Here is how you can get the result with group_by and summarise.
You can adapt it to whatever you need:
library(dplyr)
# Name new columns with `=`; `<-` inside summarise() yields columns
# literally called `OutliersUpper <- rt > Upper`.
# In dplyr >= 1.1.0, reframe() is the recommended verb when a group
# can return more than one row.
RTs %>%
  group_by(subject, trial_type) %>%
  summarise(mean = mean(rt), sd = sd(rt), Upper = 2.5*sd + mean, Lower = mean - 2.5*sd,
            OutliersUpper = rt > Upper, OutliersLower = rt < Lower)
Output (first rows):
  subject trial_type  mean    sd Upper  Lower OutliersUpper OutliersLower
  <fct>   <fct>      <dbl> <dbl> <dbl>  <dbl> <lgl>         <lgl>
1 2p1     same        720.  400. 1720.  -280. FALSE         FALSE
2 2p1     same        720.  400. 1720.  -280. FALSE         FALSE
3 2p1     same        720.  400. 1720.  -280. FALSE         FALSE
4 2p1     same        720.  400. 1720.  -280. FALSE         FALSE
5 2p1     same        720.  400. 1720.  -280. FALSE         FALSE
6 2p1     same        720.  400. 1720.  -280. FALSE         FALSE
7 2p1     same        720.  400. 1720.  -280. FALSE         FALSE
8 2p1     same        720.  400. 1720.  -280. FALSE         FALSE
9 2p1     switch     2522. 1616. 6562. -1519. FALSE         FALSE
You can adapt it, for example:
RTs %>%
  group_by(subject, trial_type, condition, accuracy) %>%
  summarise(mean = mean(rt), sd = sd(rt), Upper = 2.5*sd + mean, Lower = mean - 2.5*sd,
            OutliersUpper = rt > Upper, OutliersLower = rt < Lower)
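If one row per group is what you are after (the cut-offs plus a count of flagged trials), a split-apply-combine sketch in base R could look like this. The data and the helper summarise_group are hypothetical:

```r
# Hypothetical data: 11 trials for each of two subjects
toy <- data.frame(
  subject = factor(rep(c("a", "b"), each = 11)),
  rt = c(500, 510, 490, 505, 495, 500, 510, 490, 505, 495, 3000,
         600, 620, 580, 610, 590, 605, 595, 615, 585, 600, 600)
)

# One summary row for one group's data frame g
summarise_group <- function(g, k = 2.5) {
  m <- mean(g$rt)
  s <- sd(g$rt)
  data.frame(
    subject    = g$subject[1],
    mean       = m,
    sd         = s,
    lower      = m - k * s,
    upper      = m + k * s,
    n_outliers = sum(g$rt < m - k * s | g$rt > m + k * s)
  )
}

# split() cuts the data frame by subject, lapply() summarises each piece,
# and rbind() stacks the one-row results
per_subject <- do.call(rbind, lapply(split(toy, toy$subject), summarise_group))
per_subject
```

Here subject "a" ends up with one flagged trial (the 3000 ms one) and subject "b" with none.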
A generalised version of the other answers.
First, turn your existing filtering into a function:
library(dplyr)

dropOutliers <- function(.x, .y, multiplier = 2.5) {
  # Cut-offs per trial_type within the current group
  limits <- .x %>%
    group_by(trial_type) %>%
    summarise(
      SD = sd(rt),
      Mean = mean(rt),
      Lower = Mean - multiplier * SD,
      Upper = Mean + multiplier * SD
    )
  .x %>%
    left_join(limits, by = "trial_type") %>%
    # Vectorised `&` tests every row; scalar `&&` does not
    filter(rt >= Lower & rt <= Upper) %>%
    select(-SD, -Mean, -Upper, -Lower)
}
I named the arguments .x and .y because they match the documentation of group_map(), which I use below. multiplier is there for testing.
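When a filter condition combines two comparisons over a whole column, the element-wise & is the operator to use; && is intended for single TRUE/FALSE values. A minimal base R sketch on a made-up vector (rt, lower and upper are hypothetical values):

```r
rt <- c(4303, 1317, 1663, 645)  # made-up reaction times
lower <- 600
upper <- 2000

# `&` compares element by element and returns one result per value
keep <- rt >= lower & rt <= upper
keep      # FALSE TRUE TRUE TRUE
rt[keep]  # 1317 1663 645

# `&&` expects length-one operands. With longer vectors it raises an error in
# R >= 4.3.0; older versions used only the first element.
first_only <- rt[1] >= lower && rt[1] <= upper
first_only  # FALSE
```

Inside a grouped filter, a condition that only looks at the first element would keep or drop a whole group at once, which is rarely what an outlier rule intends.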
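Now apply the function to the data frame (called d here; it is the data from the question), grouped in whatever way you like. group_map() returns a list of tibbles, so pass the result to bind_rows() to get a single combined tibble if you need one.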
d %>% group_by(subject) %>% group_map(dropOutliers) %>% bind_rows()
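This doesn't appear to remove any rows, so apply a more aggressive filter to check:
d %>% group_by(subject) %>% group_map(dropOutliers, multiplier=1) %>% bind_rows()
# A tibble: 16 x 4
rt condition trial_type accuracy
<int> <fct> <fct> <int>
1 873 reliable switch 1
2 1185 reliable same 1
3 1752 reliable same 1
4 2037 reliable same 1
5 4849 reliable switch 1
6 2975 reliable switch 0
7 1621 reliable switch 0
8 1235 reliable same 1
9 3008 reliable switch 0
10 1560 reliable same 1
11 1075 reliable switch 0
12 4596 reliable same 1
13 1129 reliable same 1
14 1093 reliable switch 0
15 1302 reliable same 1
16 1414 reliable switch 0
Note that these 16 rows are exactly subject 2p2's trials, which is what a scalar && in the filter produces under R < 4.3: only the first row of each group is tested, so a subject is kept or dropped as a whole. A vectorised & tests each row on its own and would also drop rows such as rt = 4849 and rt = 4596, which lie more than one SD from their cell means.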
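One reason a 2.5 SD rule can leave every row in place: with the sample SD, no value in a group of n observations can lie more than (n - 1)/sqrt(n) SDs from the group mean. In the sample data each subject by trial_type cell holds 8 trials, so the bound is about 2.47, just under 2.5. A short check in base R (max_z is a hypothetical helper name):

```r
# Upper bound on |x - mean(x)| / sd(x) for any sample of size n
max_z <- function(n) (n - 1) / sqrt(n)

# An extreme case: seven identical values and one huge one
x <- c(rep(0, 7), 1e6)
z <- abs(x - mean(x)) / sd(x)

max(z)     # attains the bound for n = 8
max_z(8)   # about 2.47, so a 2.5 SD cut-off cannot flag anything
max_z(16)  # 3.75, so flagging is possible with all 16 trials of a subject
```

With small cells, a lower multiplier or a cut-off computed over more trials (for example per subject only) is needed before any row can be flagged.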