Apply function to each level of a factor (participant) to remove outliers based on distance from mean in standard deviation in R

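I have a data.frame in R in which one column identifies the participants in an experiment (subject), further columns give the condition and the trial_type, and the last one of interest is my numeric dependent variable, rt.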

Here is a reproducible example of the first 64 rows of my data.frame, generated with dput():
 structure(list(subject = structure(c(21L, 21L, 21L, 21L, 21L, 
21L, 21L, 21L, 21L, 21L, 21L, 21L, 21L, 21L, 21L, 21L, 24L, 24L, 
24L, 24L, 24L, 24L, 24L, 24L, 24L, 24L, 24L, 24L, 24L, 24L, 24L, 
24L, 27L, 27L, 27L, 27L, 27L, 27L, 27L, 27L, 27L, 27L, 27L, 27L, 
27L, 27L, 27L, 27L, 47L, 47L, 47L, 47L, 47L, 47L, 47L, 47L, 47L, 
47L, 47L, 47L, 47L, 47L, 47L, 47L), .Label = c("1p12", "1p13", 
"1p15", "1p30", "1p36", "1p39", "1p43", "1p46", "1p49", "1p59", 
"1p60", "1p67", "1p69", "1p79", "1p80", "1p81", "1p84", "1p85", 
"1p88", "1p9", "2p1", "2p11", "2p18", "2p2", "2p22", "2p25", 
"2p3", "2p31", "2p33", "2p42", "2p44", "2p5", "2p50", "2p58", 
"2p63", "2p72", "2p76", "2p78", "2p8", "2p83", "3p10", "3p16", 
"3p20", "3p28", "3p32", "3p34", "3p4", "3p41", "3p47", "3p54", 
"3p55", "3p56", "3p61", "3p64", "3p66", "3p7", "3p77", "3p82", 
"3p86", "3p87", "4p14", "4p17", "4p19", "4p21", "4p24", "4p26", 
"4p27", "4p29", "4p35", "4p37", "4p38", "4p48", "4p51", "4p57", 
"4p6", "4p62", "4p68", "4p70", "4p74", "4p75"), class = "factor"), 
    rt = c(4303L, 5616L, 1317L, 1663L, 1353L, 645L, 648L, 457L, 
    2359L, 2497L, 832L, 523L, 1427L, 511L, 483L, 1300L, 873L, 
    1185L, 1752L, 2037L, 4849L, 2975L, 1621L, 1235L, 3008L, 1560L, 
    1075L, 4596L, 1129L, 1093L, 1302L, 1414L, 5542L, 2369L, 4944L, 
    2338L, 1274L, 1837L, 3384L, 1338L, 2002L, 1756L, 2516L, 1868L, 
    2017L, 1337L, 1106L, 1388L, 6812L, 5579L, 1695L, 1976L, 1897L, 
    4484L, 3095L, 1865L, 2283L, 1659L, 1328L, 1882L, 1483L, 1993L, 
    1776L, 2256L), condition = structure(c(1L, 1L, 1L, 1L, 1L, 
    1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
    1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
    1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 
    2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L), .Label = c("reliable", 
    "unreliable"), class = "factor"), trial_type = structure(c(2L, 
    2L, 2L, 1L, 2L, 1L, 1L, 1L, 2L, 2L, 1L, 1L, 2L, 1L, 1L, 2L, 
    2L, 1L, 1L, 1L, 2L, 2L, 2L, 1L, 2L, 1L, 2L, 1L, 1L, 2L, 1L, 
    2L, 2L, 2L, 2L, 1L, 1L, 1L, 2L, 1L, 1L, 1L, 2L, 2L, 1L, 2L, 
    1L, 2L, 2L, 2L, 1L, 1L, 1L, 2L, 2L, 1L, 1L, 2L, 1L, 1L, 2L, 
    1L, 2L, 2L), .Label = c("same", "switch"), class = "factor"), 
    accuracy = c(0L, 0L, 0L, 1L, 1L, 1L, 1L, 1L, 0L, 0L, 1L, 
    1L, 0L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 0L, 0L, 1L, 0L, 1L, 
    0L, 1L, 1L, 0L, 1L, 0L, 0L, 1L, 0L, 0L, 1L, 1L, 0L, 1L, 1L, 
    1L, 0L, 0L, 1L, 0L, 1L, 0L, 0L, 0L, 1L, 1L, 1L, 0L, 0L, 1L, 
    1L, 0L, 1L, 1L, 0L, 1L, 0L, 1L)), row.names = c(NA, -64L), class = c("tbl_df", 
"tbl", "data.frame"))

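I want to remove outliers based on how far each score in rt lies from its participant's mean, measured in standard deviations. There are 16 rows per participant and 80 participants in total. R correctly treats the 16 rows of a given participant as belonging to one level of the subject factor.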

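I want to use this to remove outliers from rt separately for each level of subject and each level of trial_type. The procedure I currently use to remove outliers across all participants is: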

# Calculate mean and `sd` for each level of `trial_type`

# For 'same':

MeanSame <- mean(RTs$rt[RTs$trial_type == "same"])
SDSame <- sd(RTs$rt[RTs$trial_type == "same"])

# For 'switch':

MeanSwitch <- mean(RTs$rt[RTs$trial_type == "switch"])
SDSwitch <- sd(RTs$rt[RTs$trial_type == "switch"])

# Create upper and lower cut for level 'same' of 'trial_type':
UpperSame <- MeanSame + 2.5*SDSame
LowerSame <- MeanSame - 2.5*SDSame

# Create upper and lower cut for level 'switch' of 'trial_type':
UpperSwitch <-2.5*SDSwitch + MeanSwitch
LowerSwitch <- MeanSwitch - 2.5*SDSwitch


#Identify Outliers in Same

OutliersSameUpper <- which(RTs$rt > UpperSame & RTs$trial_type == "same") 

OutliersSameLower <- which(RTs$rt < LowerSame & RTs$trial_type == "same")


# Identify Outliers in Switch

OutliersSwitchUpper <- which(RTs$rt > UpperSwitch  & RTs$trial_type == "switch") 

OutliersSwitchLower <- which(RTs$rt < LowerSwitch & RTs$trial_type == "switch")


# Create new data.frame without the identified outliers:

RTsClean <- RTs[-c(OutliersSameUpper,OutliersSameLower,OutliersSwitchUpper,OutliersSwitchLower),]

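Across all participants, I do this by computing the mean and SD for each level of trial_type and then finding the rows in rt that fall above the upper or below the lower cut-off. However, I cannot figure out how to do this for each level of the subject vector. The apply functions do not work for me, because applying a single function is not enough: I also need to keep track of other variables (the mean and sd, and the upper and lower cut-offs).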

It seems to me that the way to go is to write a function, or a for loop, or both. But that is beyond my skill level in R.

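I would appreciate anyone who could help me find the best way to apply the method for outlier removal I specified above, but in a way that can be applied separately to each level of the subject vector. This means that the mean and sd variables need to be specified for each level of subject.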

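If possible, it would be even better to create a new column in the data.frame that specified, for each row in rt, whether that row was considered to be an outlier or not. But I don't have a clue how I could achieve that.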

Thanks in advance for your help.

Based on this requirement from the OP:

I would appreciate anyone who could help me find the best way to apply the method for outlier removal I specified above, but in a way that can be applied separately to each level of the subject vector. This means that the mean and sd variables need to be specified for each level of subject.

The following removes three rows as outliers:


library(dplyr)

RTs %>% group_by(subject) %>%
  filter(rt <= mean(rt) + (2.5 * sd(rt)),  rt >= mean(rt) - (2.5 * sd(rt)))

#> # A tibble: 61 x 5
#> # Groups:   subject [4]
#>    subject    rt condition trial_type accuracy
#>    <fct>   <int> <fct>     <fct>         <int>
#>  1 2p1      4303 reliable  switch            0
#>  2 2p1      1317 reliable  switch            0
#>  3 2p1      1663 reliable  same              1
#>  4 2p1      1353 reliable  switch            1
#>  5 2p1       645 reliable  same              1
#>  6 2p1       648 reliable  same              1
#>  7 2p1       457 reliable  same              1
#>  8 2p1      2359 reliable  switch            0
#>  9 2p1      2497 reliable  switch            0
#> 10 2p1       832 reliable  same              1
#> # ... with 51 more rows

Based on the OP's further requirement:

If possible, it would be even better to create a new column in the data.frame that specified, for each row in rt, whether that row was considered to be an outlier or not. But I don't have a clue how I could achieve that.

RTs %>%
  group_by(subject) %>%
  mutate(OUTLIER = rt >= mean(rt) + (2.5 * sd(rt)) | rt <= mean(rt) - (2.5 * sd(rt)))

# A tibble: 64 x 6
# Groups:   subject [4]
   subject    rt condition trial_type accuracy OUTLIER
   <fct>   <int> <fct>     <fct>         <int> <lgl>  
 1 2p1      4303 reliable  switch            0 FALSE  
 2 2p1      5616 reliable  switch            0 TRUE   
 3 2p1      1317 reliable  switch            0 FALSE  
 4 2p1      1663 reliable  same              1 FALSE  
 5 2p1      1353 reliable  switch            1 FALSE  
 6 2p1       645 reliable  same              1 FALSE  
 7 2p1       648 reliable  same              1 FALSE  
 8 2p1       457 reliable  same              1 FALSE  
 9 2p1      2359 reliable  switch            0 FALSE  
10 2p1      2497 reliable  switch            0 FALSE  
# ... with 54 more rows

Here TRUE means the row is an outlier and FALSE means it is not.
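The question also asks for cut-offs computed separately for each level of trial_type within each subject. The following is only a sketch of that variant, grouping by both factors and assuming the same column names as in RTs. The toy data frame, the z column and the 10 trials per cell are made up for illustration.

```r
library(dplyr)

# Hypothetical toy data: 2 subjects x 2 trial types x 10 trials per cell
toy <- expand.grid(trial = 1:10,
                   trial_type = c("same", "switch"),
                   subject = c("s1", "s2"),
                   stringsAsFactors = FALSE)
toy$rt <- 500 + 10 * toy$trial   # 510, 520, ..., 600 in every cell
toy$rt[1] <- 5000                # plant one extreme value in the s1 / same cell

flagged <- toy %>%
  group_by(subject, trial_type) %>%
  mutate(z = (rt - mean(rt)) / sd(rt),   # distance from the cell mean in SD units
         outlier = abs(z) > 2.5) %>%     # same 2.5 SD rule as in the question
  ungroup()

# Keep only the rows that were not flagged
cleaned <- filter(flagged, !outlier)
```

Keeping z as its own column makes it easy to inspect borderline rows before deciding what to drop.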


Base R approach:

RTs$outlier <- as.logical(ave(RTs$rt, RTs$subject, 
               FUN = function(.x) (.x >= mean(.x) + 2.5 * sd(.x)) | 
                 (.x <= mean(.x) - 2.5 * sd(.x))))
RTs

# A tibble: 64 x 6
   subject    rt condition trial_type accuracy outlier
   <fct>   <int> <fct>     <fct>         <int> <lgl>  
 1 2p1      4303 reliable  switch            0 FALSE  
 2 2p1      5616 reliable  switch            0 TRUE   
 3 2p1      1317 reliable  switch            0 FALSE  
 4 2p1      1663 reliable  same              1 FALSE  
 5 2p1      1353 reliable  switch            1 FALSE  
 6 2p1       645 reliable  same              1 FALSE  
 7 2p1       648 reliable  same              1 FALSE  
 8 2p1       457 reliable  same              1 FALSE  
 9 2p1      2359 reliable  switch            0 FALSE  
10 2p1      2497 reliable  switch            0 FALSE  
# ... with 54 more rows
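ave() accepts more than one grouping factor, so the same idea can be extended to subject and trial_type together. This is a minimal sketch on made-up toy data; the helper name is_out and the 10 trials per cell are assumptions for illustration.

```r
# Hypothetical toy data: 2 subjects x 2 trial types x 10 trials per cell
toy <- expand.grid(trial = 1:10,
                   trial_type = c("same", "switch"),
                   subject = c("s1", "s2"),
                   stringsAsFactors = FALSE)
toy$rt <- 500 + 10 * toy$trial   # 510, 520, ..., 600 in every cell
toy$rt[1] <- 5000                # plant one extreme value in the s1 / same cell

# TRUE where a value lies more than k SDs from the mean of its own group
is_out <- function(x, k = 2.5) abs(x - mean(x)) > k * sd(x)

# ave() applies is_out within each subject x trial_type cell;
# the result comes back as 0/1, hence as.logical()
toy$outlier <- as.logical(ave(toy$rt, toy$subject, toy$trial_type, FUN = is_out))

# Logical indexing drops the flagged rows
toy_clean <- toy[!toy$outlier, ]
```

Logical indexing with !toy$outlier also behaves sensibly when nothing is flagged, whereas negative indexing with an empty which() result, as in RTs[-c(...), ], selects zero rows.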

Here you can get the result with group_by and summarise. You can adapt it to whatever you need:

library(dplyr)
RTs %>% 
  group_by(subject, trial_type) %>% 
  summarise(mean = mean(rt), sd = sd(rt), Upper = 2.5*sd + mean, Lower = mean - 2.5*sd,
            # name the flag columns with `=`; with `<-` the whole expression becomes the column name
            OutliersUpper = rt > Upper, OutliersLower = rt < Lower)

Output:

   subject trial_type  mean    sd Upper  Lower OutliersUpper OutliersLower
   <fct>   <fct>      <dbl> <dbl> <dbl>  <dbl> <lgl>         <lgl>        
 1 2p1     same        720.  400. 1720.  -280. FALSE         FALSE        
 2 2p1     same        720.  400. 1720.  -280. FALSE         FALSE        
 3 2p1     same        720.  400. 1720.  -280. FALSE         FALSE        
 4 2p1     same        720.  400. 1720.  -280. FALSE         FALSE        
 5 2p1     same        720.  400. 1720.  -280. FALSE         FALSE        
 6 2p1     same        720.  400. 1720.  -280. FALSE         FALSE        
 7 2p1     same        720.  400. 1720.  -280. FALSE         FALSE        
 8 2p1     same        720.  400. 1720.  -280. FALSE         FALSE        
 9 2p1     switch     2522. 1616. 6562. -1519. FALSE         FALSE        

You can adapt it, for example:

RTs %>% 
  group_by(subject, trial_type, condition, accuracy) %>% 
  summarise(mean = mean(rt), sd = sd(rt), Upper = 2.5*sd + mean, Lower = mean - 2.5*sd,
            # use `=` so the flag columns get proper names
            OutliersUpper = rt > Upper, OutliersLower = rt < Lower)

A more general version of the other answers.

First, convert your existing filtering into a function:

dropOutliers <- function(.x, .y, multiplier=2.5) {
  limits <- .x %>% 
              group_by(trial_type) %>% 
              summarise(
                SD=sd(rt), 
                Mean=mean(rt), 
                Lower=Mean - multiplier * SD, 
                Upper=Mean + multiplier * SD
              )
  .x %>% 
    left_join(
      limits, 
      by="trial_type"
    ) %>% 
    filter(rt >= Lower && rt <= Upper) %>% 
    select(-SD, -Mean, -Upper, -Lower)
}

I've used .x and .y as argument names because they match the documentation for group_map(), which I'll use later. multiplier is there for testing.

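Now apply the function to the data frame (called d here), grouped in whatever way you wish. group_map() returns a list of tibbles, so pass the return value to bind_rows() to get a single combined tibble if required.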

d %>% group_by(subject) %>% group_map(dropOutliers) %>% bind_rows()

This doesn't appear to remove any rows, so apply more aggressive filtering to check:

d %>% group_by(subject) %>% group_map(dropOutliers, multiplier=1) %>% bind_rows()
# A tibble: 16 x 4
      rt condition trial_type accuracy
   <int> <fct>     <fct>         <int>
 1   873 reliable  switch            1
 2  1185 reliable  same              1
 3  1752 reliable  same              1
 4  2037 reliable  same              1
 5  4849 reliable  switch            1
 6  2975 reliable  switch            0
 7  1621 reliable  switch            0
 8  1235 reliable  same              1
 9  3008 reliable  switch            0
10  1560 reliable  same              1
11  1075 reliable  switch            0
12  4596 reliable  same              1
13  1129 reliable  same              1
14  1093 reliable  switch            0
15  1302 reliable  same              1
16  1414 reliable  switch            0
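One thing to watch in dropOutliers(): `&&` is not vectorised. Older versions of R look only at the first element of each side, which is consistent with the output above holding all 16 rows of a single subject, and since R 4.3.0 `&&` raises an error when given vectors longer than one. The following is a sketch of an element-wise variant, not the original answer's code. It uses a grouped filter instead of the join and group_modify() instead of group_map() so that the subject column is kept. The name drop_outliers_vec and the toy data are illustrative.

```r
library(dplyr)

# Element-wise variant: within each trial_type, keep rows inside mean +/- multiplier * SD
drop_outliers_vec <- function(.x, .y, multiplier = 2.5) {
  .x %>%
    group_by(trial_type) %>%
    filter(abs(rt - mean(rt)) <= multiplier * sd(rt)) %>%
    ungroup()
}

# Hypothetical toy data: 2 subjects x 2 trial types x 10 trials per cell
toy <- expand.grid(trial = 1:10,
                   trial_type = c("same", "switch"),
                   subject = c("s1", "s2"),
                   stringsAsFactors = FALSE)
toy$rt <- 500 + 10 * toy$trial   # 510, 520, ..., 600 in every cell
toy$rt[1] <- 5000                # plant one extreme value in the s1 / same cell

# group_modify() applies the function per subject and returns one grouped tibble
cleaned <- toy %>%
  group_by(subject) %>%
  group_modify(drop_outliers_vec) %>%
  ungroup()
```

The toy data uses 10 trials per cell for a reason. In a sample of n values, no value can lie more than (n - 1)/sqrt(n) sample SDs from the mean. With 8 trials per subject by trial_type cell, as in the 64 sample rows, that bound is about 2.47, so a 2.5 SD cut-off cannot remove anything at that level of grouping.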