Identifying the IDs of outliers and removing outliers based on MAD for both the summarized dataset and averages of a time series

Question

I hope you are doing well.

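I have two complementary datasets: one holds a time series of concentration change values (Timeseries), and the other holds the averages of that time series (MeanConcentration).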

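I would like to identify outliers as values more than 3 median absolute deviations (MAD) from the median, for each variable in the MeanConcentration dataset. First, I want to find out the ID and the associated variable of each detected outlier. This will let me check manually whether it really is an artifact that should be removed. Then I would like to create a function that removes these outliers.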

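I then want to apply the same exclusion criteria to the time series data (so if we establish that participant A is excluded for variable A in dataset 1, I also want to exclude that participant for that variable in dataset 2). For the time series data, I would like to evaluate the median absolute deviation on the mean of the values between 5 and 9 seconds (so that it is complementary to the MeanConcentration dataset). Note that the MAD must also be grouped by chromophore, condition, and ROI.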

MeanConcentration <- data.frame(ID = c(1,2,3,4,5),
  Happy_HbO_LeftParietal_Value = c(0.239005609756098, -0.812496292682927, -1.03227064146341, 0.469810975609756, -0.456419951219512),
  Happy_HbO_RightParietal_Value = c(-1.97862195121951, -0.0803191658536585, -0.0456078780487805, -0.29708887804878, 0.109126317073171),
  Happy_HbO_LeftSTC_Value = c(5.66059024390244, -2.49184243902439, -0.876321414634146, 1.44561070731707, 0.0991754146341463),
  Happy_HbO_RightSTC_Value = c(0.0138107658536585, 0.829429967804878, -1.06818609756098, 0.636765365853659, -0.609962195121951),
  Happy_HbO_LeftDLPFC_Value = c(2.30691146341463, 0.749746341463415, 2.60103658536585, 0.870573414634146, -1.73371634146341))

## TimeSeries data frame example ##
  ID     time Condition Chromophore ROI            Value
  <chr> <dbl> <fct>     <fct>       <fct>          <dbl>
1 1        -2 Happy     HbO         LeftParietal   0.848
2 1        -2 Happy     HHb         LeftParietal  -0.243
3 1        -2 Happy     HbO         RightParietal  3.80
4 1        -2 Happy     HHb         RightParietal -0.289
5 1        -2 Happy     HbO         LeftSTC        2.15
6 1        -2 Happy     HHb         LeftSTC       -1.26

Answer 1

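I am not sure I fully understand your problem, but this should come pretty close to what you are looking for (just comment on what is wrong or missing and I will update the answer accordingly):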

library(dplyr)
library(tidyr)
library(data.table) # to read in plain text as table

MeanConcentration <- data.frame(ID = c(1,2,3,4,5), 
                                Happy_HbO_LeftParietal_Value = c(0.239005609756098, -0.812496292682927, -1.03227064146341, 0.469810975609756, -0.456419951219512),
                                Happy_HbO_RightParietal_Value = c(-1.97862195121951, -0.0803191658536585, -0.0456078780487805, -0.29708887804878, 0.109126317073171), 
                                Happy_HbO_LeftSTC_Value = c(5.66059024390244, -2.49184243902439, -0.876321414634146, 1.44561070731707, 0.0991754146341463), 
                                Happy_HbO_RightSTC_Value = c(0.0138107658536585, 0.829429967804878, -1.06818609756098, 0.636765365853659, -0.609962195121951),
                                Happy_HbO_LeftDLPFC_Value = c(2.30691146341463, 0.749746341463415, 2.60103658536585, 0.870573414634146, -1.73371634146341))


TS <- data.table::fread("ID       time Condition Chromophore ROI            Value
 1      -2 Happy     HbO         LeftParietal   0.848
 1      -2 Happy     HHb         LeftParietal  -0.243
 1      -2 Happy     HbO         RightParietal  3.80 
 1      -2 Happy     HHb         RightParietal -0.289
 1      -2 Happy     HbO         LeftSTC        2.15 
 1      -2 Happy     HHb         LeftSTC       -1.26")

OUTLIERS <- MeanConcentration %>% 
  # reshape to long format so the column names become a variable and all values sit in one column
  tidyr::pivot_longer(cols = -ID, names_to = "variable", values_to = "values") %>% 
  # split the variable name into its parts; the trailing "Value" piece is not named and is dropped (extra = "drop" silences the warning about it)
  tidyr::separate(variable, c("Condition", "Chromophore", "ROI"), sep = "_", extra = "drop") %>% 
  # group by the 3 parts of the variable (same as grouping per original variable)
  dplyr::group_by(Condition, Chromophore, ROI) %>% 
  # add the group median and MAD (mad() scales by 1.4826 by default), then flag values outside median +- 3 MAD as outliers
  dplyr::mutate(MEDIAN = median(values, na.rm = TRUE),
                MAD = mad(values, na.rm = TRUE),
                OUTLIER = ifelse(values > MEDIAN + 3 * MAD | values < MEDIAN - 3 * MAD, "YES", "NO")) %>% 
  # ungroup (not necessary but recommended)
  dplyr::ungroup() %>% 
  # keep only the outliers
  dplyr::filter(OUTLIER == "YES")  

# print the outliers for inspection
OUTLIERS

     ID Condition Chromophore ROI           values  MEDIAN   MAD OUTLIER
  <dbl> <chr>     <chr>       <chr>          <dbl>   <dbl> <dbl> <chr>  
1     1 Happy     HbO         RightParietal  -1.98 -0.0803 0.281 YES 

# remove outliers by ID plus the 3 grouping columns, so only the flagged participant is dropped for that variable
# (if ID is character in one table and numeric in the other, as in the question's tibble, convert one side first)
TS %>% 
  dplyr::anti_join(OUTLIERS, by = c("ID", "Condition", "Chromophore", "ROI"))

   ID time Condition Chromophore           ROI  Value
1:  1   -2     Happy         HbO  LeftParietal  0.848
2:  1   -2     Happy         HHb  LeftParietal -0.243
3:  1   -2     Happy         HHb RightParietal -0.289
4:  1   -2     Happy         HbO       LeftSTC  2.150
5:  1   -2     Happy         HHb       LeftSTC -1.260
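
The question also asks for the MAD check on the time series itself, based on each participant's mean between 5 and 9 seconds, and for a reusable function. The following is only a rough sketch of how that could look, not a tested solution for the real data. The helper name `flag_mad_outliers`, the objects `ts_toy`, `window_means`, `ts_outliers` and `ts_clean`, and all values in `ts_toy` are made up for illustration; it also assumes `time` is in seconds and that the window is inclusive at both ends.

```r
library(dplyr)

# Hypothetical helper (not part of any package): flags rows whose value lies
# more than k MADs from the median of its group. Grouping columns go in `...`.
flag_mad_outliers <- function(data, value_col, ..., k = 3) {
  data %>%
    dplyr::group_by(...) %>%
    dplyr::mutate(
      MEDIAN  = median({{ value_col }}, na.rm = TRUE),
      MAD     = mad({{ value_col }}, na.rm = TRUE),  # default constant 1.4826
      OUTLIER = abs({{ value_col }} - MEDIAN) > k * MAD
    ) %>%
    dplyr::ungroup()
}

# Invented toy data: 5 participants, times 4..10, one Condition/Chromophore/ROI.
# Participant "5" is deliberately extreme so that something gets flagged.
ts_toy <- expand.grid(
  ID = as.character(1:5),
  time = 4:10,
  Condition = "Happy", Chromophore = "HbO", ROI = "LeftParietal",
  stringsAsFactors = FALSE
)
ts_toy$Value <- c(0.1, 0.2, 0.3, 0.4, 5.0)[as.integer(ts_toy$ID)]

# Step 1: per participant and variable, average the values in the 5-9 s window
window_means <- ts_toy %>%
  dplyr::filter(time >= 5, time <= 9) %>%
  dplyr::group_by(ID, Condition, Chromophore, ROI) %>%
  dplyr::summarise(MeanValue = mean(Value, na.rm = TRUE), .groups = "drop")

# Step 2: flag outliers within each Condition x Chromophore x ROI group
ts_outliers <- window_means %>%
  flag_mad_outliers(MeanValue, Condition, Chromophore, ROI) %>%
  dplyr::filter(OUTLIER)

# Step 3: drop the whole time course of each flagged ID/variable combination
ts_clean <- dplyr::anti_join(
  ts_toy, ts_outliers,
  by = c("ID", "Condition", "Chromophore", "ROI")
)
```

Inspect `ts_outliers` first to decide manually whether a flagged participant is really an artifact. To carry over the exclusions found in MeanConcentration as well, the same `anti_join` can be run a second time against that outlier table.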

r

time-series

outliers