Identifying the IDs of outliers and removing outliers based on MAD, for both the summarized dataset and the averages of a time series
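I hope you are well.
I have two complementary datasets: one holds a time series of concentration-change values (Timeseries), and the other holds the averages of that time series (MeanConcentration).
I would like to identify outliers in the MeanConcentration dataset as values more than 3 median absolute deviations (MAD) from the median of each variable. First, I want to find out the ID and the associated variable of every outlier that is detected. That lets me check manually whether it really is an artifact and should be removed. Then I would like to create a function that removes these outliers.
After that I want to apply the same exclusion criteria to the time series data (so if participant A is excluded for variable A in dataset 1, I also want to exclude them in dataset 2). For the time series data, I want to evaluate the median absolute deviation on the average of the values from 5 to 9 seconds (to keep it complementary to the mean concentration dataset). Note that the MAD must also be grouped by Chromophore, Condition, and ROI.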
MeanConcentration <- data.frame(ID = c(1,2,3,4,5),
  Happy_HbO_LeftParietal_Value = c(0.239005609756098, -0.812496292682927, -1.03227064146341, 0.469810975609756, -0.456419951219512),
  Happy_HbO_RightParietal_Value = c(-1.97862195121951, -0.0803191658536585, -0.0456078780487805, -0.29708887804878, 0.109126317073171),
  Happy_HbO_LeftSTC_Value = c(5.66059024390244, -2.49184243902439, -0.876321414634146, 1.44561070731707, 0.0991754146341463),
  Happy_HbO_RightSTC_Value = c(0.0138107658536585, 0.829429967804878, -1.06818609756098, 0.636765365853659, -0.609962195121951),
  Happy_HbO_LeftDLPFC_Value = c(2.30691146341463, 0.749746341463415, 2.60103658536585, 0.870573414634146, -1.73371634146341))
##TimeSeries Data Frame Example ##
ID time Condition Chromophore ROI Value
<chr> <dbl> <fct> <fct> <fct> <dbl>
1 1 -2 Happy HbO LeftParietal 0.848
2 1 -2 Happy HHb LeftParietal -0.243
3 1 -2 Happy HbO RightParietal 3.80
4 1 -2 Happy HHb RightParietal -0.289
5 1 -2 Happy HbO LeftSTC 2.15
6 1 -2 Happy HHb LeftSTC -1.26
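I am not sure I fully understand your question, but this should be pretty close to what you are looking for (just comment on what is wrong or missing and I will update the answer accordingly):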
library(dplyr)
library(tidyr)
library(data.table) # to read in plain text as table
MeanConcentration <- data.frame(ID = c(1,2,3,4,5),
Happy_HbO_LeftParietal_Value = c(0.239005609756098, -0.812496292682927, -1.03227064146341, 0.469810975609756, -0.456419951219512),
Happy_HbO_RightParietal_Value = c(-1.97862195121951, -0.0803191658536585, -0.0456078780487805, -0.29708887804878, 0.109126317073171),
Happy_HbO_LeftSTC_Value = c(5.66059024390244, -2.49184243902439, -0.876321414634146, 1.44561070731707, 0.0991754146341463),
Happy_HbO_RightSTC_Value = c(0.0138107658536585, 0.829429967804878, -1.06818609756098, 0.636765365853659, -0.609962195121951),
Happy_HbO_LeftDLPFC_Value = c(2.30691146341463, 0.749746341463415, 2.60103658536585, 0.870573414634146, -1.73371634146341 ))
TS <- data.table::fread("ID time Condition Chromophore ROI Value
1 -2 Happy HbO LeftParietal 0.848
1 -2 Happy HHb LeftParietal -0.243
1 -2 Happy HbO RightParietal 3.80
1 -2 Happy HHb RightParietal -0.289
1 -2 Happy HbO LeftSTC 2.15
1 -2 Happy HHb LeftSTC -1.26")
OULIERS <- MeanConcentration %>%
  # reshape to long format: the column names become a "variable" column and all values go into one column
  tidyr::pivot_longer(cols = -ID, names_to = "variable", values_to = "values") %>%
  # split the variable names into their parts (this gives a warning because the 4th piece, the word "Value", is not named in the new columns and therefore gets dropped)
  tidyr::separate(variable, c("Condition", "Chromophore", "ROI"), sep = "_") %>%
  # group by the 3 parts of the variable (same as grouping per variable without splitting)
  dplyr::group_by(Condition, Chromophore, ROI) %>%
  # add columns for the median and the MAD, then flag a value as an outlier if it lies outside median +- 3 * MAD
  dplyr::mutate(MEDIAN = median(values, na.rm = TRUE),
                MAD = mad(values, na.rm = TRUE),
                OUTLIER = ifelse(values > MEDIAN + 3 * MAD | values < MEDIAN - 3 * MAD, "YES", "NO")) %>%
  # ungroup (not necessary but recommended)
  dplyr::ungroup() %>%
  # keep only the outliers
  dplyr::filter(OUTLIER == "YES")
# print the outliers for inspection
OULIERS
ID Condition Chromophore ROI values MEDIAN MAD OUTLIER
<dbl> <chr> <chr> <chr> <dbl> <dbl> <dbl> <chr>
1 1 Happy HbO RightParietal -1.98 -0.0803 0.281 YES
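If you want to double-check a flagged row by hand, the numbers can be reproduced in base R. This is only a small sketch using the Happy_HbO_RightParietal_Value column from the question; the variable names (`x`, `med`, `raw_mad`, and so on) are mine. Keep in mind that `mad()` in R multiplies the raw median absolute deviation by 1.4826 by default, so "3 MAD" here means 3 times the scaled value.

```r
# Values of Happy_HbO_RightParietal_Value for IDs 1 to 5 (copied from the question)
x <- c(-1.97862195121951, -0.0803191658536585, -0.0456078780487805,
       -0.29708887804878, 0.109126317073171)

med        <- median(x)               # centre of the group
raw_mad    <- median(abs(x - med))    # unscaled median absolute deviation
scaled_mad <- 1.4826 * raw_mad        # same value that stats::mad(x) returns by default

# Cut-offs at median +- 3 * MAD, as in the pipeline
lower <- med - 3 * scaled_mad
upper <- med + 3 * scaled_mad
is_outlier <- x < lower | x > upper

round(c(median = med, MAD = scaled_mad), 3)
is_outlier                            # TRUE only for the first value (ID 1)
```

If you would rather use the unscaled deviation, `mad(values, constant = 1)` in the pipeline gives that, but the thresholds will then be tighter.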
# remove outliers by combo of the 3 columns (possibly you want to include ID here?)
TS %>%
dplyr::anti_join(OULIERS, by = c("Condition", "Chromophore", "ROI"))
ID time Condition Chromophore ROI Value
1: 1 -2 Happy HbO LeftParietal 0.848
2: 1 -2 Happy HHb LeftParietal -0.243
3: 1 -2 Happy HHb RightParietal -0.289
4: 1 -2 Happy HbO LeftSTC 2.150
5: 1 -2 Happy HHb LeftSTC -1.260
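Note that the join above drops every row of a flagged Condition/Chromophore/ROI combination, whatever the participant. If the exclusion should be per participant, and the time series should be judged on its 5 to 9 second average, something along these lines might work. It is a rough sketch in base R only, so it runs without extra packages. The function names `flag_outliers` and `remove_outliers`, the toy data `ts_toy`, and the assumption that the 5 to 9 s window is inclusive on both ends are all mine, not from the question.

```r
# Made-up toy time series: 5 participants, one Condition/Chromophore/ROI,
# time points 4 to 10 s. Participant 1 is deliberately extreme.
ts_toy <- expand.grid(ID = 1:5, time = 4:10,
                      Condition = "Happy", Chromophore = "HbO",
                      ROI = "RightParietal", stringsAsFactors = FALSE)
base_level <- c(-2.0, -0.1, 0.0, -0.3, 0.1)      # invented per-participant levels
ts_toy$Value <- base_level[ts_toy$ID] + 0.01 * ts_toy$time

# Flag participants whose mean over [from, to] seconds lies more than
# k scaled MADs from the median of their Condition/Chromophore/ROI group.
flag_outliers <- function(ts, from = 5, to = 9, k = 3) {
  win <- ts[ts$time >= from & ts$time <= to, ]
  # one mean per participant and variable inside the window
  means <- aggregate(Value ~ ID + Condition + Chromophore + ROI,
                     data = win, FUN = mean)
  grp <- interaction(means$Condition, means$Chromophore, means$ROI, drop = TRUE)
  med <- ave(means$Value, grp, FUN = median)     # group median, repeated per row
  md  <- ave(means$Value, grp, FUN = mad)        # group MAD (scaled by 1.4826)
  means[abs(means$Value - med) > k * md,
        c("ID", "Condition", "Chromophore", "ROI")]
}

# Drop all time points of the flagged ID/variable combinations.
remove_outliers <- function(ts, flagged) {
  key <- function(d) paste(d$ID, d$Condition, d$Chromophore, d$ROI, sep = "|")
  ts[!(key(ts) %in% key(flagged)), ]
}

flagged  <- flag_outliers(ts_toy)
ts_clean <- remove_outliers(ts_toy, flagged)
flagged
```

`remove_outliers()` only needs a table with the columns ID, Condition, Chromophore and ROI, so after your manual inspection you should also be able to pass it the outliers found in MeanConcentration (the `OULIERS` table above) instead of `flagged`.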