How to replace a column in R with a modified column, dependent on filtered values? (Removing outliers in panel data)
I have a panel data set like this:
year | id | treatment_year | time_to_treatment | outcome |
---|---|---|---|---|
2000 | 1 | 2011 | -11 | 2 |
2002 | 1 | 2011 | -10 | 3 |
2004 | 2 | 2015 | -9 | 22 |
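And so on. I am trying to deal with outliers by winsorizing. The final goal is to make a scatter plot with time_to_treatment on the X axis and outcome on the Y axis.
I would like to replace the outcome for each time_to_treatment with its winsorized outcome, i.e. replace all extreme values with the 5% and 95% quantile values.
This is what I have tried so far, but it does not work: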
    for(i in range(dataset$time_to_treatment)){
      dplyr::filter(dataset, time_to_treatment == i)$outcome <- DescTools::Winsorize(dplyr::filter(dataset,time_to_treatment==i)$outcome)
    }
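I get the error:
    Error in filter(dataset, time_to_treatment == i) <- *vtmp* : could not find function "filter<-"
Can anyone suggest a better approach? Thanks.
My actual data, where conflicts = outcome, commission = treatment year, and CD_MUN = id. The relevant time-period indicator is time_to_t.
Groups: year, CD_MUN, type [6]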
type | CD_MUN | year | time_to_t | conflicts | commission |
---|---|---|---|---|---|
chr | dbl | dbl | dbl | int | dbl |
manif | 1100023 | 2000 | -11 | 1 | 2011 |
manif | 1100189 | 2000 | -3 | 2 | 2003 |
manif | 1100205 | 2000 | -9 | 5 | 2009 |
manif | 1500602 | 2000 | -4 | 1 | 2004 |
manif | 3111002 | 2000 | -11 | 2 | 2011 |
manif | 3147006 | 2000 | -10 | 1 | 2010 |
To start, you could use this:
    # The data
    set.seed(123)
    df <- data.frame(
      time_to_treatment = seq(-15, 0, 1),
      outcome = sample(1:30, 16, replace = TRUE)
    )
    # A solution without Winsorize, based solely on dplyr
    library(dplyr)
    df %>%
      mutate(outcome05 = quantile(outcome, probs = 0.05), # 5% quantile
             outcome95 = quantile(outcome, probs = 0.95), # 95% quantile
             outcome = ifelse(outcome <= outcome05, outcome05, outcome), # raise low values to the 5% quantile
             outcome = ifelse(outcome >= outcome95, outcome95, outcome)) %>% # cap high values at the 95% quantile
      select(-c(outcome05, outcome95))
You can adapt it to your specific problem.
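As a minimal sketch of one such adaptation, the same quantile clamping can be applied within each time_to_treatment group by adding group_by() before mutate(). The toy data frame `panel` and the column `outcome_w` are made up for illustration, and pmin()/pmax() stand in for the two ifelse() calls.

```r
library(dplyr)

# Hypothetical toy panel: 50 outcomes for each time_to_treatment value
set.seed(123)
panel <- data.frame(
  time_to_treatment = rep(-3:0, each = 50),
  outcome = sample(1:100, 200, replace = TRUE)
)

panel_w <- panel %>%
  group_by(time_to_treatment) %>%               # quantiles are computed per period
  mutate(
    lo = quantile(outcome, probs = 0.05),       # group-wise 5% quantile
    hi = quantile(outcome, probs = 0.95),       # group-wise 95% quantile
    outcome_w = pmin(pmax(outcome, lo), hi)     # clamp values into [lo, hi]
  ) %>%
  ungroup() %>%
  select(-lo, -hi)

head(panel_w)
```

Writing the result to a new column keeps the raw outcome available, which can be handy when comparing the scatter plot before and after winsorizing.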
Assuming that "time period" refers to the commission column, you can use ave.
    transform(dat, conflicts_w = ave(conflicts, commission, FUN = DescTools::Winsorize))
    #    type  CD_MUN year time_to_t conflicts commission conflicts_w
    # 1 manif 1100023 2000       -11         1       2011        1.05
    # 2 manif 1100189 2000        -3         2       2003        2.00
    # 3 manif 1100205 2000        -9         5       2009        5.00
    # 4 manif 1500602 2000        -4         1       2004        1.00
    # 5 manif 3111002 2000       -11         2       2011        1.95
    # 6 manif 3147006 2000       -10         1       2010        1.00
Data:
    dat <- structure(list(type = c("manif", "manif", "manif", "manif", "manif",
    "manif"), CD_MUN = c(1100023L, 1100189L, 1100205L, 1500602L,
    3111002L, 3147006L), year = c(2000L, 2000L, 2000L, 2000L, 2000L,
    2000L), time_to_t = c(-11L, -3L, -9L, -4L, -11L, -10L), conflicts = c(1L,
    2L, 5L, 1L, 2L, 1L), commission = c(2011L, 2003L, 2009L, 2004L,
    2011L, 2010L)), class = "data.frame", row.names = c(NA, -6L))
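The question states that the relevant period indicator is time_to_t, so a variant grouped on that column may be closer to what is needed. The sketch below is one way to do this in base R without DescTools. Here winsorize_q is a hypothetical helper and dat_small is a reduced copy of the data above, both introduced only for illustration.

```r
# Hypothetical helper: clamp x to its own 5% and 95% quantiles
winsorize_q <- function(x, probs = c(0.05, 0.95)) {
  q <- quantile(x, probs = probs, na.rm = TRUE, names = FALSE)
  pmin(pmax(x, q[1]), q[2])
}

# Reduced copy of the sample data (only the two columns used here)
dat_small <- data.frame(
  time_to_t = c(-11, -3, -9, -4, -11, -10),
  conflicts = c(1, 2, 5, 1, 2, 1)
)

# ave() applies the helper within each time_to_t group and
# returns the results in the original row order
dat_small$conflicts_w <- ave(dat_small$conflicts, dat_small$time_to_t,
                             FUN = winsorize_q)
dat_small$conflicts_w
# [1] 1.05 2.00 5.00 1.00 1.95 1.00
```

In this sample every row has year 2000, so grouping on time_to_t gives the same groups as grouping on commission. With several years in the data the two groupings would differ.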